Lijie at ImageCLEFmed VQA-Med 2021: Attention
Model-based Efficient Interaction between
Multimodality
Jie Li1 , Shengyan Liu2
1
    School of Information Science and Engineering, Yunnan University, Kunming 650091, P.R.China
2
    CSIC 750 Proving Ground, Yunnan Province, Kunming 650216, P.R. China


Abstract
In this paper, we describe our submission to the visual question answering task in the medical domain (VQA-Med) at the ImageCLEF 2021 challenge. For semantic feature extraction from the question text, we use a method more effective than BERT: the questions are processed with a BioBERT model pre-trained on biomedical corpora. The image and text features are then fused and made to interact across modalities through MFH (Multi-modal Factorized High-order pooling), which is more efficient than MFB (Multi-modal Factorized Bilinear pooling), together with co-attention, and we concatenate the image features weighted by question attention. Finally, the text features obtained after the multimodal interaction are mapped into the image feature vector space for a second fusion, and the result of the two fusions is passed through a fully connected layer and a Softmax layer to produce the answer. In the ImageCLEF 2021 task, our model achieves an overall accuracy of 0.316 and a BLEU score of 0.352, ranking sixth among the participating teams.

Keywords
Multi-modal Factorized High-order Pooling, BioBERT, Co-attention, Visual Question Answering




1. Introduction
In recent years, artificial intelligence (AI) [1] has matured rapidly, especially computer vision (CV) and natural language processing (NLP), so that tasks once considered too difficult have been revisited; AI has collided with all walks of life, produced striking results, and gradually entered our daily lives. With advances in deep learning [2] algorithms and in the computing power available for big data, a medical revolution driven by artificial intelligence has quietly arrived. VQA-Med (Visual Question Answering in the Medical Domain) is one of its most attractive tasks. For many diseases, viewing and analyzing medical images (CT, MRI, ultrasound) undoubtedly lets the doctor assess the patient's condition more clearly and intuitively than asking about the patient's feelings. The same is true for an intelligent diagnosis and treatment system: if the questions raised by the patient can be combined with the medical images the patient provides, the system can answer the questions the patient wants to know more accurately, and even answer some more complex medical questions.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Email: 782097233@qq.com (S. Liu)
ORCID: 0000-0002-6590-6693 (J. Li); 0000-0003-4750-5033 (S. Liu)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   For clinicians, a medical visual question answering system can strengthen their confidence in diagnosing the patient's condition. For patients, it can save a great deal of time and money: instead of searching for unverified information on the Internet, patients can learn more accurately what they want to know about their condition.
   The concept of Visual Question Answering (VQA) first appeared in 2014, and VQA then developed gradually. The 2018 ImageCLEF campaign first proposed the VQA-Med task [3], and the ImageCLEF VQA-Med task has remained open to research teams every year since, providing corresponding data sets and attracting a large number of researchers. In 2018, the task was mainly to answer questions about abnormalities in medical images. Few groups participated at the time, only 5, and the methods were relatively simple. In addition, the 2018 VQA-Med questions were automatically generated from the image captions and then manually checked by human annotators; the questions and ground-truth answers were variable-length and free-form, which increased the difficulty of answer generation. In VQA-Med 2019, the classification task was made clearer: only radiology images were used, and questions covered four aspects: image modality, imaging plane, organ system, and abnormalities that can be detected in the image. In the 2020 task, the data set contained only yes/no questions and questions about which abnormality is present, and this setting continues unchanged in this year's 2021 VQA-Med competition.
   For the VQA-Med task at ImageCLEF 2021 [4], our model is a modification of the combined MFB and co-attention structure proposed by Yu et al. [5] at ICCV 2017 for general VQA tasks. The specific steps are as follows: 1. For question text, features are extracted with a BioBERT model pre-trained on medical corpora. 2. For image processing, we use VGG8 [6] rather than a network as complex as ResNet152, avoiding problems such as an excessive number of training parameters, slow training, and overfitting. 3. MFH is used for efficient fusion, and a co-attention mechanism is introduced during fusion to improve the effect.
   The rest of this paper is organized as follows. Section 2 briefly reviews the literature on VQA and VQA-Med. Section 3 introduces the VQA-Med task and analyzes its data set in detail. Section 4 describes the specific methods and principles we used. Section 5 presents the model and the experimental procedure. Section 6 reports the results we submitted. Finally, Section 7 summarizes the paper and discusses future work.


2. Related Work
For VQA in the general domain, Malinowski et al. [7] first proposed the concept of "open-world" visual question answering in 2014 and designed a Bayesian framework that combines semantic scene segmentation of the image with symbolic reasoning over the text to answer natural-language questions automatically. Kafle et al. [8] also used a Bayesian model framework, transforming open questions into multi-class classification problems, for example converting the question "What color is this cat?" into a color-recognition problem. In fact, most early approaches use a CNN-RNN framework to process image and text features separately: almost all images are encoded with a CNN and the text with an RNN, and the fusion between modalities then relies on factorized bilinear pooling, such as MCB (Multimodal Compact Bilinear pooling) [9] and MLB (Multimodal Low-rank Bilinear pooling) [10], or the more advanced MFB (Multimodal Factorized Bilinear pooling) and MFH (Multi-modal Factorized High-order pooling) [11]. There are, of course, other fusion methods, such as those that map both modalities into the same vector space. Subsequent developments have introduced newer models and tuned the various algorithms.
   For medical VQA, development has been much slower, mainly because data sets in the medical field are much harder to obtain than in the general domain: they require annotation by medical professionals and a great deal of time to curate carefully, for example checking the sharpness of the images. Because of this scarcity of data sets, the VQA task has not developed as rapidly in the medical field. The ImageCLEF campaign is one of the few that provide medical VQA data sets. In 2018, Peng et al. [12] proposed a model based on a co-attention mechanism and MFB feature fusion, and their results took first place in the ImageCLEF 2018 VQA-Med task. Zhou et al. [13] proposed a model based on Inception-ResNet-v2 [14] and Bi-LSTM [15] and won second place, and Ben Abacha et al. [16] proposed a stacked attention network (SAN, two stacked attention layers) that ranked third. In the second year of the ImageCLEF VQA-Med task, the Zhejiang University team [17] proposed a combination of BERT [18] and MFB: their model extracts image features from an intermediate layer of VGG16 pre-trained on ImageNet, uses BERT to embed the question text, and then fuses the features with MFB; this approach won first place in ImageCLEF 2019 VQA-Med [19]. In the ImageCLEF 2020 VQA-Med task [20], it can be seen from the literature that researchers made further innovations over the traditional VQA deep models: the University of Adelaide team [21] proposed Skeleton-based Sentence Mapping (SSM) combined with a knowledge-reasoning model and won first place, combining knowledge reasoning with VQA-Med for the first time. In short, medical VQA has developed on the basis of general-domain VQA and borrows the methods used there, but the scarcity of and differences between data sets limit many of these methods.
   For the 2021 ImageCLEF VQA-Med competition, drawing on the methods used in previous competitions, we also made improvements and innovations. In data processing, image enhancement methods such as ZCA whitening [22] are introduced to make the extracted features richer and obtain better outputs.


3. Task and Dataset
Compared with ImageCLEF VQA-Med 2020, the data set is not much different. Last year's data set consisted of 4,000 medical images with 4,000 question-answer (QA) pairs in the training set, 500 medical images with 500 QA pairs in the validation set, and 500 questions with 500 medical images in the test set. This year's training set contains 4,000 radiological images with associated question-answer (QA) pairs, the validation set contains 500 radiological images with associated QA pairs, and the test set contains 500 radiological images with associated questions. In addition, in 2019 the data set was classified into four categories (modality, plane, organ system, and abnormality).
   To improve the accuracy of the experiment, we also used last year's data set as extended training data: the four category labels were erased and the 2020 data set was added to this year's training set. In addition, before training on this year's data set, image enhancement techniques such as ZCA whitening were applied. Examples of the images and question-answer (QA) pairs in the ImageCLEF VQA-Med 2021 data set [23] are shown in Figure 1.

Figure 1: Three forms in the ImageCLEF VQA-Med 2021 data set


4. Methods
4.1. Image feature extraction
We first preprocessed the images, using a ZCA whitening image enhancement model and contrast-limited adaptive histogram equalization (CLAHE) to enhance the details of the medical images effectively. While subtly enhancing the contrast of the medical images, this also helps to suppress noise. A simplified VGG8 model, based on the VGG16 model pre-trained on the ImageNet data set [24], was then used to extract image features. Networks such as VGG16 or ResNet50 are very large and computationally expensive, and using such large networks for image feature extraction here is redundant and wasteful of resources. Our experiments also showed that, after the preprocessing described above, a small network achieves the same extraction quality as these large models while effectively avoiding overfitting and shortening the training time. We therefore reduced the 13 convolutional layers of VGG16 to 5 and reduced the number of nodes in each of the following 3 fully connected layers to 128.
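The following is a minimal sketch of this stage in PyTorch and OpenCV, not the exact configuration used in the experiments: a CLAHE enhancement step, a helper that builds a ZCA whitening matrix, and a trimmed VGG8-style encoder with 5 convolutional layers and 3 fully connected layers of 128 nodes. Channel widths, the CLAHE parameters, and the whitening epsilon are illustrative assumptions.

```python
# Sketch of the image pipeline: CLAHE enhancement, ZCA whitening matrix,
# and a trimmed "VGG8"-style encoder (assumed layer widths).
import cv2
import numpy as np
import torch
import torch.nn as nn


def clahe_enhance(gray: np.ndarray) -> np.ndarray:
    """Contrast-limited adaptive histogram equalization on an 8-bit grayscale image."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)


def zca_matrix(flat_images: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """ZCA whitening matrix from an (N, D) array of flattened, mean-centered images
    (in practice computed on downscaled images or patches to keep D small)."""
    cov = np.cov(flat_images, rowvar=False)
    u, s, _ = np.linalg.svd(cov)
    return u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T


class VGG8(nn.Module):
    """VGG16 trimmed to 5 conv layers; the 3 FC layers are reduced to 128 nodes."""

    def __init__(self, out_dim: int = 128):
        super().__init__()

        def block(c_in: int, c_out: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        # 5 convolutional layers (VGG16 has 13).
        self.features = nn.Sequential(
            block(3, 64), block(64, 128), block(128, 256),
            block(256, 256), block(256, 512),
        )
        # 3 fully connected layers, each reduced to 128 nodes.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((7, 7)), nn.Flatten(),
            nn.Linear(512 * 7 * 7, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, 224, 224)
        return self.classifier(self.features(x))          # (B, out_dim)
```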
4.2. Feature extraction and encoding of the question text
We used BioBERT [25], which performs better than BERT on biomedical text, to extract the semantic features of the question. Since BERT performed well in the ImageCLEF VQA-Med competitions of previous years, we continued to use a pre-trained model of this family for semantic feature extraction. BioBERT is pre-trained on biomedical text; its network structure is the same as BERT's, it inherits almost all of BERT's advantages, and its performance on various biomedical text mining tasks is much better than that of BERT and earlier state-of-the-art models. We only modify the last layer, averaging its outputs to represent the text features of the question sentence more effectively.
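A minimal sketch of this question encoding with the HuggingFace transformers library is shown below; the specific BioBERT checkpoint name and the masked mean-pooling over the last hidden layer are assumptions, since only the averaging of the last layer is fixed above.

```python
# Sketch of question encoding with BioBERT (assumed checkpoint name).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT weights
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)


def encode_question(question: str) -> torch.Tensor:
    """Return a single feature vector for the question (hidden size 768)."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state    # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    # Average the last-layer token embeddings over non-padding positions.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)


q_feat = encode_question("what is most alarming about this ct scan?")
```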

4.3. Feature fusion
Feature fusion, like image feature extraction and question text feature extraction, is a key factor in whether a VQA model performs well. To make the interaction between the different modalities more effective, we use MFH, which is more efficient than the earlier MFB: in the projection performed before the element-wise multiplication of the two modalities, MFH can map the features to a more suitable dimension for more effective fusion. At the same time, co-attention is introduced so that the question text features attend more closely to the relevant regions of the image, improving the effect. The question text features are used to attend to specific regions of the image, and two MFH fusions in total achieve more accurate regional feature extraction.
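The sketch below illustrates the MFB block and its MFH (cascaded, high-order) extension following Yu et al. [5, 11]; the factor k, the output dimension o, and the number of cascaded blocks are illustrative assumptions, and the dropout of the original formulation is omitted for brevity.

```python
# Sketch of MFB and its cascaded MFH extension (assumed dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MFB(nn.Module):
    """Multimodal Factorized Bilinear pooling block."""

    def __init__(self, img_dim: int, txt_dim: int, k: int = 5, o: int = 1000):
        super().__init__()
        self.k, self.o = k, o
        self.proj_img = nn.Linear(img_dim, k * o)
        self.proj_txt = nn.Linear(txt_dim, k * o)

    def forward(self, img, txt, exp_in=None):
        # Element-wise product in the expanded (k*o) space.
        exp = self.proj_img(img) * self.proj_txt(txt)        # (B, k*o)
        if exp_in is not None:                               # MFH cascading term
            exp = exp * exp_in
        z = exp.view(-1, self.o, self.k).sum(dim=2)          # sum pooling over k
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)  # power normalization
        z = F.normalize(z, dim=1)                            # L2 normalization
        return z, exp


class MFH(nn.Module):
    """High-order pooling: a cascade of MFB blocks whose outputs are concatenated."""

    def __init__(self, img_dim: int, txt_dim: int, blocks: int = 2, k: int = 5, o: int = 1000):
        super().__init__()
        self.blocks = nn.ModuleList([MFB(img_dim, txt_dim, k, o) for _ in range(blocks)])

    def forward(self, img, txt):
        outs, exp = [], None
        for blk in self.blocks:
            z, exp = blk(img, txt, exp)
            outs.append(z)
        return torch.cat(outs, dim=1)   # (B, blocks * o)
```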


5. Experiment
In the ImageCLEF 2021 VQA competition, the model we used is shown in Figure 2. To extract image features we use VGG8, a simplified VGG16 network: with limited resources, and given the number of images available after data enhancement, VGG8 is effective enough for extracting image features; compared with a large network such as ResNet50 the accuracy gap is not large, while the speed is greatly improved and overfitting is also prevented. The text features are extracted with the pre-trained BioBERT model. After the combined MFH and co-attention fusion is performed, the image features are weighted and then concatenated. The next step is to re-extract features by applying attention to the original image. Here we use 4 sets of attention values (glimpses): too many groups would ignore the relationships between the pieces of information in the image, while too few would fail to extract the important features of the image, so in the end we chose 4 groups as the best setting.
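The following is a minimal sketch of such question-guided image attention with 4 glimpses, assuming convolutional feature maps from VGG8 and a fused question-image vector from the first MFH stage; the hidden size and the use of 1x1 convolutions are assumptions.

```python
# Sketch of question-guided image attention with 4 glimpses (assumed sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageAttention(nn.Module):
    def __init__(self, img_channels: int, fused_dim: int, glimpses: int = 4, hidden: int = 512):
        super().__init__()
        self.glimpses = glimpses
        self.conv1 = nn.Conv2d(img_channels + fused_dim, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, glimpses, kernel_size=1)

    def forward(self, img_feat: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); fused: (B, D) question-conditioned vector.
        B, C, H, W = img_feat.shape
        tiled = fused.unsqueeze(-1).unsqueeze(-1).expand(B, fused.size(1), H, W)
        att = self.conv2(F.relu(self.conv1(torch.cat([img_feat, tiled], dim=1))))
        att = F.softmax(att.view(B, self.glimpses, H * W), dim=2)   # (B, 4, HW)
        flat = img_feat.view(B, C, H * W)                           # (B, C, HW)
        # One attended feature vector per glimpse, then concatenate the 4 glimpses.
        glimpse_feats = torch.einsum("bgn,bcn->bgc", att, flat)     # (B, 4, C)
        return glimpse_feats.reshape(B, self.glimpses * C)          # (B, 4*C)
```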
   After the image features and question text features have interacted well and the important features have been given greater weight, the two features are fused once more and output. Here we no longer use the MFH module for the fusion: because the previous 4 sets of attended features and the repeated training already make the interaction between the modalities sufficient, the image feature information is simply mapped into the text feature vector space for fusion, which effectively reduces the amount of computation and saves resources. Finally, the result is output through a fully connected (FC) layer and a Softmax layer.
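A minimal sketch of this second, lightweight fusion and the answer classifier is given below, assuming the attended image features are linearly mapped into the question feature space, concatenated with the question features, and classified over a fixed answer vocabulary; the hidden size is an illustrative assumption.

```python
# Sketch of the second (mapping-based) fusion and the answer classifier head.
import torch
import torch.nn as nn


class AnswerHead(nn.Module):
    def __init__(self, img_dim: int, txt_dim: int, n_answers: int, hidden: int = 1024):
        super().__init__()
        self.map_img = nn.Linear(img_dim, txt_dim)  # map attended image features to text space
        self.fc = nn.Sequential(
            nn.Linear(2 * txt_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.map_img(img_feat), txt_feat], dim=1)
        # Returns logits; Softmax (or BCE with logits) is applied in the loss.
        return self.fc(fused)
```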
Figure 2: The model we used in the ImageCLEFmed VQA-Med 2021 competition


  In the experiments, we use binary cross-entropy as the loss function, Adam as the optimizer, and a learning rate of 1e-5.
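A minimal sketch of this training configuration is shown below; the model, the data loader, and the multi-hot answer targets are assumed to be defined as in the previous sections.

```python
# Sketch of the training setup: binary cross-entropy, Adam, learning rate 1e-5.
import torch
from torch import nn
from torch.utils.data import DataLoader


def train(model: nn.Module, loader: DataLoader, epochs: int = 10) -> None:
    criterion = nn.BCEWithLogitsLoss()                           # binary cross-entropy on logits
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)    # Adam, lr = 1e-5
    for _ in range(epochs):
        for images, questions, targets in loader:
            logits = model(images, questions)                    # (B, n_answers), assumed signature
            loss = criterion(logits, targets.float())            # targets: multi-hot (B, n_answers)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```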


6. Results and Summary
6.1. Results
In the ImageCLEFmed VQA-Med 2021 competition, overall accuracy and BLEU are used as the evaluation metrics for ranking the final submissions: the former is the proportion of correct predictions, and the latter measures the similarity between the ground-truth answer and the predicted answer. Figure 3 shows the accuracy and loss curves during our training. Our final submitted results were 0.316 and 0.352 respectively, ranking 6th among the valid submissions. Figure 4 shows the leaderboard page of the ImageCLEF VQA-Med 2021 competition.
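For local validation, these two measures can be approximated as sketched below; the official ImageCLEF evaluation script applies additional answer normalization, so this is only an approximation.

```python
# Rough local approximation of overall accuracy and mean sentence BLEU.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def overall_accuracy(preds, refs):
    """Proportion of exactly matching (case-insensitive) answers."""
    return sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs)) / len(refs)


def mean_bleu(preds, refs):
    """Average sentence-level BLEU between predicted and ground-truth answers."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([r.lower().split()], p.lower().split(), smoothing_function=smooth)
        for p, r in zip(preds, refs)
    ]
    return sum(scores) / len(scores)
```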

6.2. Summary
Due to limitations in hardware resources and other conditions, the main idea of this experiment was to obtain the best possible results at the lowest resource cost, so the smallest feasible modified network, VGG8, is used to extract image features, with image enhancement as the corresponding remedy, and an appropriate number of attention groups is set in the subsequent co-attention mechanism to improve accuracy. After experimenting with the VGG8 model, we also tried larger networks such as VGG16 and ResNet50 to extract features. The accuracy did improve, but training took much longer and the memory overhead was much higher, so in the end we chose the more cost-effective small network, which is what we had originally set out to do. If only accuracy matters and there is ample time, training with a large network is the better choice. Table 1 compares the validation-set accuracy of the several models we tried.
Figure 3: The change curve of accuracy and loss during training


Table 1
Comparison of several models on the validation set
                               Model                      Accuracy on the validation set
              VGG16+BioBERT+Co-Attention+MFB                          0.66
            ResNet50+BioBERT+Co-Attention+MFH                         0.69
            VGG8+BioBERT+Co-Attention+MFH+ZCA                         0.62


7. Perspectives for Future Work
VQA technology mainly involves solving three problems: the extraction of image features, the extraction of question text features, and the effective fusion of the multimodal features. The effectiveness of these three components directly determines the quality of the results. In this experiment, for extracting and representing the text features we used BioBERT pre-training weights learned on biomedical corpora, used the VGG8 model for image feature extraction, and used the efficient MFH for fusion. Considering the limited resources, VGG8 with a small number of layers is used for image feature extraction, and in the second fusion a faster mapping method is used instead of the matrix-multiplication-based MFH and similar methods; in other words, we traded a small reduction in the final score for a large saving in resources and time. In addition, the image model is not pre-trained on a large medical data set, and the long training time led to an insufficient number of training runs, so the parameters and models were not tuned to their best, which means the result was not optimal.

Figure 4: 2021 VQA-Med leaderboard
   In addition, although we used the attention mechanism to align the question text with the corresponding regions of the image, there were no fine-grained region annotations to refer to, so the result is inevitably imperfect. In future work, we plan to use VisualBert [26], ImageBert [27], and other Transformer-based models to achieve better performance, to transfer from data sets annotated with position coordinates, and to introduce object detection methods to make the alignment between text and image more accurate and the effect better.


References
 [1] A. Saffiotti, An ai view of the treatment of uncertainty, The Knowledge Engineering
     Review 2 (1987) 75–97.
 [2] I. Goodfellow, Y. Bengio, A. Courville, Y. Bengio, Deep learning, volume 1, MIT press
     Cambridge, 2016.
 [3] S. A. Hasan, Y. Ling, O. Farri, J. Liu, H. Müller, M. P. Lungren, Overview of imageclef 2018
     medical domain visual question answering task., in: CLEF (Working Notes), 2018.
 [4] B. Ionescu, H. Müller, R. Peteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A.
     Hasan, V. Kovalev, S. Kozlovski, V. Liauchuk, Y. Dicente, O. Pelka, A. G. S. de Herrera,
     J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D.
     Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid,
     A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval
     in medical, nature, internet and social media applications, in: Experimental IR Meets
     Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International
     Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer
     Science, Springer, Bucharest, Romania, 2021.
 [5] Z. Yu, J. Yu, J. Fan, D. Tao, Multi-modal factorized bilinear pooling with co-attention learn-
     ing for visual question answering, in: Proceedings of the IEEE international conference
     on computer vision, 2017, pp. 1821–1830.
 [6] V. Liauchuk, Imageclef 2019: Projection-based ct image analysis for tb severity scoring
     and ct report generation., in: CLEF (Working Notes), 2019.
 [7] M. Malinowski, M. Fritz, A multi-world approach to question answering about real-world
     scenes based on uncertain input, arXiv preprint arXiv:1410.0210 (2014).
 [8] K. Kafle, C. Kanan, Answer-type prediction for visual question answering, in: Proceedings
     of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4976–4984.
 [9] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact
     bilinear pooling for visual question answering and visual grounding, arXiv preprint
     arXiv:1606.01847 (2016).
[10] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, B.-T. Zhang, Hadamard product for low-rank
     bilinear pooling, arXiv preprint arXiv:1610.04325 (2016).
[11] Z. Yu, J. Yu, C. Xiang, J. Fan, D. Tao, Beyond bilinear: Generalized multimodal factorized
     high-order pooling for visual question answering, IEEE transactions on neural networks
     and learning systems 29 (2018) 5947–5959.
[12] Y. Peng, F. Liu, M. P. Rosen, Umass at imageclef medical visual question answering
     (med-vqa) 2018 task., in: CLEF (Working Notes), 2018.
[13] Y. Zhou, X. Kang, F. Ren, Employing inception-resnet-v2 and bi-lstm for medical domain
     visual question answering., in: CLEF (Working Notes), 2018.
[14] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, inception-resnet and the impact
     of residual connections on learning, in: Proceedings of the AAAI Conference on Artificial
     Intelligence, volume 31, 2017.
[15] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE transactions on
     Signal Processing 45 (1997) 2673–2681.
[16] A. B. Abacha, S. Gayen, J. J. Lau, S. Rajaraman, D. Demner-Fushman, Nlm at imageclef
     2018 visual question answering in the medical domain., in: CLEF (Working Notes), 2018.
[17] X. Yan, L. Li, C. Xie, J. Xiao, L. Gu, Zhejiang university at imageclef 2019 visual question
     answering in the medical domain., in: CLEF (Working Notes), 2019.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[19] A. B. Abacha, S. A. Hasan, V. V. Datla, J. Liu, D. Demner-Fushman, H. Müller, Vqa-med:
     Overview of the medical visual question answering task at imageclef 2019., in: CLEF
     (Working Notes), 2019.
[20] A. B. Abacha, V. V. Datla, S. A. Hasan, D. Demner-Fushman, H. Müller, Overview of the
     vqa-med task at imageclef 2020: Visual question answering and generation in the medical
     domain, CLEF 2020 Working Notes (2020) 22–25.
[21] Z. Liao, Q. Wu, C. Shen, A. van den Hengel, J. Verjans, Aiml at vqa-med 2020: Knowledge
     inference via a skeleton-based sentence mapping approach for medical domain visual
     question answering, CLEF, 2020.
[22] H. K. Verma, S. Sindhu Ramachandran, Harendrakv at vqa-med 2020: Sequential vqa with
     attention for medical visual question answering (2020).
[23] A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, H. Müller, Overview of the
     vqa-med task at imageclef 2021: Visual question answering and generation in the medical
     domain, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org,
     Bucharest, Romania, 2021.
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical
     image database, in: 2009 IEEE conference on computer vision and pattern recognition,
     IEEE, 2009, pp. 248–255.
[25] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical
     language representation model for biomedical text mining, Bioinformatics 36 (2020)
     1234–1240.
[26] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, Visualbert: A simple and performant
     baseline for vision and language, arXiv preprint arXiv:1908.03557 (2019).
[27] D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti, Imagebert: Cross-modal pre-training
     with large-scale weak-supervised image-text data, arXiv preprint arXiv:2001.07966 (2020).