=Paper=
{{Paper
|id=Vol-2696/paper_74
|storemode=property
|title=HCP-MIC at VQA-Med 2020: Effective Visual Representation for Medical Visual Question Answering
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_74.pdf
|volume=Vol-2696
|authors=Guanqi Chen,Haifan Gong,Guanbin Li
|dblpUrl=https://dblp.org/rec/conf/clef/ChenGL20
}}
==HCP-MIC at VQA-Med 2020: Effective Visual Representation for Medical Visual Question Answering==
Guanqi Chen, Haifan Gong, and Guanbin Li
School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
chengq26@mail2.sysu.edu.cn, haifangong@outlook.com, liguanbin@mail.sysu.edu.cn

Abstract. This paper describes our submission to the Medical Domain Visual Question Answering Task of ImageCLEF 2020. Owing to the information inequality between images and questions in this task, we forgo complex cross-modal fusion strategies and concentrate on how to capture an effective visual representation. Based on the observation of a long-tailed distribution in the training set, we utilize the bilateral-branch network with a cumulative learning strategy to tackle this issue. Besides, to alleviate the issue of limited training data, we design an approach to extend the training set based on Kullback-Leibler divergence. Our proposed method achieved a score of 0.426 in accuracy and 0.462 in BLEU, which ranked 4th in the competition. Our code is publicly available.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.
Corresponding author: Guanbin Li.
Code: https://github.com/haifangong/HCP-MIC-at-ImageCLEF-VQA-Med-2020

1 Introduction

Visual Question Answering (VQA) aims at answering questions according to the content of the corresponding images. In recent years, researchers have made great progress on the VQA task with many effective methods and large-scale datasets. With the purpose of supporting clinical decision making and improving patient engagement, the VQA task has been introduced into the medical field. To promote the development of medical VQA, ImageCLEF [10] organizes the 3rd edition of the Medical Domain Visual Question Answering Task [3] (see examples in Figure 1).

Compared to general VQA, the valid medical data for training is limited in the ImageCLEF 2020 VQA-Med task. Besides, the task focuses particularly on questions about abnormalities, which differs from previous editions of the VQA-Med task. We argue that the semantic information carried by the questions is limited due to the single theme of the ImageCLEF 2020 VQA-Med task, whereas there are many kinds of abnormal medical images, which require an effective visual representation to be distinguished.

Fig. 1. Three examples of images and corresponding question-answer pairs in the ImageCLEF 2020 VQA-Med training set: (a) Q: "Is the mri normal?" A: "Yes"; (b) Q: "Are there abnormalities in this ct scan?" A: "No"; (c) Q: "What is abnormal in the x-ray?" A: "Vacterl syndrome".

Fig. 2. Long-tailed distribution in the ImageCLEF 2020 VQA-Med training set (number of images per abnormality).

In this paper, we describe the method we developed to deal with the above concerns. Based on the observation of the questions, we divide them into three groups and utilize a pre-trained BioBERT [12] to classify them. As for visual representation, we map abnormalities to medical images and discover a long-tailed distribution in the training set (as shown in Figure 2). Thus, we apply the bilateral-branch network with a cumulative learning strategy [19] to obtain an effective visual representation. In addition, we propose a retrieval-based candidate answer selection algorithm to further improve the performance.
Last but not least, to alleviate the issue of limited training data, we design an approach to expand the training set based on Kullback-Leibler (KL) divergence.

2 Related Work

The common framework of VQA systems is composed of four parts: an image encoder, a question encoder, a cross-modal fusion strategy, and an answer predictor. Many researchers highlight and explore the cross-modal fusion strategy for a better combination of visual and linguistic information. Some works [6,11] utilize compact bilinear pooling methods to capture the joint representation between images and questions. Yang et al. [16], Cao et al. [4], and Anderson et al. [2] exploited the question information to attend to the corresponding sub-regions of the image. Yu et al. [17] proposed a co-attention mechanism between images and questions to obtain better multi-modal alignment and representation. However, based on the observation that the ImageCLEF 2020 VQA-Med task focuses particularly on questions about abnormalities, we argue that identifying the abnormalities relies on the information from the images rather than from the questions. Thus, we forgo complex cross-modal fusion strategies owing to the information inequality between images and questions, and concentrate on how to obtain an effective visual representation.

As for visual representation in VQA systems, the bottom-up feature representation [2] based on deep CNNs is adopted by many works. Anderson et al. [2] utilized Faster R-CNN [15] to capture region-specific features in a bottom-up attention manner, which boosted the performance of VQA and image captioning tasks. However, since not all radiology images contain object-level annotations, medical VQA systems usually apply a CNN to extract grid-like feature maps as visual representations. In the ImageCLEF 2020 VQA-Med task, we discover a long-tailed distribution phenomenon in the training set. Therefore, we adopt the bilateral-branch network with a cumulative learning strategy to obtain an effective visual representation.

3 Datasets

In the ImageCLEF 2020 VQA-Med task, the dataset includes a training set of 4000 radiology images with 4000 question-answer (QA) pairs, a validation set of 500 radiology images with 500 QA pairs, and a test set of 500 radiology images with 500 QA pairs. The questions mainly focus on the abnormalities of medical images and come in two forms: one asks about the existence of abnormalities in the image, and the other asks about the type of abnormality. Figure 1 shows three examples from the VQA-Med dataset.

The VQA-Med-2019 dataset [1] can be used as additional training data; its training set contains 3200 medical images associated with 12792 QA pairs. However, different from the VQA-Med-2020 dataset, it covers four main categories of questions: Modality, Plane, Organ system, and Abnormality. In this paper, we only leverage its Abnormality subset to extend the VQA-Med-2020 training set.

Fig. 3. Overview of the proposed medical VQA framework: question semantic classification, vision-based candidate answer classification, and retrieval-based candidate answer selection (example input: "What is the primary abnormality in this image?"; output: "Adrenal adenoma").

4 Methodology

As shown in Figure 3, our medical VQA framework consists of three parts: question semantic classification, vision-based candidate answer classification, and retrieval-based candidate answer selection. We train the first two parts separately and then connect all the components to predict the final answer in the inference phase. Besides, we design a distribution-based algorithm to expand the training set to further improve the performance of the model.
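To make the inference-time connection between these components concrete, the following is a minimal routing sketch in Python; the function and parameter names are illustrative assumptions of ours and do not come from the released code.

from typing import Callable, Sequence

def predict_answer(
    question: str,
    image,
    classify_question: Callable[[str], str],                 # -> "open", "closed_abnormal" or "closed_normal"
    is_image_abnormal: Callable[[object], bool],              # binary normal/abnormal classifier
    rank_abnormalities: Callable[[object], Sequence[str]],    # abnormality labels sorted by score
    retrieve_from_top5: Callable[[object, Sequence[str]], str],
) -> str:
    """Route a question to the matching vision component and map its output to an answer."""
    q_type = classify_question(question)
    if q_type == "open":
        # open-ended: predict the abnormality, then refine with retrieval over the top-5 classes
        top5 = list(rank_abnormalities(image))[:5]
        return retrieve_from_top5(image, top5)
    abnormal = is_image_abnormal(image)
    if q_type == "closed_abnormal":     # e.g. "are there abnormalities in this ct scan?"
        return "yes" if abnormal else "no"
    return "no" if abnormal else "yes"  # closed_normal, e.g. "is the mri normal?"

The yes/no mapping simply inverts depending on whether the question asks about abnormality or normality, following the question categories and the vision-based classification described in Sections 4.1 and 4.2.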
4.1 Question semantic classification

According to the answer forms, we divide all questions into two categories: open-ended questions (e.g., Figure 1(c)) and closed-ended questions (e.g., Figure 1(a)(b)). Based on their semantics, the closed-ended questions can be further separated into two classes: closed-ended abnormal questions (e.g., Figure 1(b)), which ask whether the image contains an abnormality, and closed-ended normal questions (e.g., Figure 1(a)), which ask whether the image looks normal. In all, we need to classify each question into three categories: open-ended questions, closed-ended abnormal questions, and closed-ended normal questions.

For question semantic classification, a pre-trained BioBERT is adopted to classify the questions. Unlike the conventional BERT [5], BioBERT is a domain-specific language representation model pre-trained on large-scale biomedical corpora. We feed the 768-dimensional output vector of BioBERT into a 2-layer MLP to obtain the classification score of the input question.

4.2 Vision-based candidate answer classification

In this part, we need to classify the answer according to the radiology image and the category of the question. For the closed-ended questions, we apply a ResNet-34 [7] to distinguish normal from abnormal medical images. Combining its output with the fine-grained category information of the closed-ended question, we can simply choose the answer from "yes" or "no". As for the open-ended questions, we need to predict the specific abnormality of the input image as a classification problem. Due to the long-tailed distribution among the candidate answers in the training set, we apply the bilateral-branch network (BBN) with a cumulative learning strategy to deal with this problem.

BBN consists of two branches: a "conventional learning branch" and a "re-balancing branch". The conventional learning branch is responsible for representation learning, while the re-balancing branch is responsible for classifier learning. Meanwhile, a cumulative learning strategy adjusts the bilateral learning between the two branches during training. It is worth noting that, inspired by the attention mechanism, many advanced residual networks have been proposed, such as SE-Net [9], SK-Net [13], and NLCE-Net [8]. In this work, we replace the original ResNet in BBN with ResNeSt [18].

4.3 Retrieval-based candidate answer selection

For the open-ended questions, we observe during training that the top-5 accuracy is about 10% higher than the top-1 accuracy. To exploit this gap, we apply a retrieval-based top-5 answer selection to further improve the performance. The procedure consists of three steps. First, we build a feature dictionary for each class from the training set, where the features are extracted by the BBN. Second, we calculate the feature-level cosine similarity between the input sample and all training samples belonging to the top-5 predicted categories. Finally, we treat the answer of the most similar training sample as the final prediction.
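A minimal sketch of this selection step is shown below, assuming the per-class training features extracted by the BBN have already been stacked into arrays; the function and variable names are ours and only illustrative.

import numpy as np

def retrieve_answer(query_feat, top5_labels, feature_dict):
    """Return the label of the most similar training sample among the
    top-5 predicted categories, using cosine similarity in feature space.

    query_feat:   1-D feature vector of the test image (from the BBN).
    top5_labels:  the five candidate answers ranked by the classifier.
    feature_dict: label -> (N_label, D) array of training features."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    best_label, best_sim = top5_labels[0], -1.0
    for label in top5_labels:
        feats = feature_dict[label]
        feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
        sim = float(np.max(feats @ q))   # closest training sample of this class
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label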
4.4 Expanding the training set by Kullback-Leibler divergence

Since the valid medical data for training is limited in the ImageCLEF 2020 VQA-Med task and external datasets are allowed, we expand the training set with data from the VQA-Med-2019 dataset.

Before extending the training set, we define the distribution of the VQA-Med-2020 training set as P_tr, which is obtained by:

P_{\mathrm{tr}}(k) = \frac{n_k}{\sum_{j=1}^{C} n_j}    (1)

where k and j index the categories, C denotes the number of categories, and n_k denotes the number of training samples in category k. We compute the distribution of the validation set P_v in the same way. The KL divergence between P_v and P_tr is defined as:

D_{\mathrm{KL}}(P_v \| P_{\mathrm{tr}}) = \sum_{k} P_v(k) \log \frac{P_v(k)}{P_{\mathrm{tr}}(k)}    (2)

We then expand the training set by the following steps. For each sample in the Abnormality training subset of the VQA-Med-2019 dataset, we tentatively add it to the VQA-Med-2020 training set and calculate the distribution of the new training set, \hat{P}_{\mathrm{tr}}, together with the KL divergence D_{\mathrm{KL}}(P_v \| \hat{P}_{\mathrm{tr}}). We keep the sample only if D_{\mathrm{KL}}(P_v \| \hat{P}_{\mathrm{tr}}) is lower than D_{\mathrm{KL}}(P_v \| P_{\mathrm{tr}}).
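A minimal sketch of this selection procedure follows, under the assumption that per-category sample counts are available and that the training distribution is updated after each accepted sample; the counts-based interface and all names are ours, not taken from the released code.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as count vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def expand_training_set(train_counts, val_counts, candidate_categories):
    """Greedily accept candidate samples whose category brings the training
    distribution closer (in KL divergence) to the validation distribution.

    train_counts:         per-category counts of the VQA-Med-2020 training set (length C).
    val_counts:           per-category counts of the VQA-Med-2020 validation set (length C).
    candidate_categories: category index of each candidate VQA-Med-2019 sample.
    Returns the indices of accepted candidates."""
    train_counts = np.asarray(train_counts, dtype=float).copy()
    accepted = []
    for i, k in enumerate(candidate_categories):
        trial = train_counts.copy()
        trial[k] += 1
        if kl_divergence(val_counts, trial) < kl_divergence(val_counts, train_counts):
            train_counts = trial           # keep the sample and update P_tr
            accepted.append(i)
    return accepted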
Table 1. Official results of the ImageCLEF 2020 VQA-Med task.

Team              Accuracy  BLEU
z liao            0.496     0.542
TheInceptionTeam  0.480     0.511
bumjun jung       0.466     0.502
Ours              0.426     0.462
NLM               0.400     0.441

Table 2. Ablation study on the VQA-Med-2020 validation set.

Method                                        Accuracy  Boost
Baseline                                      36.6%     -
+ BBN-ResNet-34                               51.0%     +14.4%
+ Training Set Expansion by KL Divergence     54.0%     +3.0%
+ BBN-ResNeSt-50                              55.0%     +1.0%
+ Image Center Cropping                       56.6%     +1.6%
+ Retrieval-based Candidate Answer Selection  57.2%     +0.6%

5 Experiments

5.1 Implementation details

As training data, we use the whole VQA-Med-2020 training set with 4000 questions to train the BioBERT for question semantic classification, and the extended dataset to train the vision-based models. Among them, 303 images are used to train a ResNet-34 to determine whether an image is abnormal, and 4039 images are used to train a BBN to recognize the abnormalities. Besides, a center cropping operation is applied to the input images.

We train the above models separately with the corresponding cross-entropy losses. The optimizer is SGD with momentum set to 0.9, the initial learning rate is 0.08, and the weight decay is 4e-4. We select the best model based on the performance on the validation set.

5.2 Evaluation

The VQA-Med competition uses accuracy and BLEU [14] as the evaluation metrics. Accuracy is calculated as the number of correctly predicted answers over the total number of answers. BLEU measures the similarity between the predicted answers and the ground-truth answers. As shown in Table 1, we achieved an accuracy of 0.426 and a BLEU score of 0.462 on the VQA-Med-2020 test set, which won 4th place in the competition.

5.3 Ablation study

In this section, we study the contributions of the components of our proposed method on the VQA-Med-2020 validation set, as shown in Table 2. The baseline contains a BioBERT for question semantic classification and two ResNet-34 models for vision-based candidate answer classification, and is trained on the original VQA-Med-2020 training set. Firstly, we replace one ResNet-34 with a BBN-ResNet-34 to better recognize the abnormalities, which surpasses the baseline by 14.4%. Expanding the training set by KL divergence brings a further improvement of 3.0%. The performance is boosted by another 1.0% using the more powerful ResNeSt-50 backbone. Then, we apply a center cropping operation to the input image to reduce noise, which leads to a 1.6% improvement. The retrieval-based candidate answer selection strategy brings a performance gain of 0.6%. Finally, we achieve 57.2% accuracy on the VQA-Med-2020 validation set.

6 Conclusion

In this paper, we describe the method we submitted to the ImageCLEF 2020 VQA-Med task. Considering the information inequality between images and questions in this task, we forgo complex cross-modal fusion strategies. We adopt the bilateral-branch network with a cumulative learning strategy to handle the long-tailed problem and obtain an effective visual representation. Besides, to alleviate the issue of limited training data, we design an approach to extend the training set based on Kullback-Leibler divergence. In addition, we propose a retrieval-based candidate answer selection module to further boost the performance. Our proposed method achieves an accuracy of 0.426 and a BLEU score of 0.462.

References

1. Abacha, A.B., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In: CLEF (2019)
2. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6077-6086 (2018)
3. Ben Abacha, A., Datla, V.V., Hasan, S.A., Demner-Fushman, D., Müller, H.: Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain. In: CLEF 2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25, 2020)
4. Cao, Q., Liang, X., Li, B., Li, G., Lin, L.: Visual question reasoning on general dependency tree. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
6. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770-778 (2016)
8. He, X., Yang, S., Li, G., Li, H., Chang, H., Yu, Y.: Non-local context encoder: Robust biomedical image segmentation against adversarial attacks. In: AAAI (2019)
9. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132-7141 (2018)
10. Ionescu, B., Müller, H., Péteri, R., Ben Abacha, A., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), LNCS vol. 12260. Springer, Thessaloniki, Greece (September 22-25, 2020)
11. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems. pp. 1564-1574 (2018)
12. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (September 2019). https://doi.org/10.1093/bioinformatics/btz682
13. Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)
14. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311-318. ACL (2002)
15. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91-99 (2015)
16. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 21-29 (2016)
17. Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
18. Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Mueller, J., Manmatha, R., Li, M., Smola, A.J.: ResNeSt: Split-attention networks. CoRR abs/2004.08955 (2020)
19. Zhou, B., Cui, Q., Wei, X.S., Chen, Z.M.: BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)