<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCP-MIC at VQA-Med 2020: Effective Visual Representation for Medical Visual Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guanqi Chen</string-name>
          <email>chengq26@mail2.sysu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haifan Gong</string-name>
          <email>haifangong@outlook.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guanbin Li</string-name>
          <email>liguanbin@mail.sysu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Data and Computer Science, Sun Yat-sen University</institution>
          ,
          <addr-line>Guangzhou</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our submission to the Medical Domain Visual Question Answering Task of ImageCLEF 2020. Because of the information inequality between images and questions in this task, we abandon complex cross-modal fusion strategies and concentrate on capturing an effective visual representation. Having observed a long-tailed distribution in the training set, we use the bilateral-branch network with a cumulative learning strategy to tackle this issue. In addition, to alleviate the issue of limited training data, we design an approach that extends the training set using Kullback-Leibler divergence. Our method achieved an accuracy of 0.426 and a BLEU score of 0.462, ranking 4th in the competition. Our code is publicly available.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>Q:“Is the mri normal?”</title>
        <p>A: “Yes”
(a)</p>
      </sec>
      <sec id="sec-1-2">
        <title>Q:“Are there abnormalities in this ct scan?”</title>
        <p>A: “No”
(b)</p>
      </sec>
      <sec id="sec-1-3">
        <title>Q:“What is abnormal in the x-ray?”</title>
        <p>A: “Vacterl syndrome”
(c)</p>
        <p>[Figure 2: distribution of abnormalities in the training set. x-axis: Abnormalities; y-axis: Number of images.]</p>
        <p>In this paper, we describe the method we developed to deal with the above
concerns. After inspecting the questions, we divide them into three
groups and use a pre-trained BioBERT [12] to classify them. For visual
representation, we map abnormalities to medical images and observe a
long-tailed distribution in the training set (as shown in
Figure 2). We therefore apply the bilateral-branch network with a cumulative learning
strategy [19] to obtain an effective visual representation. In addition, we propose
a retrieval-based candidate answer selection algorithm to further improve
performance. Finally, to alleviate the issue of limited training data,
we design an approach that expands the training set using Kullback-Leibler (KL)
divergence.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The common framework for VQA systems is composed of four parts: an image
encoder, a question encoder, a cross-modal fusion strategy, and an answer
predictor. Many researchers highlight and explore the cross-modal fusion strategy for a
better combination of visual and linguistic information. Some works [
        <xref ref-type="bibr" rid="ref6">6,11</xref>
        ] utilize
compact bilinear pooling methods to capture the joint representation between
images and questions. Yang et al. [16], Cao et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and Anderson et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
exploited the question information to attend to the corresponding sub-regions of the
image. Yu et al. [17] proposed a co-attention mechanism between images and questions
to obtain better multi-modal alignment and representation. However, because the
ImageCLEF 2020 VQA-Med task focuses particularly on questions about abnormalities,
we argue that identifying the abnormalities relies on information from the images
rather than from the questions. Given this information inequality between images
and questions, we abandon complex cross-modal fusion strategies and concentrate on
obtaining an effective visual representation.
      </p>
      <p>
        As for visual representation in VQA systems, the bottom-up feature
representation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] based on deep CNNs is adopted by many works. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] utilized Faster
R-CNN [15] to capture region-specific features through bottom-up attention,
which boosted the performance of VQA and image captioning tasks. However,
since not all radiology images come with object-level annotations, medical VQA
systems usually apply a CNN to extract grid-like feature maps as visual
representations. In the ImageCLEF 2020 VQA-Med task, we observe a long-tailed
distribution in the training set. Therefore, we adopt the bilateral-branch
network with a cumulative learning strategy to obtain an effective visual
representation.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Datasets</title>
      <p>In the ImageCLEF 2020 VQA-Med task, the dataset includes a training set of 4000
radiology images with 4000 question-answer (QA) pairs, a validation set of 500
radiology images with 500 QA pairs, and a test set of 500 radiology images with
500 QA pairs. The questions mainly focus on abnormalities in medical images and
take two forms: one asks whether an abnormality exists in the image, and the
other asks for the type of abnormality. Figure 1 shows three examples from the
VQA-Med dataset.</p>
      <p>
        The VQA-Med-2019 dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] can be used as additional training data,
whose training set contains 3200 medical images associated with 12792 QA pairs.
However, unlike the VQA-Med-2020 dataset, it covers four main
categories of questions: Modality, Plane, Organ system, and Abnormality. In
this paper, we use only its Abnormality subset to extend the
VQA-Med-2020 training set.
      </p>
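      <p>[Figure 3: overview of our medical VQA framework, illustrated with the example question “What is the primary abnormality in this image?” and the example answer “Adrenal adenoma”.]</p>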
      <sec id="sec-3-1">
        <title>Question semantic classification</title>
      </sec>
      <sec id="sec-3-2">
        <title>Vision-based candidate answer classification</title>
      </sec>
      <sec id="sec-3-3">
        <title>Retrieval-based candidate answer selection</title>
        <p>As shown in Figure 3, our medical VQA framework consists of three parts:
question semantic classification, vision-based candidate answer classification, and
retrieval-based candidate answer selection. We train the first two parts
separately and then connect all the components to predict the final answer in the
inference phase. In addition, we design a distribution-based algorithm that expands the
training set to further improve the performance of the model.</p>
        <p>According to the answer form, we divide all the questions into two
categories: open-ended questions (e.g., Figure 1(c)) and closed-ended questions (e.g.,
Figure 1(a)(b)). Based on their semantics, the closed-ended
questions can be further separated into two classes: closed-ended abnormal questions
(e.g., Figure 1(b)), which ask whether the image is abnormal, and closed-ended
normal questions (e.g., Figure 1(a)), which ask whether the image looks normal.
In all, we need to classify each question into one of three categories: open-ended
questions, closed-ended abnormal questions, and closed-ended normal questions.</p>
        <p>
          For question semantic classification, a pre-trained BioBERT is adopted to
classify the questions. Unlike the conventional BERT [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], BioBERT is a
domain-specific language representation model pre-trained on large-scale
biomedical corpora. We feed the 768-dimensional output vector of BioBERT
into a 2-layer MLP to obtain the classification scores of the input
question.
        </p>
        <p>
For vision-based candidate answer classification, we need to classify the answer
according to the radiology image and the category of the question. For the
closed-ended questions, we apply ResNet-34 [7] to distinguish normal from abnormal
medical images. Then, combining this prediction with the fine-grained category of
the closed-ended question, we can simply choose the answer from “yes” or “no”.
        </p>
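        <p>The yes/no decision for closed-ended questions can be sketched as follows. This is a minimal illustration and not the authors' released code: the names QuestionType and answer_closed_ended are hypothetical, and the sketch assumes that the question classifier and the normal/abnormal image classifier have already produced their predictions.</p>
        <preformat>
```python
from enum import Enum


class QuestionType(Enum):
    """Hypothetical labels for the three question categories."""
    OPEN_ENDED = "open-ended"
    CLOSED_ABNORMAL = "closed-ended abnormal"  # asks whether the image is abnormal
    CLOSED_NORMAL = "closed-ended normal"      # asks whether the image looks normal


def answer_closed_ended(question_type, image_is_abnormal):
    """Combine the question category with the binary image prediction."""
    if question_type is QuestionType.CLOSED_ABNORMAL:
        return "yes" if image_is_abnormal else "no"
    if question_type is QuestionType.CLOSED_NORMAL:
        return "no" if image_is_abnormal else "yes"
    # Open-ended questions need the abnormality classifier instead.
    raise ValueError("open-ended questions have no yes/no answer")
```
        </preformat>
        <p>For example, answer_closed_ended(QuestionType.CLOSED_NORMAL, False) returns "yes", matching a question such as Figure 1(a) asked about an image predicted to be normal.</p>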
        <p>As for the open-ended questions, we need to predict the specific abnormality
in the input image as a classification problem. Because the candidate answers
in the training set follow a long-tailed distribution, we apply the bilateral-branch
network (BBN) with a cumulative learning strategy to deal with this problem.
BBN has two branches: the “conventional learning branch”, which is for
representation learning, and the “re-balancing branch”, which is for classifier
learning. Meanwhile, a cumulative learning strategy adjusts the bilateral
learning between the two branches. It is worth noting that, inspired by the
attention mechanism, many advanced residual networks have been proposed, such as
SE-Net [9], SKNet [13], and NLCE-Net [8]. In this work, we replace the original
ResNet in BBN with ResNeSt [18].</p>
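        <p>The cumulative learning strategy is not spelled out in this paper. As a rough sketch, the BBN paper [19] weights the two branches with a factor alpha = 1 - (T / T_max)^2 that decays over training, mixes the branch logits with that factor, and weights the two cross-entropy terms the same way. The function names below are hypothetical, the logits are plain Python lists rather than network outputs, and the sketch handles a single sample only.</p>
        <preformat>
```python
import math


def cumulative_alpha(epoch, total_epochs):
    """Adaptor from the BBN paper: alpha = 1 - (T / T_max) ** 2."""
    return 1.0 - (epoch / total_epochs) ** 2


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bbn_loss(logits_conv, logits_rebal, label_conv, label_rebal, alpha):
    """Mix the branch logits with alpha, then weight the two cross-entropy terms.

    logits_conv / label_conv come from the conventional (uniform) sampler,
    logits_rebal / label_rebal from the re-balancing (reversed) sampler.
    """
    mixed = [alpha * zc + (1.0 - alpha) * zr
             for zc, zr in zip(logits_conv, logits_rebal)]
    probs = softmax(mixed)
    loss_conv = -math.log(probs[label_conv])
    loss_rebal = -math.log(probs[label_rebal])
    return alpha * loss_conv + (1.0 - alpha) * loss_rebal
```
        </preformat>
        <p>With this schedule, alpha starts at 1 (only the conventional branch matters) and reaches 0 at the last epoch, so training attention moves gradually from representation learning towards the tail classes.</p>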
        <p>Retrieval-based candidate answer selection</p>
        <p>For the open-ended questions, we find that the top-5 score is about 10%
higher than the top-1 score during the training procedure.
To exploit this gap, we apply retrieval-based top-5 answer selection to
further improve the performance. The procedure has three steps. The
first step is to create a feature dictionary for each class from the training set;
these features are extracted by the BBN. The second
step is to calculate the feature-level cosine similarity between the input sample
and all the training samples that belong to the top-5 predicted categories. Finally,
we take the answer of the most similar training sample as the final prediction.</p>
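        <p>The three retrieval steps above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: features are plain lists of floats with non-zero norm, class scores are given as a dictionary from label to score, and the function names are hypothetical.</p>
        <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_feature_dictionary(train_features, train_labels):
    """Step 1: group the training features by their class label."""
    dictionary = {}
    for feature, label in zip(train_features, train_labels):
        dictionary.setdefault(label, []).append(feature)
    return dictionary


def retrieve_answer(query_feature, class_scores, feature_dictionary, k=5):
    """Steps 2 and 3: search the top-k classes for the most similar sample."""
    top_k = sorted(class_scores, key=class_scores.get, reverse=True)[:k]
    best_label, best_similarity = None, -2.0  # cosine similarity is at least -1
    for label in top_k:
        for feature in feature_dictionary.get(label, []):
            similarity = cosine_similarity(query_feature, feature)
            if similarity > best_similarity:
                best_label, best_similarity = label, similarity
    # Assumption: fall back to the top-1 class if no training feature is found.
    return best_label if best_label is not None else top_k[0]
```
        </preformat>
        <p>Restricting the search to the top-5 classes keeps the classifier's ranking as a prior, while the nearest-neighbour step can overturn the top-1 prediction when a training sample from another top-5 class is a closer match.</p>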
        <p>Expanding the training set by Kullback-Leibler divergence</p>
        <p>Since the valid medical data for training is limited in the ImageCLEF 2020
VQA-Med task and external datasets are allowed, we expand the training set
with data from the VQA-Med-2019 dataset. Before extending the training
set, we define the distribution of the VQA-Med-2020 training set as P<sub>tr</sub>, which
is obtained by:</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>P_{tr}(k) = \frac{n_k}{\sum_{j=1}^{C} n_j}</tex-math>
        </disp-formula>
        <p>where k and j are category indexes, C denotes the number of categories,
and n<sub>k</sub> represents the number of samples in category k. We calculate
the distribution of the validation set, P<sub>v</sub>, in the same way. The KL
divergence between P<sub>v</sub> and P<sub>tr</sub> is defined as:</p>
        <disp-formula>
          <label>(2)</label>
          <tex-math>D_{KL}(P_v \,\|\, P_{tr}) = \sum_{k} P_v(k) \log \frac{P_v(k)}{P_{tr}(k)}</tex-math>
        </disp-formula>
        <p>We then expand the training set as follows. For each sample in
the Abnormality training subset of the VQA-Med-2019 dataset, we tentatively add
it to the VQA-Med-2020 training set and calculate the distribution of the new
training set, P̂<sub>tr</sub>, together with the KL divergence
D<sub>KL</sub>(P<sub>v</sub> || P̂<sub>tr</sub>). We keep the sample in the training set if
D<sub>KL</sub>(P<sub>v</sub> || P̂<sub>tr</sub>) is lower than
D<sub>KL</sub>(P<sub>v</sub> || P<sub>tr</sub>).</p>
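        <p>The expansion procedure can be sketched as a greedy filter over the candidate samples. This is a hedged sketch, not the released code. It makes two assumptions that the text does not state: the training distribution is updated after every accepted sample, and a small smoothing constant eps is added so that categories missing from the training set do not cause a division by zero. All function names are hypothetical.</p>
        <preformat>
```python
import math
from collections import Counter


def distribution(labels, categories, eps=1e-8):
    """Equation (1) with additive smoothing (the smoothing is an assumption)."""
    counts = Counter(labels)
    total = len(labels) + eps * len(categories)
    return {c: (counts[c] + eps) / total for c in categories}


def kl_divergence(p, q):
    """Equation (2): KL divergence between two distributions over the same keys."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)


def expand_training_set(train_labels, val_labels, candidate_labels, eps=1e-8):
    """Return indexes of candidates that move the training set towards P_v."""
    categories = sorted(set(train_labels) | set(val_labels) | set(candidate_labels))
    p_val = distribution(val_labels, categories, eps)
    current = list(train_labels)
    current_kl = kl_divergence(p_val, distribution(current, categories, eps))
    accepted = []
    for index, label in enumerate(candidate_labels):
        trial_kl = kl_divergence(
            p_val, distribution(current + [label], categories, eps))
        if current_kl > trial_kl:  # keep the sample only if the divergence drops
            current.append(label)
            current_kl = trial_kl
            accepted.append(index)
    return accepted
```
        </preformat>
        <p>For a training set with labels 8 x "a" and 2 x "b", a balanced validation set, and candidates ["b", "a", "b"], the sketch accepts the two "b" samples and rejects the "a" sample, because only the former bring the training distribution closer to the validation distribution.</p>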
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Implementation details</p>
      <p>For training data, we use the whole VQA-Med-2020 training set of 4000
questions to train the BioBERT for question semantic classification. We use
the extended dataset to train the vision-based models: 303 images
are used to train a ResNet-34 that determines whether an image is abnormal,
and 4039 images are used to train a BBN that recognizes the abnormalities.
In addition, a center cropping operation is applied to the input image.</p>
      <p>We train the models mentioned above separately, each with its
corresponding cross-entropy loss. The optimizer is SGD with
a momentum of 0.9. The initial learning rate is 0.08, and the weight
decay is 4e-4. We select the best model based on its performance on the
validation set.</p>
      <p>Evaluation</p>
      <p>The VQA-Med competition uses accuracy and BLEU [14] as the evaluation
metrics. Accuracy is the number of correctly predicted answers divided by the
total number of answers. BLEU measures the similarity between the predicted answers and
the ground-truth answers. As shown in Table 1, we achieved an accuracy of 0.426
and a BLEU score of 0.462 on the VQA-Med-2020 test set, which ranked 4th
in the competition.</p>
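      <p>As a small illustration of the accuracy metric, the sketch below counts exact matches between predicted and ground-truth answers. The lowercasing and whitespace stripping are assumptions made for this example, not a description of the official evaluation script, and the function name is hypothetical.</p>
      <preformat>
```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()  # normalisation is an assumption
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```
      </preformat>
      <p>Under this definition, an open-ended prediction that names a related but different abnormality counts as wrong, which is why BLEU is reported alongside accuracy.</p>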
      <p>Ablation study</p>
      <p>In this section, we study the contributions of the components of our method on the
VQA-Med-2020 validation set, as shown in Table 2. The baseline
contains a BioBERT for question semantic classification and two
ResNet-34 models for vision-based candidate answer classification. We train
the baseline on the original VQA-Med-2020 training set.</p>
      <p>First, we replace a ResNet-34 with a BBN-ResNet-34 to better recognize the
abnormalities, which surpasses the baseline by 14.4%. Expanding the training
set by KL divergence brings a further improvement of 3.0%. Using the more
powerful ResNeSt-50 backbone boosts the performance by another 1.0%. Applying a
center cropping operation to the input image to reduce noise
leads to a 1.6% improvement. The retrieval-based candidate answer
selection strategy brings a performance gain of 0.6%. In the end, we achieve 57.2% accuracy
on the VQA-Med-2020 validation set.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we describe the method we submitted to the ImageCLEF 2020
VQA-Med task. Considering the information inequality between images and questions
in this task, we abandon complex cross-modal fusion strategies. We adopt the
bilateral-branch network with a cumulative learning strategy to handle the
long-tailed problem and obtain an effective visual representation. In addition, to
alleviate the issue of limited training data, we design an approach that extends
the training set using Kullback-Leibler divergence. We also propose a retrieval-based
candidate answer selection module to further boost the performance. Our method
achieves an accuracy of 0.426 and a BLEU score of 0.462.</p>
      <p>
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)</p>
      <p>8. He, X., Yang, S., Li, G., Li, H., Chang, H., Yu, Y.: Non-local context encoder: Robust biomedical image segmentation against adversarial attacks. In: AAAI (2019)</p>
      <p>9. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018)</p>
      <p>10. Ionescu, B., Müller, H., Péteri, R., Ben Abacha, A., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Stefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), vol. 12260. LNCS Lecture Notes in Computer Science, Springer, Thessaloniki, Greece (September 22-25, 2020)</p>
      <p>11. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems. pp. 1564–1574 (2018)</p>
      <p>12. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (September 2019). https://doi.org/10.1093/bioinformatics/btz682</p>
      <p>13. Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: IEEE Conference on Computer Vision and Pattern Recognition (2019)</p>
      <p>14. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318. ACL (2002)</p>
      <p>15. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. pp. 91–99 (2015)</p>
      <p>16. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 21–29 (2016)</p>
      <p>17. Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)</p>
      <p>18. Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Mueller, J., Manmatha, R., Li, M., Smola, A.J.: ResNeSt: Split-attention networks. CoRR abs/2004.08955 (2020)</p>
      <p>19. Zhou, B., Cui, Q., Wei, X.S., Chen, Z.M.: BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Vqa-med: Overview of the medical visual question answering task at imageclef 2019</article-title>
          . In: CLEF (
          <year>2019</year>
          )
          <fpage>3</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buehler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teney</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gould</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Bottom-up and top-down attention for image captioning and visual question answering</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>6077</fpage>
          –
          <lpage>6086</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ben Abacha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Müller, H.:
          <article-title>Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain</article-title>
          .
          <source>In: CLEF 2020 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Thessaloniki, Greece (September 22-25,
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Visual question reasoning on general dependency tree</article-title>
          .
          <source>In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>June 2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: NAACL-HLT</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fukui</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multimodal compact bilinear pooling for visual question answering and visual grounding</article-title>
          .
          <source>arXiv preprint arXiv:1606.01847</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>