<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adapting Pre-Trained Visual and Language Models for Medical Image Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Siqi Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenshuo Zhou</string-name>
          <email>ws.zhou@foxmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yehui Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haifeng Huang</string-name>
          <email>huanghaifeng@baidu.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiyu Ye</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tong Zhang</string-name>
          <email>zhangt02@pcl.ac.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dalu Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Baidu Intelligent Health Unit</institution>
          ,
          <addr-line>Beijing 100085</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Peng Cheng Laboratory</institution>
          ,
          <addr-line>Shenzhen 518055</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the work carried out by the "wsq4747" team in the ImageCLEFmedical 2023 challenge for the Visual Question Answering subtask. Medical image question answering presents unique challenges due to the specialized nature of the medical field. Not only must the model generate accurate and coherent answers from the image and the question, but it must also capture the basic medical information conveyed by the image. To leverage the capabilities of large pre-trained image models, we utilized the state-of-the-art BLIP-2, combined with a giant vision transformer (ViT-g) and an open pre-trained language model (GLM-6B), as the foundation for the answer prediction subtask. To adapt this model to the medical field, we employed a two-stage fine-tuning process. Throughout training, the pre-trained GLM-6B was kept frozen, and the ViT-g and Q-Former modules were fine-tuned step by step to better align with the characteristics of the medical data. Our team's approach produced promising results with an accuracy (ACC) of 0.7396: our method achieved an ACC of over 0.8 on 6 questions, an ACC of over 0.7 on 10 questions, and an ACC of around 0.1 on two questions (due to an oversight on our part).</p>
        <p>CLEF 2023: Conference and Labs of the Evaluation Forum, September 18-21, 2023, Thessaloniki, Greece.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>Visual Question Answering</kwd>
        <kwd>Blip-2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        ImageCLEF[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], short for Image Retrieval Evaluation Campaign, is a component of CLEF
(Cross-Language Evaluation Forum), a European project in the fields of computer science and
information retrieval that organizes various research challenges annually to evaluate the performance
of multilingual information retrieval. The objective of ImageCLEF is to propel the advancement
of computer vision and multimedia information retrieval technologies. The task encompasses
both natural language processing and image recognition: given a query, which could be a
question, a statement, or a description in another form, the system must find
images relevant to the query in a vast image library.
      </p>
      <p>
        The challenges presented in ImageCLEF comprise multiple subtasks, including Visual
Question Answering (VQA)[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and medical image description. VQA is a multimodal task in
the field of artificial intelligence, with the objective of developing models that can interpret
visual content such as images or videos, and provide responses to corresponding natural
language queries. This task becomes significantly more complex when applied to intricate
scenarios in medical imaging. In the medical image question-answering task, we focused on
colonoscopy images. Colonoscopy images are inherently complex and high-dimensional, with
intricate relationships between visual features and medical semantics. Effectively modeling
these relationships presents a significant challenge. Additionally, the language used in medical
queries is often laden with complex medical jargon, necessitating a deep understanding of
medical concepts that may not be encapsulated in general language models. Furthermore,
procuring large-scale annotated data for training such models poses a difficulty due to privacy
concerns and the requirement for expert annotations. Nevertheless, despite these challenges, the
potential of medical VQA is considerable. Ongoing research continues to push the boundaries
of our capabilities in this crucial area. In this work, our team primarily focuses on the task of
medical image question-answering.
      </p>
      <p>Some methods freeze the image encoder, including the early work which adopts a frozen
object detector to extract visual features [3, 4, 5], and the recent LiT [6] which uses a frozen
pre-trained image encoder for CLIP[7] pre-training. Some methods freeze the language model to
use the knowledge from LLMs for vision-to-language generation tasks [8, 9]. The key challenge
in using a frozen LLM is to align visual features to the text space. To achieve this, Frozen[8]
finetunes an image encoder whose outputs are directly used as soft prompts for the LLM.</p>
      <p>In this task, we employed BLIP-2[10], a vision-language pre-training method recently
proposed by Li et al.[10]. BLIP-2 is a generic and efficient pre-training strategy that
bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders
and frozen large language models (LLMs). BLIP-2 bridges the modality gap with a lightweight
Querying Transformer, which is pre-trained in two stages. Building upon the authors' previous
work on BLIP, it has demonstrated superior performance compared to various other
vision-language pre-training methods, including Flamingo[11], across a range of vision-language tasks such as
visual question answering, image captioning, and image-text retrieval. In this paper, our method
is introduced in Section 2, the experiments, data, and results are presented in
Section 3, and a brief summary is given in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Architecture</title>
        <p>BLIP-2[10] is a sophisticated framework designed for vision-to-language tasks, comprising
three main components: an image encoder, a Querying Transformer (Q-Former), and an LLM, as
shown in Figure 1. The Q-Former serves as the trainable module that bridges the gap between a
frozen image encoder and a frozen LLM.</p>
        <p>During the pre-training generation phase, we connected the Q-Former, equipped with a frozen
image encoder, to the frozen LLM to leverage its language generation capacity.
As depicted in Figure 1, we employed a fully connected (FC) layer to linearly project the output
query embeddings to the same dimensionality as the LLM’s text embeddings. The projected
query embeddings were then prepended to the input text embeddings. Acting as soft visual
prompts, they condition the LLM on the visual representation extracted by the Q-Former. Given
that the Q-Former had been pre-trained to extract visual representations carrying language
information, it effectively played the role of an information bottleneck, offering the most useful
details to the LLM while discarding irrelevant visual information. This mitigated the burden of learning
visual-language alignment on the LLM, thereby alleviating the issue of catastrophic forgetting.</p>
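        <p>The projection-and-prepend step described above can be sketched with plain arrays. This is a minimal illustration, not the authors' implementation; the dimensions (32 queries at hidden size 768, a 4096-dimensional LLM embedding space, 12 text tokens) are assumptions chosen for the example.</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 32 learned Q-Former queries at hidden size 768,
# projected into a 4096-dimensional LLM embedding space.
NUM_QUERIES, QFORMER_DIM, LLM_DIM = 32, 768, 4096

query_output = rng.standard_normal((NUM_QUERIES, QFORMER_DIM))
w_proj = rng.standard_normal((QFORMER_DIM, LLM_DIM)) * 0.02  # the FC layer

# Linearly project the query embeddings to the LLM's embedding size.
visual_prompts = query_output @ w_proj

# Prepend the projected queries to the text token embeddings so they
# act as soft visual prompts for the frozen LLM.
text_embeds = rng.standard_normal((12, LLM_DIM))  # 12 text tokens
llm_input = np.concatenate([visual_prompts, text_embeds], axis=0)

print(llm_input.shape)  # (44, 4096): 32 visual prompts followed by 12 text tokens
```
        </preformat>
        <p>In the real model the LLM then attends over this combined sequence; only the Q-Former and the FC layer would receive gradients in the generative pre-training stage.</p>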
        <p>For our task, we employed the BLIP-2 model for visual question answering and
selected ViT-g/14 from EVA-CLIP[12] as our image encoder. For the language model, we selected
GLM-6B[13], a prefix-based, decoder-only model, to serve as our LLM.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dataset</title>
        <p>The dataset encompasses images spanning the entirety of the gastrointestinal tract, from
the mouth to the anus. It encapsulates various instances, including abnormalities, surgical
instruments, and normal findings. The images are procured from different procedures such as
gastroscopy, colonoscopy, and capsule endoscopy. The distribution of image sizes within our
dataset is illustrated in Table 1, exhibiting a broad array of dimensions. A minimal fraction of
the images, exactly 21, have dimensions less than 500 pixels. The bulk of our images, amounting
to 1441, belong to the medium size category with pixel dimensions ranging from 500 to 1000.
Lastly, a significant subset of our data, constituting 538 images, features dimensions exceeding
1000 pixels. This heterogeneity in image size amplifies the diversity and intricacy of our dataset,
thereby increasing the challenge and comprehensiveness of the Visual Question Answering
task at hand. For both Task 1 (VQA) and Task 2 (Visual Question Generation, VQG), a minimum
of 2000 image samples have been provided, each accompanied by eighteen question-and-answer
pairs. It should be noted that not all questions are pertinent to the corresponding image.</p>
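        <p>The size buckets summarized in Table 1 can be reproduced with a small helper. This is a sketch; the boundary convention (500 and 1000 pixels falling in the middle bucket) and the use of the larger image dimension are our assumptions, since the paper does not state them.</p>
        <preformat>
```python
from bisect import bisect

def size_bucket(max_dim):
    """Assign an image to one of Table 1's size buckets by its larger dimension.

    Boundary convention (500 and 1000 inclusive in the middle bucket)
    is an assumption for illustration.
    """
    return ("under 500", "500-1000", "over 1000")[bisect([500, 1001], max_dim)]

# Count a few example dimensions into buckets.
counts = {"under 500": 0, "500-1000": 0, "over 1000": 0}
for dim in [499, 500, 1000, 1001]:
    counts[size_bucket(dim)] += 1
print(counts)  # {'under 500': 1, '500-1000': 2, 'over 1000': 1}
```
        </preformat>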
        <p>In Task 1, since the data were not split into training and validation sets, we randomly
selected 10% of the image-text question-answer pairs as the validation set.</p>
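        <p>A split of this kind can be sketched as follows. We hold out whole images (with all their question-answer pairs) rather than individual pairs, which is an assumption about the split granularity; the seed and helper name are illustrative.</p>
        <preformat>
```python
import random

def split_by_image(image_ids, val_fraction=0.1, seed=42):
    """Hold out a fraction of images, with all their QA pairs, for validation."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    return set(ids[n_val:]), set(ids[:n_val])  # (train, validation)

# Toy example: 2000 images, as in the task description.
train_ids, val_ids = split_by_image([f"img_{i:04d}" for i in range(2000)])
print(len(train_ids), len(val_ids))  # 1800 200
```
        </preformat>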
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Training Strategy</title>
        <p>The training protocol for BLIP-2 is carried out in two distinct stages. In the first stage, a process
known as vision-language representation learning, the image encoder and language models are
frozen, allowing the model to tap into its inherent image understanding capabilities. The second
stage involves vision-to-language generative learning, where the LLM is frozen, maintaining
its existing text generation capabilities. When applying the BLIP-2 model to downstream
tasks, such as visual question answering, the LLM is kept frozen during the fine-tuning phase.
Meanwhile, the parameters of the image encoder and Q-Former are updated.</p>
        <p>Throughout these two stages, the language models are kept frozen to preserve their initial
functionalities. In contrast, the Q-Former is exclusively trained during this pre-training phase.
The role of the Q-Former is to efectively extract visual representations that align with the
corresponding textual information and to relay this information to the LLM. This focused
training approach allows BLIP-2 to achieve a higher level of correspondence between visual
and textual data.</p>
        <p>In our pre-training phase, we initialized our large-scale visual transformer (vit-g) and Query
Transformer (Q-Former) with weights from BLIP-2, which had been previously pre-trained
on the ImageNet[14] and COCO[15] datasets. However, our specific task focused on medical
imaging (endoscopic images) for the visual question answering task. It’s worth noting that
there is a significant domain shift between natural images and medical imaging data.</p>
        <p>In order to address this issue and to allow the visual encoder to extract image features more
delicately, we adopted a fine-tuning strategy. During this fine-tuning process, the parameters of
the LLM(GLM-6B) were kept frozen, while the vit-g and Q-Former were trained concurrently.
This strategy was designed to leverage the powerful visual representation capabilities of vit-g
and Q-Former, while also accommodating the specific characteristics and challenges of the
medical imaging domain.</p>
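        <p>The selective freezing described above amounts to excluding the LLM's parameters from the trainable set. The sketch below illustrates the idea with a name filter; the module prefixes ("visual_encoder.", "qformer.", "llm.") are hypothetical, and in PaddlePaddle one would actually set <monospace>stop_gradient = True</monospace> on the frozen parameters.</p>
        <preformat>
```python
def trainable_parameter_names(param_names):
    """Keep ViT-g and Q-Former parameters trainable; freeze the LLM (GLM-6B).

    Parameter names are assumed to be prefixed by their module, e.g.
    "visual_encoder.", "qformer.", "llm." (a hypothetical naming scheme).
    """
    frozen_prefix = "llm."
    return [n for n in param_names if not n.startswith(frozen_prefix)]

params = ["visual_encoder.blocks.0.attn.w", "qformer.query_tokens", "llm.layers.0.w"]
print(trainable_parameter_names(params))  # keeps the ViT and Q-Former entries only
```
        </preformat>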
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Implementation Details</title>
        <p>Our framework was developed using PaddlePaddle version 2.4.2 and trained on 8 Ascend 910
NPUs. The adapter plug-in PaddleCustomDevice was utilized in order to be compatible with
the Ascend NPU. The entire process, encompassing two training stages, spanned a total duration
of four days. The input image size was set to 224 × 224, and the batch size was fixed at 16 for
both fine-tuning stages. The model underwent fine-tuning for 100 epochs in the first stage, and
50 epochs in the second stage. Optimization was performed using an AdamW optimizer with a
weight decay of 10<sup>−4</sup>. The initial learning rate was set to 10<sup>−4</sup> and was incrementally adjusted
through a 1000-step warm-up phase. Additionally, we set the maximum output length of our
model to 10, as the majority of answers in the data clearly fell below this threshold.</p>
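        <p>The warm-up schedule above can be sketched as a simple function of the step count. This is a minimal illustration; the paper does not specify the schedule after warm-up, so a constant rate is assumed here.</p>
        <preformat>
```python
def learning_rate(step, base_lr=1e-4, warmup_steps=1000):
    """Linear warm-up to base_lr over the first warmup_steps, then constant.

    The constant rate after warm-up is an assumption for illustration.
    """
    scale = min(1.0, step / warmup_steps)
    return base_lr * scale

print(learning_rate(500))   # 5e-05 (halfway through warm-up)
print(learning_rate(2000))  # 0.0001 (warm-up finished)
```
        </preformat>
        <p>In PaddlePaddle, a comparable schedule is available as a built-in learning-rate scheduler that wraps the base rate with a linear warm-up phase.</p>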
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Settings</title>
        <p>In the endoscopic dataset, since there is no predefined division between training and validation
sets, we conducted a manual split to validate the performance of our model. Specifically, 10%
of the data, equating to 200 images with corresponding 3200 questions, was earmarked as the
validation set. Beyond this, we employed accuracy as our primary evaluation metric to gauge
the model’s effectiveness in predicting the correct responses. As demonstrated in Table 3, we
have evaluated our model performance on the validation set that was manually partitioned by
us. For the final submission, however, we leveraged the entire dataset for fine-tuning the model.</p>
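        <p>The accuracy metric, overall and broken down per question as in Table 3, can be computed as below. Exact string matching after normalization is our assumption about the metric; the function and variable names are illustrative.</p>
        <preformat>
```python
from collections import defaultdict

def accuracy_by_question(examples):
    """Exact-match accuracy overall and per question template.

    `examples` is a list of (question, predicted_answer, gold_answer) tuples;
    case-insensitive exact matching is an assumption for illustration.
    """
    per_q = defaultdict(lambda: [0, 0])  # question -> [hits, total]
    for question, pred, gold in examples:
        hit = int(pred.strip().lower() == gold.strip().lower())
        per_q[question][0] += hit
        per_q[question][1] += 1
    overall = sum(h for h, _ in per_q.values()) / sum(n for _, n in per_q.values())
    return overall, {q: h / n for q, (h, n) in per_q.items()}

overall, per_question = accuracy_by_question([
    ("Where is the abnormality?", "Colon", "colon"),
    ("Where is the abnormality?", "rectum", "colon"),
    ("How many polyps?", "1", "1"),
])
print(round(overall, 4))  # 0.6667
```
        </preformat>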
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results on Validation Data</title>
        <p>In Figure 2, we provide an illustrative comparison between the performances of models of
different parameter sizes, under conditions where the models were either kept frozen or
unfrozen. As shown in Figure 2, there is an observable trend that larger models generally
outperform the smaller ones, suggesting that the number of parameters plays a significant
role in the accuracy of the VQA task. Furthermore, it is also evident that when the models
are unfrozen, allowing the parameters to adjust during training, the accuracy increases across
all model sizes. This underlines the importance of parameter fine-tuning in optimizing model
performance in the context of medical VQA tasks.</p>
        <p>Additionally, we analyzed the accuracy rates for individual questions, as shown in Table 3.
From the validation set, it was apparent that the model had poor performance in predicting
the size and type of polyps. We summarized the answers to these two questions in Table 4.
To address these shortcomings, for the question What is the size of the polyp?, we trained a
five-class classification model (ResNet34), and for the question What type of polyp is present?, we
trained a four-class classification model (ResNet34). Both models achieved an accuracy rate of
over 0.99 in the validation set, thereby outperforming the BLIP-2 model’s output. Consequently,
the overall accuracy on the validation set rose from 0.9105 to 0.9305. This was the result we
submitted in the end.</p>
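        <p>The ensembling step described above reduces to a per-question override: for the two poorly-handled questions, the dedicated ResNet34 classifier's answer replaces BLIP-2's output. A minimal sketch, with hypothetical answer strings:</p>
        <preformat>
```python
def merge_predictions(question, blip2_answer, classifier_answers):
    """Use a dedicated classifier's answer where one exists for the question;
    otherwise fall back to the BLIP-2 prediction (a sketch of the override)."""
    return classifier_answers.get(question, blip2_answer)

# Hypothetical classifier output for one of the two override questions.
classifier_answers = {"What is the size of the polyp?": "5-10mm"}

print(merge_predictions("What is the size of the polyp?", "11-20mm", classifier_answers))
print(merge_predictions("How many polyps are there?", "1", classifier_answers))
```
        </preformat>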
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results on Test Data</title>
        <p>Our submission results, as depicted in Table 5, were not as impressive as anticipated on the
test set, yielding an accuracy of 0.7396. On six questions, the accuracy exceeded 0.8, while on
ten questions, it surpassed 0.7. However, the accuracy was around 0.1 for the questions What
color is the abnormality? and Where in the image is the abnormality?, significantly dragging
down the overall average. We speculate that this may be due to the visual encoder’s inability
to extract detailed features related to color and position. As illustrated in Figure 3, we present
one example of the predicted results obtained from the validation set for the answer prediction
task. This figure visually demonstrates how our model performs in terms of predicting answers,
ofering an insight into the capabilities of our approach.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Summary</title>
      <p>This paper has presented the work of the "wsq4747" team in the Visual Question Answering
task of ImageCLEFmedical VQA 2023. The model we utilized, a variant of BLIP-2 with
ViT-g and GLM-6B, underwent two stages of fine-tuning. Our team’s final accuracy was 0.7396,
demonstrating the effectiveness of our approach in generating high-quality answers
for medical image questions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The computing resources of Pengcheng Laboratory Cloudbrain II were used in this research. We
acknowledge the support provided by the OpenI Community (https://git.openi.org.cn).</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[3] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, Uniter: Universal image-text representation learning (2020). URL: http://dx.doi.org/10.1007/978-3-030-58577-8_7. doi:10.1007/978-3-030-58577-8_7.</p>
      <p>[4] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, X. Li, Y. Choi, J. Gao, Oscar: Object-semantics aligned pre-training for vision-language tasks (2020).</p>
      <p>[5] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, Vinvl: Making visual representations matter in vision-language models (2021).</p>
      <p>[6] X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, L. Beyer, Lit: Zero-shot transfer with locked-image text tuning (2022). URL: http://dx.doi.org/10.1109/cvpr52688.2022.01759. doi:10.1109/cvpr52688.2022.01759.</p>
      <p>[7] A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, Cornell University - arXiv (2021).</p>
      <p>[8] M. Tsimpoukelli, J. Menick, S. Cabi, S. Eslami, O. Vinyals, F. Hill, Multimodal few-shot learning with frozen language models, Neural Information Processing Systems (2021).</p>
      <p>[9] A. Tiong, J. Li, B. Li, S. Savarese, S. Hoi, Plug-and-play vqa: Zero-shot vqa by conjoining large pretrained models with zero training (2022).</p>
      <p>[10] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models (2023).</p>
      <p>[11] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, K. Simonyan, Flamingo: a visual language model for few-shot learning (2022).</p>
      <p>[12] Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, Y. Cao, Eva: Exploring the limits of masked visual representation learning at scale (2022).</p>
      <p>[13] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al., Glm-130b: An open bilingual pre-trained model, arXiv preprint arXiv:2210.02414 (2022).</p>
      <p>[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database (2009). URL: http://dx.doi.org/10.1109/cvpr.2009.5206848. doi:10.1109/cvpr.2009.5206848.</p>
      <p>[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context (2014) 740–755. URL: http://dx.doi.org/10.1007/978-3-319-10602-1_48. doi:10.1007/978-3-319-10602-1_48.</p>
      <p>[16] H. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. Le, J. Wei, Scaling instruction-finetuned language models (2022).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Snider</surname>
          </string-name>
          , G. Adams,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcıa Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J. A. A. A. R. I. C. V. K. A. S. G. I. Nikolaos</given-names>
            <surname>Papachrysos</surname>
          </string-name>
          , Johanna Schöler,
          <string-name>
            <given-names>H.</given-names>
            <surname>Manguinhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deshayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          , Overview of ImageCLEF 2023:
          <article-title>Multimedia retrieval in medical, socialmedia and recommender systems applications (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P. H. T.</given-names>
            <surname>d. L. M. A. R. V. T. Steven</surname>
          </string-name>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          , Andrea Storås,
          <article-title>Overview of imageclefmedical 2023 - medical visual question answering for gastrointestinal tract (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>