<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bridging the Modality Gap Through CoT-Enhanced Multimodal Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shengjun Deng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guo Niu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiongfei Yao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huanlin Mo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tao Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuaiwei Jiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper proposes a "Question Reconstruction before Answering" (QRA) prompting strategy for the ImageCLEF2025 multimodal reasoning task. The method first completes missing question stems using image information, then guides the language model through step-by-step reasoning and answering, thereby enhancing the model's comprehension and reasoning capabilities. On the EXAMS-V dataset, our investigation of different prompts and their impact on accuracy shows that QRA prompting demonstrates strong cross-lingual adaptability compared to conventional Chain-of-Thought (CoT) prompting. Experimental results show that this method effectively improves visual question answering performance without requiring OCR or additional fine-tuning, offering a new perspective for multimodal reasoning tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal reasoning</kwd>
        <kwd>Vision-language models</kwd>
        <kwd>Prompt engineering</kwd>
        <kwd>Chain-of-thought</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Visual-Language Models</title>
        <p>
          Visual-Language Models (VLMs) have made significant progress in multimodal understanding tasks in
recent years. CLIP[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] established the foundation for multimodal pretraining by constructing a general
image-text embedding space through contrastive learning of images and text. BLIP-2[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] introduced a
lightweight intermediate module to connect a frozen visual encoder with a language model, enhancing
image-text question answering and generation capabilities. LLaVA[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] combined CLIP and LLM, adding a
projection layer to improve the model’s understanding of images through visual instruction fine-tuning,
supporting various question answering and dialogue scenarios. VisionLLM[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] optimized the visual
attention mechanism based on BLIP-2, achieving more refined image-text alignment. Qwen2.5-VL[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
further expanded the model’s perceptual capabilities by optimizing the projection layer and other
methods, demonstrating strong reasoning abilities with excellent performance on multiple benchmarks.
        </p>
        <p>Although these methods have made progress in image-text alignment and language generation,
their reasoning remains limited when the prompt is incomplete. Our approach attempts
to address this shortcoming by reconstructing the question text and combining it with Chain-of-Thought
(CoT) reasoning.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Chain-of-Thought Prompt</title>
        <p>
          Chain-of-Thought (CoT) prompting significantly enhances the reasoning capabilities of large language
models in complex tasks by guiding the model to generate intermediate reasoning steps [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In scenarios
such as mathematics and commonsense question answering, CoT helps the model decompose problems
step-by-step and generate coherent reasoning chains, thereby improving accuracy. However, applying
CoT to multimodal tasks still faces challenges. On one hand, CoT typically relies on explicit textual
prompts, but key information in multimodal tasks may exist in visual form, making it difficult for the
model to correctly understand the problem. On the other hand, visual features lack clear semantic
boundaries, and directly inputting them into the language model often leads to prompt interpretation
deviations due to the "modality gap," which in turn affects the completeness and logical coherence of the
reasoning chain.
        </p>
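        <p>As a purely illustrative example (the wording below is our own assumption, not a prompt taken from the task), the difference between direct prompting and zero-shot CoT prompting can be sketched in Python as follows:</p>
        <preformat>
# Illustrative contrast between a direct prompt and a zero-shot CoT prompt.
# The trigger phrase is a common convention, not the exact prompt we used.
question = "If a train travels 60 km in 1.5 hours, what is its average speed?"

direct_prompt = question + "\nAnswer:"
cot_prompt = question + "\nLet's think step by step, then state the final answer."
        </preformat>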
        <fig id="fig1">
          <caption><p>Figure 1: Architecture of the proposed method. Images from the datasets pass through a visual encoder; the resulting image features are combined with the template "Image &lt;|image feature|&gt; The &lt;|language|&gt; problem is about &lt;|subject|&gt;" and fed to the LLM, which produces the answer.</p></caption>
        </fig>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        In image-only multimodal question answering tasks, such as the EXAMS-V dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], visual encoders
often lead to the loss of certain semantic information when abstractly representing images, particularly
the textual cues and detailed content in the images that are relevant to the question stem. This
information gap makes it difficult for language models to construct clear reasoning chains. In contrast,
when faced with obscured or incomplete questions, humans are usually able to reasonably complete
the missing information based on their existing background knowledge and contextual understanding,
thereby successfully completing the reasoning task.
      </p>
      <p>Inspired by this, we propose a "complete first, then reason" strategy, whose architecture is shown in
Figure 1. This strategy first uses image features to guide the language model in completing the missing
question information, thereby reconstructing the complete question stem; subsequently, based on the
reconstruction result, a Chain-of-Thought (CoT) reasoning mechanism is introduced to enhance the
model's cross-modal reasoning ability. This method not only enhances the model's understanding of
the task context but also effectively alleviates the semantic disconnect caused by modality differences.
Specifically, our method includes two key steps: 1) question background information prompt embedding,
and 2) Question-Reasoning-Answer prompting.</p>
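      <p>For concreteness, the following Python sketch outlines the two steps end to end. It is a minimal sketch under stated assumptions: query_vlm stands in for any chat-style vision-language model API, and the template wording is illustrative rather than the exact prompt used in our experiments.</p>
      <preformat>
# Minimal sketch of the "complete first, then reason" (QRA) pipeline.
# query_vlm is a hypothetical stand-in for a chat-style VLM API; the
# template wording is illustrative, not the exact prompt we used.

def build_qra_prompt(language: str, subject: str) -> str:
    # Step 1: embed question background information (language, subject).
    # Step 2: ask for reconstruction, reasoning, and answer in tagged form.
    return (
        f"The {language} problem in the image is about {subject}. "
        "First reconstruct the complete question from the image, then reason "
        "step by step, then answer, using the tags "
        "&lt;problem&gt;...&lt;/problem&gt; &lt;think&gt;...&lt;/think&gt; &lt;answer&gt;...&lt;/answer&gt;."
    )

def answer_from_image(image, language, subject, query_vlm):
    # Returns the tagged model output; parsing is sketched in Section 3.2.
    return query_vlm(image=image, prompt=build_qra_prompt(language, subject))
      </preformat>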
      <sec id="sec-3-1">
        <title>3.1. Question Background Information Prompt Embedding</title>
        <p>In practical multimodal question answering tasks, questions often involve specific languages and
subject backgrounds, with language expressions that are highly specialized and context-dependent.
Especially in scenarios containing only images, language models, lacking explicit context, are prone to
misunderstandings of the question stem.</p>
        <p>To address this issue, we introduce question background information embedding. Specifically, we
extract the language category (e.g., English, French, etc.) and subject labels (e.g., physics, chemistry, etc.)
of the question from the image’s metadata and use them as prior knowledge prompt words to guide the
language model in context modeling. This approach effectively mitigates semantic ambiguity caused
by language specificity, making the model more targeted and accurate when generating completion
content.</p>
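        <p>A minimal sketch of this embedding step, assuming the language and subject labels are available as metadata fields (the field names here are our own illustration):</p>
        <preformat>
# Build the background-information prefix from question metadata.
# The metadata field names ("language", "subject") are assumptions.

def background_prefix(meta: dict) -> str:
    language = meta.get("language", "unknown language")
    subject = meta.get("subject", "unknown subject")
    return f"The {language} problem in the image is about {subject}."

# Example:
# background_prefix({"language": "French", "subject": "physics"})
# -> "The French problem in the image is about physics."
        </preformat>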
        <fig id="fig2">
          <caption><p>Figure 2: Example outputs of the three prompting strategies on the same metre-bridge question. Standard Prompting ("There is a question in the image, please provide the correct answer.") yields an unreliable response: "Upon rechecking, it appears there might be a typo in the problem or the options. However, based on the calculations, the closest match is: D (135±0.23)". Chain-of-Thought Prompting asks the model to enclose its reasoning and answer in &lt;think&gt;...&lt;/think&gt; and &lt;answer&gt;...&lt;/answer&gt; tags and answers "A (60±0.15)". Question-Reasoning-Answer Prompting first reconstructs the question inside &lt;problem&gt;...&lt;/problem&gt; tags ("During an experiment with a metre bridge, the galvanometer shows a null point when the jockey is pressed at 40.0 cm using a standard resistance of 90 …"), then derives the Wheatstone bridge balance condition inside &lt;think&gt;...&lt;/think&gt;, and answers correctly: "C (60±0.25)".</p></caption>
        </fig>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Question Reconstruction before Answering Prompting</title>
        <p>After completing the question stem, the model still requires strong reasoning capabilities to correctly
perform the question-answering task. Traditional Chain-of-Thought (CoT) prompting, which guides
language models to generate intermediate reasoning steps, has achieved significant success in textual
reasoning tasks. However, directly applying the CoT mechanism to multimodal question-answering
tasks involving only images can lead to information confusion or insufficient semantic alignment,
resulting in the model’s inability to construct coherent and clear reasoning chains.</p>
        <p>To address this, we propose a structured "Question Reconstruction before Answering" guided
prompting strategy, aiming to explicitly separate the question comprehension process from the reasoning
process to enhance the model’s ability to build reasoning chains. Specifically, we design a unified prompt
template that introduces the &lt;Question&gt;...&lt;/Question&gt; tag to guide the model in first understanding
the question before engaging in step-by-step thinking and answering. We show the effects of the three
types of prompts in Figure 2.</p>
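        <p>Assuming the model follows the template, the tagged output can be parsed as in the following sketch (the regular expressions are our own illustration, not part of the method specification):</p>
        <preformat>
import re

# Extract the reconstructed question, the reasoning, and the final answer
# from the tagged model output; a field is None if the tag was not emitted.

def parse_qra_output(text: str) -> dict:
    def grab(tag: str):
        m = re.search(rf"&lt;{tag}&gt;(.*?)&lt;/{tag}&gt;", text, re.DOTALL)
        return m.group(1).strip() if m else None
    return {
        "problem": grab("problem"),
        "reasoning": grab("think"),
        "answer": grab("answer"),
    }
        </preformat>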
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Comparative Experiments</title>
        <p>To validate the effectiveness of our proposed QRA Prompting strategy, we participated in the
ImageCLEF2025 multimodal reasoning task and submitted test results for both the Multilingual Track and the
English Track. Table 1 lists our performance on the multilingual test set.</p>
        <p>In the Multilingual Track, our method ranked 6th among all participating teams, achieving an
accuracy of 0.5195. Compared to the official baseline method (accuracy of 0.2701), our approach
improved accuracy by 24.9 percentage points, demonstrating the strong competitiveness of our method in practical
tasks. This significant improvement indicates that our proposed structured strategy of "first completing
the question, then reasoning" has clear advantages in alleviating inter-modal information misalignment
and enhancing cross-modal understanding.</p>
        <p>Notably, we achieved near-top-tier performance without relying on any additional OCR modules or
fine-tuning the model for multilingual tasks. This demonstrates that QRA Prompting possesses strong
robustness and excellent transfer generalization capabilities, performing stably and reliably in complex
real-world multimodal reasoning scenarios.</p>
        <p>In the English Track, we also submitted model predictions based on QRA Prompting, achieving an
accuracy of 0.5371 and ranking 6th, as shown in Table 2. Our method consistently delivered strong
performance across both tasks, further validating its cross-language consistency.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ablation Study</title>
        <p>To systematically evaluate the contributions of each key component in QRA Prompting, we conducted
ablation experiments on the English validation set of the EXAMS-V dataset. The experiments used
QwenVL-2.5-32B as the base model, employed a zero-shot setting, and compared against standard Prompting
and Chain-of-Thought (CoT) Prompting methods. As shown in Table 3, the standard Prompting method
achieved an accuracy of 0.458, demonstrating relatively weak performance. The CoT Prompting method,
which guides the model through chain-of-thought reasoning, improved accuracy to 0.548. The QRA
Prompting strategy further enhanced this performance, achieving an accuracy of 0.582, an improvement
of 12.4 percentage points over the standard method and 3.4 percentage points over the CoT method.</p>
        <p>These results indicate that QRA Prompting not only inherits the advantages of chain-of-thought
reasoning from CoT but also effectively enhances the language model's understanding of image semantics
through explicit question stem completion, significantly boosting the model’s performance in complex
reasoning tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we addressed the Multimodal Reasoning task of the ImageCLEF2025 Multimodal Lab.
By employing the QRA strategy, we enhance the inference accuracy of models in visual question
answering tasks. Our approach involves constructing QRA prompt templates and integrating contextual
information. These strategies effectively address two key challenges faced by traditional
Chain-of-Thought (CoT) prompting in multimodal scenarios: they alleviate the "modality gap" problem caused by relying
solely on visual features and enhance the ability to reconstruct missing question text from visual data.</p>
      <p>Evaluation results demonstrate the feasibility and effectiveness of our method, achieving an accuracy
of 0.5195 on the multilingual version of the EXAMS-V test set. These findings indicate that our approach
provides a viable solution for visual question answering tasks that use only visual features, contributing
to the field of multimodal reasoning.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the Research Projects of Ordinary Universities in Guangdong Province
under Grant 2023KTSCX133, and the Guangdong Basic and Applied Basic Research Foundation under Grant
2022A1515140103.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used deepseek-v3 for grammar and spelling
checking. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          , in: International conference on machine learning,
          <source>PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19730</fpage>
          -
          <lpage>19742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Visual instruction tuning,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2304.08485. arXiv:2304.08485.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          , L.-D. Ştefan, M.-G. Constantin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          , W.-W. Yim,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025)</source>
          , Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jyoti Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahsan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025 - multimodal reasoning</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Hristov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.10378. arXiv:2403.10378.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , G. Zeng,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          , Visionllm:
          <article-title>Large language model is also an open-ended decoder for vision-centric tasks</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2305.11175. arXiv:2305.11175.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xie</surname>
          </string-name>
          , Z. Cheng, H. Zhang,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Qwen2.5-VL technical report</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.13923. arXiv:2502.13923.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>