<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Reasoning in Multilingual Visual Question Answering: A Prompt-Tuned Qwen2.5-VL-Plus Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Huanlin Mo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guo Niu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shengjun Deng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiongfei Yao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tao Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuaiwei Jiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a prompt-tuning approach based on Qwen2.5-vl-plus for the MultimodalReason task at ImageCLEF 2025, which involves answering multiple-choice questions grounded in images across multiple languages and complex reasoning scenarios. Our method achieves an accuracy of 0.4418 on the benchmark, representing a 63% improvement over the baseline SmolVLM (0.2701). Further analysis indicates that well-designed prompt templates play a crucial role in enhancing the model's cross-lingual reasoning performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual</kwd>
        <kwd>Multimodal Reasoning</kwd>
        <kwd>Vision-Language Models</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>ImageCLEF 2025</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Multimodal reasoning has become a key research focus in the field of artificial intelligence, particularly
due to its wide-ranging applications in tasks that integrate vision and language [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In recent years,
although large multimodal models have achieved significant progress in image-text understanding
tasks [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], they still face considerable challenges in modeling the complex semantic relationships
between images and text in real-world multilingual environments [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        To systematically evaluate models’ comprehensive capabilities in multilingual and multimodal
contexts, CLEF 2025 introduced the MultimodalReason task [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], which centers on Multilingual Visual
Question Answering (VQA). In this task, models are required to understand an image containing a
question along with four candidate answers and accurately identify the single correct option. This
setting demands the integration of image understanding, multilingual text processing, and logical
reasoning, closely reflecting real-world scenarios involving cross-language and cross-modal information
processing [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
      <p>In this study, we adopt Qwen2.5-VL-Plus as our primary model. This model is capable of handling both
image and multilingual text inputs and has demonstrated strong performance in various multimodal
benchmarks [10]. Compared to the baseline model SmolVLM, which uses a single system prompt,
we further introduce a hybrid prompt design that combines system instructions with exemplar-based
few-shot prompting [11, 12] to better activate the model’s reasoning capabilities.</p>
      <p>Experimental results show that Qwen2.5-VL-Plus performs well in multilingual visual question
answering tasks, particularly in integrating visual cues with multilingual expressions. Our approach
achieved excellent results in the competition; however, there is still room for improvement when
handling more complex cross-modal reasoning samples. We hope this research provides practical
experience and theoretical insights for the development of multilingual multimodal models and serves
as a valuable reference for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Multimodal Vision-Language Models</title>
        <p>
          Recent advances in vision-language models (VLMs) have significantly improved performance on tasks
that require understanding both visual and textual inputs, such as image captioning, visual question
answering (VQA), and visual entailment. Foundational models like CLIP [13], Flamingo [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and BLIP [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
have demonstrated the effectiveness of joint vision-language pretraining. More recent models such as
MiniGPT-4 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and LLaVA [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] combine large language models (LLMs) with vision encoders to enable
open-ended multimodal reasoning.
        </p>
        <p>However, most of these models are primarily trained on English-centric datasets and often rely
on pattern recognition rather than deep reasoning. Their performance on complex logical inference,
especially in multilingual and real-world settings, remains limited.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multilingual Visual Question Answering</title>
        <p>
          Multilingual VQA aims to evaluate a model’s ability to understand and reason about images and text
across different languages. Prior work in this area is relatively sparse compared to English-only VQA
benchmarks such as VQAv2 [14] or GQA [15]. Some efforts, such as MaXM [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], explore multilingual
alignment, but many VLMs still underperform on low-resource or morphologically rich languages.
        </p>
        <p>The MultimodalReason task introduced by CLEF 2025 provides a more realistic and challenging
multilingual setting by requiring models to process questions presented in various languages (e.g.,
English, Chinese, Spanish) while reasoning over visual content and selecting one correct answer from
multiple options.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Multimodal Reasoning and Prompt Engineering</title>
        <p>Deep reasoning in multimodal contexts remains a major challenge. While recent LLM-augmented
VLMs (e.g., GPT-4V, Qwen-VL [10]) demonstrate better reasoning performance than early models, they
still struggle with tasks that involve hypothetical scenarios, abstract logic, or long-range dependencies
between visual and textual elements.</p>
        <p>Prompt engineering has emerged as an effective technique to steer model behavior without fine-tuning.
In-context learning via exemplars or task-specific instruction formatting can significantly enhance
performance on reasoning tasks [12, 11]. In multimodal settings, hybrid prompting strategies that
combine visual inputs with structured textual cues (e.g., few-shot examples, multilingual instructions)
have shown promise, but their impact in multilingual VQA is still under-explored.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Baseline System</title>
        <p>The official baseline for this task employs the SmolVLM model, a lightweight vision-language model
optimized for inference efficiency.</p>
        <p>Using only a default system prompt, which can be seen in Figure 1, this model achieved an overall
accuracy of 16% on the development set. The default prompt includes only minimal instruction (e.g.,
“You are a helpful assistant”) and lacks task-specific context or examples.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Our Approach: Prompt Engineering with Exemplars and Model Upgrade</title>
        <p>To improve performance, we adopt a two-pronged strategy:
• Model Upgrade: We replace the baseline SmolVLM with the more capable Qwen2.5-VL-Plus
model, which has demonstrated stronger reasoning capabilities across multiple vision-language
tasks.</p>
        <p>• Hybrid Prompt Design: We introduce a hybrid prompt structure that combines:
1. A system prompt that defines the model’s role and multilingual capabilities (Figure 2).
2. One or more in-context examples ("sample prompts") drawn from the training set (Figure 3).</p>
        <p>Each exemplar includes an image, a multilingual question, five candidate answers (A–E), and the
correct answer.</p>
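        <p>The hybrid prompt can be assembled as a chat-message list. The following is a minimal sketch: the message schema follows the OpenAI-compatible format commonly used with Qwen endpoints, and the prompt wording and field names here are illustrative, not the exact templates shown in Figures 2 and 3.</p>

```python
# Sketch of the hybrid prompt assembly: system prompt + one in-context
# exemplar + the query sample. Schema and wording are assumptions, not
# the exact prompts used in our runs.

SYSTEM_PROMPT = (
    "You are a multilingual visual reasoning assistant. "
    "Read the exam question in the image, reason step by step, "
    "and answer with a single option letter (A-E)."
)

def build_messages(exemplar, query_image_url):
    """Combine the system prompt with one exemplar and the query sample."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # In-context exemplar: image + question, followed by the known answer.
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": exemplar["image_url"]}},
            {"type": "text", "text": exemplar["question"]},
        ]},
        {"role": "assistant", "content": f"Answer: {exemplar['answer']}"},
        # The actual test sample to be answered.
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": query_image_url}},
            {"type": "text", "text": "Answer with a single letter A-E."},
        ]},
    ]
```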
      </sec>
      <sec id="sec-3-3">
        <title>3.3. System Architecture</title>
        <p>As shown in Figure 4, we design a pipeline centered around the upgraded Qwen2.5-VL-Plus model. The
original image-question sample is processed to extract key information and generate the corresponding
sample prompt. Together with the system prompt, this is fed into the model.</p>
        <p>The system prompt provides macro-level instructions, while the sample prompt offers task-specific
context. The model parses the image content, task rules, and exemplar structure, and generates a
response. Finally, the answer is extracted via regular expressions and saved.</p>
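        <p>The extraction step can be sketched as a small regular-expression routine. The exact pattern used in our pipeline is not specified in the text, so this is an illustrative implementation.</p>

```python
import re

# Illustrative sketch of the answer-extraction step: the model's free-text
# response is reduced to a single option letter (A-E). The exact regular
# expression used in our pipeline may differ.
ANSWER_RE = re.compile(r"\b([A-E])\b")

def extract_answer(response_text):
    """Return the first standalone option letter in the response, or None."""
    match = ANSWER_RE.search(response_text.upper())
    return match.group(1) if match else None
```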
        <p>This design ensures accurate and efficient multimodal reasoning and output generation.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Analysis of Prompt Strategies</title>
        <p>Prompt design plays a vital role in multimodal visual question answering tasks. We analyze the
limitations of the baseline prompt and advantages of our hybrid prompt strategy below.
Limitations of Baseline Prompt
• Generalization Issues: The baseline prompt lists general analysis steps without enforcing
output format. This lack of structure often leads to noisy or incomplete answers, especially in
multilingual contexts.
• No Structured Reasoning Guidance: The prompt fails to explicitly guide reasoning or require
intermediate steps. Thus, even when the model arrives at the right answer, it is unclear whether
it followed a logical path or merely guessed.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Implementation Details</title>
        <p>We preprocess all images to a resolution of 448×448 pixels. For text input, we use the official tokenizer
and image processor from the Qwen2.5-VL-Plus repository. Prompts are inserted in a zero-shot format
unless otherwise specified. The model response is decoded as free text, and the final prediction is
determined by matching it to one of the five answer choices (A–E).</p>
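        <p>The matching step can be sketched as follows. The fallback string-similarity comparison is our assumption: the text only states that the decoded response is matched to one of the choices, without specifying how.</p>

```python
import re
from difflib import SequenceMatcher

def match_choice(response, choices):
    """Map a free-text model response to one of the answer choices.
    choices: dict mapping option letters ("A".."E") to answer texts.
    Prefer an explicit option letter in the response; otherwise fall back
    to the choice whose text is most similar (our assumption)."""
    m = re.search(r"\b([A-E])\b", response.upper())
    if m and m.group(1) in choices:
        return m.group(1)
    def similarity(text):
        return SequenceMatcher(None, response.lower(), text.lower()).ratio()
    return max(choices, key=lambda letter: similarity(choices[letter]))
```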
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Evaluation Methodology</title>
        <p>The evaluation of the MultimodalReason task is centered around a straightforward yet crucial metric:
accuracy. Given that the task requires participants to identify the single correct answer from a set of
four options presented in an image-based question, accuracy serves as the primary indicator of a
model’s performance. It directly reflects the proportion of correctly answered questions out of the total
number of questions in the dataset.</p>
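        <p>Concretely, the metric reduces to the following computation (a trivial sketch):</p>

```python
def accuracy(predictions, gold):
    """Fraction of questions whose predicted option equals the gold option."""
    assert len(predictions) == len(gold), "one prediction per question"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```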
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>Our dataset, the cornerstone of the MultimodalReason task, is accessible via EXAMS-V [16]. It is
partitioned into 16,724 training instances and 4,208 development/validation instances. The test data
will be released subsequently.</p>
        <p>The EXAMS-V dataset is a meticulously curated, multi-disciplinary, multimodal, and multilingual
benchmark. It encompasses 20,932 multiple-choice questions from 20 disciplines, spanning natural
science, social science, and fields like religion, fine arts, and business.</p>
        <p>EXAMS-V stands out with its rich multimodal features, including text, images, tables, graphs, charts,
maps, scientific symbols, and equations. Questions are presented in 11 languages from 7 language
families.</p>
        <p>Unlike typical benchmarks, EXAMS-V is assembled from school exam questions across various
countries and educational systems. This diverse origin endows the dataset with complexity, requiring
models to navigate language barriers, understand question nuances, and apply region-specific knowledge
for reasoning.</p>
        <p>Here is a snapshot of the dataset’s statistics (languages ranked from high- to low-resource):</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>The analysis across multiple languages shows that our approach substantially improves
multilingual performance. Specifically, the score of our submission (mhl2001) on the multilingual
evaluation rose from 0.2701 to 0.4418, a significant improvement that demonstrates the
effectiveness of our system in handling diverse languages simultaneously.</p>
        <p>Notably, among individual languages, Chinese and German show substantial progress. In
Chinese, the score rose from 0.2678 to 0.5553, a 107% increase, while in German, it
rose from 0.3101 to 0.4922, a 58.7% increase. These gains are attributed to the enhanced language
modeling capabilities of Qwen2.5-VL-Plus and the carefully crafted prompts that capture the
structural intricacies of multiple-choice reasoning questions.</p>
        <p>Even for relatively low-resource languages like Hungarian, our model still exhibits a notable
performance boost, with the score advancing from 0.2348 to 0.3563. This indicates the model’s
proficiency in cross-lingual generalization without the need for language-specific fine-tuning.</p>
        <p>Overall, these results show that our system not only improves accuracy but also provides
a more robust and scalable solution for multimodal reasoning tasks across a wide spectrum of linguistic
contexts.</p>
        <p>[Table: per-language accuracy for the Multilingual, English, Chinese, German, Arabic, and
Hungarian tracks; the scores discussed above are drawn from this table.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper proposes a simple yet effective approach to the CLEF 2025 MultimodalReason task by
combining a stronger vision-language model, Qwen2.5-VL-Plus, with a carefully crafted hybrid prompt
that integrates multilingual system instructions and exemplar-based few-shot learning. Without any
task-specific fine-tuning, our method significantly improves overall accuracy from 0.2701 to 0.4418,
achieving consistent gains across all 11 languages in EXAMS-V, including notable improvements in
low-resource languages like Hungarian. The results confirm that model capability and prompt design
jointly play a crucial role in enhancing multilingual multimodal reasoning.</p>
      <p>While our method performs well on multiple-choice VQA tasks, challenges remain in handling
complex images with dense text, domain-specific knowledge questions, and languages beyond the
EXAMS-V set. Future work will explore dynamic exemplar selection, step-by-step rationale generation,
lightweight parameter tuning (e.g., LoRA), and knowledge grounding via external resources, aiming to
further boost performance in challenging multilingual settings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the Research Projects of Ordinary Universities in Guangdong Province
under Grant 2023KTSCX133, the Guangdong Basic and Applied Basic Research Foundation under Grant
2022A1515140103.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used DeepSeek-V3 for grammar and spelling
checking. After using this tool, the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
      <p>[10] B. Inc., Qwen-vl: A multimodal foundation model for language, vision, and more, arXiv preprint
arXiv:2403.09047 (2024).
[11] T. Kojima, et al., Large language models are zero-shot reasoners, arXiv preprint arXiv:2205.11916
(2022).
[12] T. B. Brown, et al., Language models are few-shot learners, in: Advances in Neural Information
Processing Systems (NeurIPS), 2020.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, et al., Learning transferable visual models from natural language supervision, in:
International Conference on Machine Learning, PMLR, 2021.
[14] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the V in VQA matter: Elevating the
role of image understanding in visual question answering, in: CVPR, 2017, pp. 6904–6913.
[15] D. A. Hudson, C. D. Manning, GQA: A new dataset for real-world visual reasoning and compositional
question answering, in: CVPR, 2019, pp. 6700–6709.
[16] R. J. Das, S. E. Hristov, H. Li, D. I. Dimitrov, I. Koychev, P. Nakov, EXAMS-V: A multi-discipline
multilingual multimodal exam benchmark for evaluating vision language models, 2024. arXiv:2403.10378.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Baltrušaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.-P. Morency,</surname>
          </string-name>
          <article-title>Multimodal machine learning: A survey and taxonomy</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>41</volume>
          (
          <year>2018</year>
          )
          <fpage>423</fpage>
          -
          <lpage>443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>J.-B. Alayrac</surname>
          </string-name>
          , et al.,
          <article-title>Flamingo: a visual language model for few-shot learning</article-title>
          ,
          <source>arXiv preprint arXiv:2204.14198</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation</article-title>
          ,
          <source>Proceedings of the International Conference on Machine Learning (ICML)</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>Maximizing multilingual multimodal learning with prompt engineering</article-title>
          ,
          <source>arXiv preprint arXiv:2306.05450</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.-Y.</given-names>
            <surname>Dou</surname>
          </string-name>
          , et al.,
          <article-title>Coarse-to-fine vision-language pre-training with fusion in the backbone</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>16650</fpage>
          -
          <lpage>16663</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jyoti Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahsan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paev</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of imageclef 2025 - multimodal reasoning</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          , L.-D. Ştefan, M.-G. Constantin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          , W.-W. Yim,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>H. M.</given-names>
          </string-name>
          <string-name>
            <surname>Shan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Nakov</surname>
            , I. Koychev,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fabre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macaire</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Lecouteux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Heinrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kiesel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Visual instruction tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2304.08485</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , et al.,
          <article-title>Minigpt-4: Enhancing vision-language understanding with advanced large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2304.10592</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>