<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Multilingual VQA with Structured Prompts and Vision-Language Alignment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiongfei Yao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guo Niu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tao Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huanlin Mo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shengjun Deng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuaiwei Jiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Multilingual Visual Question Answering (VQA) represents a crucial research direction in vision-language models (VLMs). Although existing VLMs demonstrate strong performance in tasks like image captioning and simple visual dialogue, they still face significant challenges in reasoning over complex logical relationships and hypothetical scenarios. To systematically evaluate large language models' (LLMs) reasoning capabilities in multilingual and multimodal contexts, we participated in CLEF 2025's newly proposed MultimodalReason task. This task requires models to identify the unique correct answer from multiple candidates based on an image and a question posed in multiple languages, including English and Chinese. We propose a novel reasoning approach that integrates image compression, structured prompt construction, and vision-language alignment, and embed it into an advanced chat model. Our method, optimized specifically for English and Chinese, achieves leading performance in these sub-tasks, demonstrating its effectiveness for complex multilingual multimodal reasoning.</p>
      </abstract>
      <kwd-group>
        <kwd>Vision-Language Models</kwd>
        <kwd>Structured Prompting</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Multimodal Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Vision-Language Models (LVLMs) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] represent a pivotal breakthrough in the field of artificial
intelligence, marking a transformation in multimodal understanding and interaction. Multilingual
Visual Question Answering (VQA) is an important research direction within the domain of Visual
Language Models (VLMs). Enhancements in visual encoders [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and improvements in resolution scaling
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have played a crucial role in advancing the quality of practical visual understanding.
      </p>
      <p>
        This study focuses on the ImageCLEF25 task [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], which requires models to identify the single
correct answer to a question that is paired with an image and 3–5 candidate options. The questions span
multiple languages, including English and Chinese, and cover a wide range of disciplines and contexts.
On the EXAMS-V dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we propose a reasoning approach that integrates image compression,
structured prompt construction, and visual-language alignment, and incorporate it into an advanced
conversational model. Given our strong proficiency in English and Chinese, we specifically optimize
our method for these two language subtasks. Experimental results demonstrate that our approach
achieves leading performance in both language tracks.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Researchers have proposed a new multimodal understanding task, named MultimodalReason, aimed at
enhancing models’ joint semantic reasoning capabilities over images and text. This task requires
models to select, from several candidate answers, the single one consistent with both the image and
the question, representing a key challenge in the field of multimodal natural language understanding.</p>
      <p>As early as 2019, the Visual Commonsense Reasoning (VCR) task was introduced, emphasizing causal
reasoning and multiple-choice answering based on image content. It served as an early representative
of multimodal reasoning tasks. Subsequently, tasks such as NLVR2 and GQA further raised the bar for
models by requiring stronger consistency between image and text and fine-grained visual understanding.</p>
      <p>In 2023, researchers proposed the MaMMUT framework, which unified various multimodal tasks
under a language modeling paradigm. By jointly training on large-scale image-text data, the model
achieved outstanding performance on a broad range of multimodal reasoning tasks.
Meanwhile, models like BLIP-2 and MiniGPT-4 explored the decoupled integration of visual encoders and
large language models (LLMs), demonstrating strong generalization and scalability in tasks such as
image-text reasoning and question answering. To further enhance reasoning capabilities, LLaVA
introduced a method of integrating visual information into the language model context via prompts,
enabling joint vision-language understanding without modifying the parameters of the language model.</p>
      <p>
        To streamline the overall network structure, Qwen2.5-VL also aligns the architecture of the Vision
Transformer (ViT) more closely with the design principles of LLMs. Specifically, it adopts RMSNorm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for normalization and SwiGLU [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as the activation function. These design choices not only improve
computational efficiency but also enhance compatibility between the vision and language components,
further improving the model’s multimodal understanding performance.
      </p>
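      <p>
        For reference, both components follow their standard formulations: RMSNorm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] rescales each hidden vector by its root mean square and a learned gain g, while SwiGLU [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] gates one linear projection of the input with a Swish-activated second projection, where W and V are learned weight matrices and ⊙ denotes elementwise multiplication:
      </p>
      <disp-formula>
        <tex-math>\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} x_i^{2} + \epsilon}} \odot g, \qquad \mathrm{SwiGLU}(x) = \mathrm{Swish}(xW) \odot (xV), \quad \mathrm{Swish}(z) = z \cdot \sigma(z)</tex-math>
      </disp-formula>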
      <p>Inspired by the strong cross-task generalization capabilities of LLMs, the MultimodalReason task has
begun incorporating multimodal models similar to ChatGPT. In terms of prompt design, structured
prompts can effectively guide the model to attend to image regions relevant to key textual elements,
thereby improving the accuracy of semantic reasoning. Studies show that structured prompts play an
equally important guiding role in multimodal tasks, especially under few-shot learning or open-domain
conditions, where they demonstrate superior generalization.</p>
      <p>In task construction, researchers have attempted to enhance models’ multimodal understanding by
converting image content into text and then inputting both modalities into the model.</p>
      <p>Building on these advancements, we propose a multimodal reasoning task framework that integrates
image-text paired data, structured prompt design, and the strong language understanding capabilities of
models like Qwen2.5-VL, aiming to improve their performance on multimodal reasoning tasks involving
complex semantic relationships.</p>
    </sec>
    <sec id="sec-3">
      <title>3. EXAMS-V Dataset</title>
      <p>EXAMS-V is a comprehensive multilingual and multimodal dataset designed to evaluate the visual
reasoning capabilities of AI systems, particularly Visual Language Models (VLMs). The dataset contains
24,856 multiple-choice questions. It covers 13 languages, including English, Arabic, Chinese, German,
Bulgarian, Italian, Spanish, Urdu, Polish, Hungarian, Serbian, and Croatian, and spans a wide range of
academic subjects. The questions are drawn from real educational curricula across different regions
and countries, enhancing the dataset’s diversity, authenticity, and level of difficulty. The dataset is
divided into 16,500 training samples, 4,000 validation samples, and 3,570 test samples. Successfully
answering questions in EXAMS-V requires not only text comprehension but also the ability to interpret
visual layouts, analyze tables and charts, and perform multimodal reasoning that integrates visual and
linguistic information.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Qwen-VL is a powerful multimodal large model released by the Tongyi Qianwen team, equipped with
capabilities for joint image-text understanding and generation. The model typically employs advanced
vision encoders (such as Vision Transformer) for image representation, which are integrated with the
Qwen series of large language models that possess strong natural language understanding abilities. By
incorporating cross-modal alignment mechanisms and multi-turn dialogue capabilities, Qwen-VL is able
to deeply analyze various types of visual information—including image details, text, and charts—and
respond to natural language prompts with high-quality performance in visual question answering,
image-text reasoning, and language generation tasks.</p>
      <p>In this study, we adopt the Qwen-VL model to enhance cross-modal reasoning capabilities. Each
sample in the EXAMS-V dataset (an image and its corresponding multiple-choice question) is converted
into a structured prompt following the ChatML format, with a unified template designed to improve
consistency and stability in model behavior. As illustrated in Figure 1, the prompt design follows a
structured workflow consisting of three key steps: first, clearly identify the user’s question; second,
systematically analyze the image content (including objects, text, and background elements); and finally,
provide a concise answer based on the reasoning process. Additionally, a system prompt is used to
define the task goals and behavioral constraints, ensuring that the model operates in the role of an AI
assistant and adheres to a standardized reasoning procedure.</p>
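      <p>A minimal sketch of this conversion is given below. The message schema follows the OpenAI-style ChatML layout; the exact prompt wording and the helper name build_messages are illustrative assumptions rather than our verbatim prompts.</p>
      <preformat>
def build_messages(image_b64: str, question: str, options: list[str]) -> list[dict]:
    """Convert one EXAMS-V sample into ChatML-style messages (illustrative sketch)."""
    # System prompt: fixes the assistant role and the three-step reasoning procedure.
    system_prompt = (
        "You are an AI assistant for multiple-choice visual question answering. "
        "First identify the question, then systematically analyze the image "
        "(objects, text, background), and finally reply with one option letter "
        "inside &lt;answer&gt;...&lt;/answer&gt; tags."
    )
    # User turn: the base64-encoded image followed by the question and its options.
    option_lines = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    user_content = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        {"type": "text", "text": f"{question}\n{option_lines}"},
    ]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
      </preformat>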
      <p>To comply with API upload limits, images are preprocessed using the Python Imaging Library (PIL),
compressed to under 9MB, and encoded in base64 format. Inference is conducted via Alibaba Cloud’s
DashScope platform using an OpenAI-compatible API. The model returns plain-text responses, from
which the system extracts answers using regular expressions that match the option letter inside
&lt;answer&gt;...&lt;/answer&gt; tags (e.g., A). In cases where the output is malformed or the API
request fails, the system logs the error and defaults to a randomly selected answer as a fallback.</p>
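      <p>The preprocessing step can be sketched as follows. The 9MB threshold matches the API limit described above; the JPEG quality schedule and the function name compress_to_base64 are illustrative assumptions.</p>
      <preformat>
import base64
import io

from PIL import Image

MAX_BYTES = 9 * 1024 * 1024  # upload limit enforced by the API

def compress_to_base64(path: str) -> str:
    """Re-encode an image as JPEG, lowering quality until it fits under 9MB,
    then return it as a base64 string."""
    img = Image.open(path).convert("RGB")
    for quality in range(95, 10, -10):  # progressively stronger compression
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        if buf.tell() &lt;= MAX_BYTES:
            return base64.b64encode(buf.getvalue()).decode("ascii")
    # Keep the most aggressive encoding if the limit is still exceeded.
    return base64.b64encode(buf.getvalue()).decode("ascii")
      </preformat>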
      <p>The system supports batch inference using PyTorch’s Dataset and DataLoader interfaces, and
incorporates a checkpoint resumption mechanism to skip previously processed samples. All intermediate
results are streamed in real time to JSONL files, ensuring stability, robustness, and reproducibility in
large-scale evaluation tasks. With this structured prompting strategy and system-level optimizations,
Qwen-VL demonstrates enhanced accuracy and consistency in complex image-text reasoning tasks.</p>
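      <p>A minimal sketch of the batching and resumption logic is given below; the sample fields, batch size, and toy data are illustrative assumptions, and the collate function passes raw dictionaries through because inference happens via API calls rather than tensor operations.</p>
      <preformat>
from torch.utils.data import Dataset, DataLoader

samples = [{"id": i, "question": f"Q{i}"} for i in range(4)]  # toy stand-in data
done_ids = {0}  # sample IDs already present in the JSONL checkpoint file

class ExamsVDataset(Dataset):
    """Wraps EXAMS-V samples for batch inference."""
    def __init__(self, samples, done_ids):
        # Checkpoint resumption: skip samples that were already processed.
        self.samples = [s for s in samples if s["id"] not in done_ids]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# Raw dicts pass straight through: no tensor stacking is needed for API calls.
loader = DataLoader(ExamsVDataset(samples, done_ids), batch_size=8,
                    collate_fn=lambda batch: batch)
      </preformat>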
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Settings</title>
      <p>This study constructs a robust inference system centered on Qwen-VL-Plus, integrating image
compression, structured prompt design, batch processing, and checkpoint resumption mechanisms to enable
large-scale and stable evaluation on the EXAMS-V dataset. The system adopts a three-stage prompting
strategy: the system prompt defines the task objectives and behavioral constraints, while the user
prompt includes a base64-encoded image and multiple-choice question, guiding the model to perform
step-by-step reasoning and generate the final answer. All prompts follow the ChatML format to ensure
consistency and stability during API calls.</p>
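      <p>A minimal sketch of one such API call is shown below, assuming DashScope's OpenAI-compatible endpoint; the base URL and model name follow Alibaba Cloud's public documentation, and error handling is omitted for brevity.</p>
      <preformat>
import os

from openai import OpenAI

# DashScope exposes an OpenAI-compatible endpoint; the base URL and model
# name below follow Alibaba Cloud's public documentation and should be
# treated as assumptions here.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def ask_model(messages: list[dict]) -> str:
    """Send one ChatML-formatted request and return the plain-text reply."""
    response = client.chat.completions.create(
        model="qwen-vl-plus",
        messages=messages,
    )
    return response.choices[0].message.content
      </preformat>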
      <p>Images are preprocessed using a custom compression function to keep their size under 9MB, thereby
improving upload success rates and computational efficiency. Inference outputs are returned in the
format &lt;answer&gt;...&lt;/answer&gt;, from which the system extracts answers using regular expressions.
In cases of invalid responses, a fallback strategy is employed by selecting a random answer. All
predictions are streamed in real time to JSONL files, and completed samples are automatically skipped
via checkpoint resumption, ensuring robustness and reusability in long-running evaluations. Batch
processing is implemented using PyTorch’s Dataset and DataLoader interfaces, supporting scalable and
interruption-tolerant evaluations across large test sets.</p>
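      <p>The answer extraction and checkpointing logic can be sketched as follows; the JSONL record layout and the fallback option pool are illustrative assumptions.</p>
      <preformat>
import json
import random
import re

def extract_answer(reply: str, options: str = "ABCDE") -> str:
    """Pull the option letter out of &lt;answer&gt;...&lt;/answer&gt;; fall back to a
    random choice when the output is malformed. In practice the fallback pool
    is restricted to the 3-5 options of the current sample."""
    match = re.search(r"&lt;answer&gt;\s*([A-E])\s*&lt;/answer&gt;", reply)
    return match.group(1) if match else random.choice(options)

def append_prediction(path: str, sample_id, answer: str) -> None:
    """Stream one prediction to the JSONL checkpoint file immediately."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"id": sample_id, "answer": answer}) + "\n")

def load_done_ids(path: str) -> set:
    """Read back completed sample IDs so restarted runs skip them."""
    try:
        with open(path, encoding="utf-8") as f:
            return {json.loads(line)["id"] for line in f}
    except FileNotFoundError:
        return set()
      </preformat>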
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This paper presents an in-depth study on the application effectiveness of the Qwen2.5-VL model
in multilingual visual-language understanding tasks. We systematically evaluate its performance
across multiple language subsets—including English, Chinese, German, and Urdu—using the
EXAMS-V multiple-choice question dataset. The experimental results show that Qwen2.5-VL significantly
outperforms the baseline model across all tasks, achieving an average accuracy improvement of 16.75%.
In particular, the German subset demonstrates a notable gain of nearly 18.6%, fully showcasing the
model’s powerful capabilities in multimodal modeling and cross-lingual transfer.</p>
      <p>To ensure the validity and efficiency of the evaluation, we design and implement a highly robust
evaluation pipeline that encompasses modules such as data organization, image compression, API
request construction, exception handling, and checkpoint resumption. This pipeline not only enhances
system stability and scalability but also effectively reduces resource consumption and the need for
manual intervention, highlighting its strong engineering practicality.</p>
      <p>Based on the comprehensive experimental results and systematic pipeline optimization, we draw the
following conclusions: Qwen2.5-VL exhibits excellent generalization capabilities in multilingual
visual-language understanding tasks; high-quality prompt construction and image processing mechanisms
have a significant impact on model performance; and at the deployment level, engineering optimizations
can greatly improve the usability of large-scale multimodal models.</p>
      <p>Future work will further explore the role of automatic prompt generation, domain knowledge
injection, and multi-stage vision-language interaction strategies in enhancing reasoning capabilities.
Moreover, the emergence of more high-quality multilingual multimodal datasets is expected to provide
broader opportunities for research on large models. As multimodal large models increasingly penetrate
critical application domains such as education, healthcare, and public services, future research will
also focus on improving model interpretability, safety, and fairness. Promising directions include the
construction of controllable multimodal reasoning frameworks with causal inference abilities, the
development of low-resource fine-tuning techniques adaptable to minority languages and edge cases,
and the design of transparent and debuggable visual-language alignment mechanisms. We further
anticipate advancements in open-domain multilingual evaluation benchmarks, multi-task collaborative
learning frameworks, and cross-modal knowledge representation methods, which will offer a solid
foundation for the practical deployment of general-purpose multimodal intelligence.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is supported by the Research Projects of Ordinary Universities in Guangdong Province
under Grant 2023KTSCX133 and the Guangdong Basic and Applied Basic Research Foundation under Grant
2022A1515140103.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling
checking and for paraphrasing and rewording. After using this tool/service, the author(s) reviewed
and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] OpenAI, Chatml documents, https://github.com/openai/openai-python/blob/main/chatml.md,
          <year>2024</year>
          . Accessed:
          <fpage>2025</fpage>
          -06-29.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[2] Anthropic, Claude</source>
          <volume>3</volume>
          .5 sonnet, https://www.anthropic.com/news/claude-3-5-sonnet,
          <year>2024</year>
          . Accessed:
          <fpage>2025</fpage>
          -06-29.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-B. Alayrac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schalkwyk</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hauth</surname>
          </string-name>
          , et al.,
          <article-title>Gemini: A family of highly capable multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.11805</source>
          (
          <year>2023</year>
          ). arXiv:
          <volume>2312</volume>
          .
          <fpage>11805</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Su</surname>
          </string-name>
          , G. Chen,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          , Internvl:
          <article-title>Scaling up vision foundation models and aligning for generic visual-linguistic tasks</article-title>
          ,
          <source>arXiv preprint arXiv:2312.14238</source>
          (
          <year>2023</year>
          ). arXiv:
          <volume>2312</volume>
          .
          <fpage>14238</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          , T.-S. Chua,
          <article-title>Screenspot-pro: Gui grounding for professional high-resolution computer use</article-title>
          , https://likaixin2000.github.io/papers/ScreenSpot_ Pro.pdf,
          <year>2025</year>
          . Preprint, Accessed:
          <fpage>2025</fpage>
          -06-29.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          , L.
          <string-name>
            <surname>-D. Ştefan</surname>
            ,
            <given-names>M.</given-names>
            -G. Constantin, M.
          </string-name>
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Damm</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Rückert</surname>
            ,
            <given-names>D. B.</given-names>
          </string-name>
          <string-name>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Bloch</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Brüngel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Idrissi-Yaghir</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schäfer</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>T. M. G.</given-names>
          </string-name>
          <string-name>
            <surname>Pakull</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Bracke</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Eryilmaz</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Becker</surname>
            , W.-W. Yim,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Codella</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          <string-name>
            <surname>Novoa</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Malvehy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <string-name>
            <surname>Hee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Nakov</surname>
            , I. Koychev,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fabre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macaire</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Lecouteux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Heinrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kiesel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
          </string-name>
          , Overview of imageclef 2025:
          <article-title>Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ahsan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Paev</surname>
            ,
            <given-names>I. Koychev</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of imageclef 2025 - multimodal reasoning</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D. I.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>I. Koychev</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , Exams-v:
          <article-title>A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models</article-title>
          ,
          <source>arXiv preprint arXiv:2403.10378</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>