<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of ImageCLEF 2025 - Multimodal Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dimitar Dimitrov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming Shan Hee</string-name>
          <email>mingshan.hee@mbzuai.ac.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhuohan Xie</string-name>
          <email>zhuohan.xie@mbzuai.ac.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rocktim Jyoti Das</string-name>
          <email>zrocktim.jyotidas@mbzuai.ac.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Momina Ahsan</string-name>
          <email>momina.ahsan@mbzuai.ac.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarfraz Ahmad</string-name>
          <email>sarfraz.ahmad@mbzuai.ac.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolay Paev</string-name>
          <email>paev@uni-sofia.bg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Koychev</string-name>
          <email>koychev@fmi.uni-sofia.bg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Preslav Nakov</string-name>
          <email>preslav.nakov@mbzuai.ac.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Informatics, Sofia University "St. Kliment Ohridski"</institution>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mohamed bin Zayed University of Artificial Intelligence</institution>
          ,
          <addr-line>UAE</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present an overview of the first edition of the ImageCLEF Multimodal Reasoning Lab at the 2025 iteration of the Conference and Labs of the Evaluation Forum (CLEF). The goal of the task is to evaluate how well vision-language models can reason over complex visual and textual examination material. The test dataset consists of 3,565 questions in 13 different languages. Participants received an image of a question, which included answer choices and metadata outlining the nature of the visual content within the image. Their objective was to choose one correct answer from a group of three to five options. The task had moderate participation with a total of 51 registered teams. Of these, 11 teams submitted results on the test set across all 13 languages and the multilingual leaderboard, with 129 graded submissions overall. The teams mainly used zero-shot approaches, while some chose few-shot methods or fine-tuning. Qwen-VL was the most commonly used model, followed by Gemini. Participants focused on prompt engineering, mostly using variations of instruction prompts that guided the models through processing steps to reach a final answer. Some teams approached the task from an optimization perspective, showing that well-optimized models can achieve competitive performance with fewer parameters and faster inference times. This task contributes to the broader effort of expanding resources for vision-language reasoning evaluation, particularly in low-resource languages. The dataset has been publicly released, along with the gold labels for the test set. We hope this resource will support future research on multilingual and multimodal understanding and foster the development of better and more efficient vision-language models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Understanding and reasoning over both images and text has long been recognised as a core challenge
for artificial intelligence. Early datasets such as VQA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and CLEVR [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] revealed how people naturally
combine language and vision when answering exam-style questions, interpreting charts, or solving
textbook problems. Pioneering captioning models like Show-and-Tell [3] and Show, Attend and Tell
[4] demonstrated the feasibility of bridging the two modalities end-to-end, while vision–language
pre-training frameworks including ViLBERT [5], LXMERT [6] and UNITER [7] established the joint
representations that underpin many modern systems. Scaling these ideas to the large language model
regime has produced today’s Multimodal Large Language Models (MLLMs), for example, CLIP [8],
Flamingo [9], and PaLM-E [10]. Over the past year, such MLLMs have achieved strong results on
visual question answering, open-ended captioning, multimodal dialogue, and even step-by-step math
reasoning. Nonetheless, rigorous audits show that their compositional reasoning abilities remain limited,
especially when complex textual cues must be tightly integrated with visual evidence [
        <xref ref-type="bibr" rid="ref3">11, 12, 13</xref>
        ].
      </p>
      <p>
        Over the past decade, a diverse suite of text-only reasoning benchmarks has driven significant
progress in the development of language models capable of more structured and transparent problem
solving. These benchmarks span a range of reasoning paradigms, including numerical reasoning [
        <xref ref-type="bibr" rid="ref4">14</xref>
        ],
multi-hop commonsense inference [
        <xref ref-type="bibr" rid="ref5 ref6">15, 16</xref>
        ], deductive logic [
        <xref ref-type="bibr" rid="ref7 ref8">17, 18</xref>
        ], mathematical reasoning [
        <xref ref-type="bibr" rid="ref10 ref9">19, 20</xref>
        ],
spatial planning [
        <xref ref-type="bibr" rid="ref11">21</xref>
        ], and domain-specific tasks in finance [
        <xref ref-type="bibr" rid="ref12 ref13">22, 23</xref>
        ], collectively establishing increasingly
rigorous benchmarks for evaluating reasoning capabilities in large language models. However,
language-only evaluation captures only part of human problem-solving. Tasks such as interpreting geometric
proofs, analyzing circuit schematics, or optimizing supply chains frequently require not only linguistic
understanding but also the ability to extract and reason over structured visual information. This
realization has led to a parallel surge of multimodal reasoning benchmarks: MATH-V adds diagrams
to word problems [12]; NPHardEval4V provides graph instances for knapsack and shortest-path
tasks to isolate algorithmic reasoning [11]; MMMU spans 30 academic subjects to expose gaps between
perception and domain expertise [
        <xref ref-type="bibr" rid="ref14">24</xref>
        ]; MLLM-COMPBench stresses pairwise relational comparisons [
        <xref ref-type="bibr" rid="ref3">13</xref>
        ];
and ScienceQA mixes visuals with natural-language questions, albeit only in English [
        <xref ref-type="bibr" rid="ref15">25</xref>
        ]. Taken
together, the field is progressing from text-centric reasoning to integrative multimodal evaluation; yet,
current suites still underrepresent the linguistic and structural diversity of real educational assessments,
highlighting ample room for broader, more authentic benchmarks.
      </p>
      <p>
        Recent benchmarks have taken steps to evaluate the reasoning abilities of multimodal models more
precisely. Specifically, MATH-V tests how models handle math questions that include diagrams [12],
while NPHardEval4V focuses on algorithmic tasks such as shortest paths and knapsack problems. It
separates reasoning from recognition and instruction-following to test models in isolation [11]. MMMU
introduces a broad set of problems across academic disciplines and highlights the gap between visual
understanding and subject-specific reasoning [
        <xref ref-type="bibr" rid="ref14">24</xref>
        ]. MLLM-COMPBench, on the other hand, evaluates
how well models handle comparisons between image pairs across multiple types of relativity, such as
emotion, spatiality, and quantity, but does not consider structured exam-style questions or integrate
text alongside image comparisons [
        <xref ref-type="bibr" rid="ref3">13</xref>
        ]. ScienceQA includes some visual elements and subject variation,
but it is limited to English and lacks consistent metadata or support for multilingual evaluation [
        <xref ref-type="bibr" rid="ref15">25</xref>
        ]. As
a result, current benchmarks either focus too narrowly on specific visual tasks or lack the language and
structural variety found in real educational assessments.
      </p>
      <p>
        In this paper, we describe the ImageCLEF 2025 Multimodal Reasoning task under the ImageCLEF 2025
Lab [
        <xref ref-type="bibr" rid="ref16">26</xref>
        ]. The task is a shared benchmark designed to evaluate model performance on multiple-choice
questions (MCQs) that may include both text and visual content. The dataset used is EXAMS-V [
        <xref ref-type="bibr" rid="ref17">27</xref>
        ],
further enriched by a new test set. The combined corpus spans 14 languages and 44 academic and
vocational subjects, covering science fields such as biology, physics, and chemistry, as well as disciplines
like history, informatics, fine arts, and business. The languages represented include Arabic, Chinese,
Urdu, Kazakh, and several European languages such as French, German, Bulgarian, and Polish. This wide
linguistic coverage enables the benchmark to evaluate models in both high-resource and low-resource
settings, making it well-suited for real-world multilingual education scenarios. Each example includes a
question with three to five answer options with a single correct answer, and may also contain a visual
element such as a graph, figure, table, or chemical diagram, depending on the question type as shown
in Figure 1. The task reflects a realistic educational setting, especially for low-resource languages and
disciplines, and supports model evaluation using a straightforward accuracy metric.
      </p>
      <p>One of the primary objectives of this task is to evaluate systems in low-resource settings. This
includes languages such as Arabic and Urdu, as well as scientific content that requires diagrams or
tables to answer the question. Each sample in the dataset contains metadata, including subject, grade
level, and the type of visual elements. This enables a deeper analysis of system performance and allows
for comparisons of results across languages, grades, and question types.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Multimodal Reasoning.</title>
        <p>
          Multimodal reasoning has emerged as a critical research area at the intersection of vision and language,
enabling models to integrate and interpret heterogeneous information sources. This capability is
essential for real-world applications such as educational assessments [
          <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20">27, 28, 29, 30</xref>
          ], online content
moderation [
          <xref ref-type="bibr" rid="ref21 ref22 ref23">31, 32, 33</xref>
          ], and scientific analysis [
          <xref ref-type="bibr" rid="ref15 ref24">25, 34</xref>
          ], where information often includes structured
tables, data charts, and annotated figures. In the education domain, students frequently engage with
classroom materials that combine visual and textual information, ranging from annotated diagrams to
chemical symbols and structures. As students increasingly use AI tools to tackle difficult questions,
it becomes essential to evaluate multimodal models on structured educational content that reflects
real-world curricula, ensuring that their responses are accurate and reliable.
        </p>
        <p>
          Several evaluation benchmarks have been introduced across different languages and disciplines to
assess the performance of models on examination questions. EXAMS-V [
          <xref ref-type="bibr" rid="ref17">27</xref>
          ] presents a multidisciplinary,
multimodal, and multilingual benchmark consisting of 20,932 multiple-choice questions. These questions
may include tables, figures, diagrams, maps, scientific symbols, and equations, and span more than 20
academic disciplines in 11 languages. Kaleidoscope [
          <xref ref-type="bibr" rid="ref18">28</xref>
          ] provides 20,911 exam questions in 18 languages
and 14 subjects, ensuring linguistic and cultural authenticity through contributions from a diverse
group of researchers around the world. More recently, MDK12-Bench [
          <xref ref-type="bibr" rid="ref20">30</xref>
          ] introduced more than 140,000
examples drawn from K-12 exams in six disciplines, enriched with difficulty labels and explanation
rationales to support fine-grained reasoning evaluation. These benchmarks collectively emphasize
the growing need to evaluate multimodal models through realistic, diverse, and linguistically rich
educational scenarios. This ImageCLEF 2025 - Multimodal Reasoning task further contributes to this
line of research by inviting participants to tackle real-world exam question challenges, with a test set
comprising 3,565 questions to evaluate the models’ capabilities.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Low-Resource Multilingual Languages.</title>
        <p>
          Although current multimodal and language models have demonstrated remarkable capabilities in
high-resource languages (i.e., languages with abundant data), many widely spoken languages in Southeast
Asia, Central Asia, and Africa remain underrepresented due to limited data availability (i.e., low-resource
languages). These low-resource languages pose a significant challenge for existing AI models, as they
differ substantially in structure, grammar, and cultural context [
          <xref ref-type="bibr" rid="ref25">35, 36</xref>
          ]. Recent efforts have sought to
bridge this gap by curating multilingual and culturally representative datasets. For example, KazMMLU
[37] adapts the MMLU benchmark to Kazakh to assess general knowledge reasoning within Central Asian
contexts. SGHateCheck [38] focuses on detecting hate language in Singaporean English, Malay, and
Mandarin, with an emphasis on linguistic code-switching and localized harms. Similarly, IndoNLP [39]
uses crowdsourcing to build benchmarks for several under-resourced Indonesian languages. Collectively,
these datasets highlight the linguistic diversity and cultural specificity that are often overlooked in the
mainstream benchmarks.
        </p>
        <p>The ImageCLEF 2025 Multimodal Reasoning Task aims to address this gap by introducing a curated
multimodal benchmark on K-12 examination questions that covers a diverse range of languages, scripts,
and question types. The task releases a test set of exam-style questions that extends the 11 languages
in EXAMS-V with three new languages: Kazakh, Urdu, and Spanish. The test set is designed to evaluate
the robustness and generalizability of multimodal models in multilingual contexts, thereby promoting
research that inclusively supports a broader spectrum of linguistic communities.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <sec id="sec-3-1">
        <title>3.1. Data Collection</title>
        <p>The newly developed test set contains questions from the languages present in the EXAMS-V dataset, as
well as questions in three additional languages: Kazakh, Urdu, and Spanish. The questions are sourced
from PDFs available online, mostly from annual school exams. For some of the languages in the original
dataset, we used the same sources, but we took more recent versions of the yearly exams. For Urdu
and Spanish, we considered PDFs available online, while for Kazakh, questions were scanned from
textbooks. After the PDFs are collected, we run a processing step that converts the PDF pages into a
series of images using the pdf2image<sup>1</sup> Python package.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Annotations</title>
        <p>The question extraction and annotation follow the steps used to create the original EXAMS-V dataset.
We use open-source software<sup>2</sup> to annotate the bounding boxes of the questions in line with the
annotation guidelines of the original EXAMS-V dataset: only multiple-choice questions with 3 to 5
options and exactly one correct answer are considered. The bounding box of each question encompasses
all possible answers, as well as the images, tables, and other necessary information provided. In some
languages, examination formats included questions that overflowed onto multiple pages, which we
discarded to simplify our annotation process. After annotating the bounding boxes, an automatic script
extracts each question as a separate image. The next step of the pipeline is the creation of a metadata
file that contains a unique ID, the path to the cropped image, the label for the correct answer, and
annotations for the type of visual elements included. The source PDFs provide the correct gold label,
reducing the risk of errors in annotation; therefore, a single annotator per record is sufficient.</p>
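        <p>As an illustration of the metadata step described above, the following minimal Python sketch shows one possible layout for a per-question record. The field names, the JSON Lines serialization, and the restriction of answer labels to the Latin letters A to E are assumptions made for this example; the text above only states that each record holds a unique ID, the path to the cropped image, the gold label, and the visual element annotations.</p>
        <preformat>
```python
import json
from dataclasses import dataclass, asdict, field

# Assumption: answer labels are the Latin letters A-E (3 to 5 options).
# Exams in some languages may use other option labels.
VALID_LABELS = ("A", "B", "C", "D", "E")


@dataclass
class QuestionRecord:
    # Hypothetical field names; only the kinds of fields come from the text.
    sample_id: str
    image_path: str
    answer_key: str
    visual_elements: list = field(default_factory=list)

    def __post_init__(self):
        # Each question has exactly one gold label from the allowed set.
        if self.answer_key not in VALID_LABELS:
            raise ValueError(f"unexpected answer label: {self.answer_key!r}")


def to_jsonl(records):
    """Serialize records as JSON Lines, one question per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Toy records with made-up IDs and paths.
records = [
    QuestionRecord("bg-0001", "images/bg-0001.png", "B", ["graph"]),
    QuestionRecord("ur-0042", "images/ur-0042.png", "D"),
]
jsonl = to_jsonl(records)
```
</preformat>
        <p>Validating the gold label when the record is created is one simple way to catch transcription slips early, which matters when a single annotator handles each record.</p>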
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Data Stats</title>
        <p>
          The dataset for the Multimodal Reasoning task is divided into three subsets: train, validation, and test.
Table 1 reports the statistics for the training and validation portions of the EXAMS-V dataset [
          <xref ref-type="bibr" rid="ref17">27</xref>
          ]. This
dataset comprises 20,932 questions across 20 subjects, covering grades 4 to 12 in 11 languages, providing
a solid foundation for training and evaluation. The newly developed test set includes a total of 3,565
new questions from recent public high school examinations. Table 2 presents detailed statistics on
the new data. The test set includes all languages from EXAMS-V, except for French, and introduces
three additional languages: Urdu, Kazakh, and Spanish. As with EXAMS-V, the new data preserves
the same diversity in languages and subjects, as well as question complexity, and includes 203 parallel
questions in three languages: Croatian, Serbian, and Italian. In terms of visual representation, the
test data provides a higher proportion of questions with visual features in most languages as seen in
Tables 2 and 3. We also observe that Figures and Graphs are the most common visual features across all
languages.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Framework</title>
      <sec id="sec-4-1">
        <title>4.1. Task organization</title>
        <p>The task was conducted in two phases: an exploration phase, during which participants familiarized
themselves with the publicly available training and validation data [40], followed by a test phase. In the
test phase, images of the questions from the test dataset were released along with metadata describing
the visual components within the questions. Participants submitted results to 14 different leaderboards:
one multilingual leaderboard and 13 individual leaderboards, one for each language. Participants were
allowed to make multiple submissions during this phase, but no feedback was provided. Final rankings
were determined based on each participant’s last submission at the end of the test phase.</p>
        <p><sup>1</sup> https://pypi.org/project/pdf2image/</p>
        <p><sup>2</sup> https://opencv.org/</p>
        <p>Table 1 (EXAMS-V training and validation statistics), Family column: Germanic, Sino-Tibetan, Romance, Germanic, Romance, Semitic, Slavic, Finno-Ugric, Slavic, Slavic, Slavic.</p>
        <p>Table 2 (new test set statistics), Family column: Germanic, Sino-Tibetan, Germanic, Semitic, Slavic, Finno-Ugric, Slavic, Slavic, Slavic, Romance, Indo-Aryan, Turkic, Romance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Measure</title>
        <p>We use accuracy as the evaluation metric for the task. Accuracy is calculated as the percentage of
questions where the model’s selected answer matches the correct option. Since each question has a
single correct answer, accuracy provides a simple and reliable way to compare model performance
across different languages and subject areas.</p>
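        <p>The metric can be sketched in a few lines of Python. This is an illustrative sketch rather than the official scorer: the function names, the toy data, and the choice to count unanswered questions as incorrect are assumptions made for the example.</p>
        <preformat>
```python
from collections import defaultdict


def accuracy(gold, predicted):
    """Percentage of questions whose predicted option equals the gold option."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    # Assumption: a question with no prediction counts as incorrect.
    correct = sum(1 for qid, answer in gold.items() if predicted.get(qid) == answer)
    return 100.0 * correct / len(gold)


def accuracy_by_group(gold, predicted, group_of):
    """Accuracy computed separately for each group (for example, each language)."""
    grouped = defaultdict(dict)
    for qid, answer in gold.items():
        grouped[group_of[qid]][qid] = answer
    return {group: accuracy(items, predicted) for group, items in grouped.items()}


# Toy data: question IDs, labels, and languages are made up for illustration.
gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
predicted = {"q1": "A", "q2": "B", "q3": "B"}  # q4 is unanswered
language = {"q1": "en", "q2": "en", "q3": "ur", "q4": "ur"}

overall = accuracy(gold, predicted)
per_language = accuracy_by_group(gold, predicted, language)
```
</preformat>
        <p>Grouping by language mirrors the leaderboard structure, and the same helper could group by subject or visual element type using the sample metadata.</p>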
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Baselines</title>
        <p>For our baseline experiments, we used the Instruct variants of four models: SmolLM-Instruct,
SmolVLM-Instruct, OLMO-Instruct, and MOLMO-Instruct. These models span a range of modalities and sizes,
providing a strong starting point for building an initial understanding of the task.</p>
        <p>We grouped the models based on their modality. SmolLM-Instruct and OLMO-Instruct are text-only
models, while SmolVLM-Instruct and MOLMO-Instruct are multimodal models capable of processing
both images and text. All models were evaluated in a zero-shot setting. For the language-only models
(SmolLM and OLMO), we provided image captions as input instead of raw images. For the
vision-language models (SmolVLM and MOLMO), the original image was used as input.</p>
        <p>Each model group was tested with two types of prompts. Prompt 1 is a short, direct instruction
asking the model to select the correct answer based on the input (caption or image). Prompt 2 is a
more detailed, step-by-step reasoning prompt encouraging the model to extract and analyze all relevant
elements, such as multilingual content, tables, or diagrams, before selecting an answer. For example,
Prompt 1 for VLMs instructed the model to analyze the image and reply with just the letter of the
correct option, while Prompt 2 guided the model to extract the question, options, and visual cues before
answering. A similar prompt pair was used for LLMs but applied to the caption text.</p>
        <p>We used the same two prompts across all subjects and languages, and made no further tuning or
task-specific engineering. This setup allowed us to evaluate how well these models could generalize to
the ImageCLEF MCQ format under consistent conditions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Overview of the Systems and Results</title>
      <sec id="sec-5-1">
        <title>5.1. Competition Results</title>
        <p>Table 4 shows participant results on the test set on all 14 leaderboards. The most popular leaderboards
were English, Multilingual, and Chinese, with 10, 9, and 7 teams participating, respectively. Some teams
participated in multiple leaderboards, with two submitting to all 14 and another two submitting to 13.
All teams significantly outperformed the baseline, except for elenat in the Multilingual and Bulgarian
leaderboards. The task proved to be moderately difficult, with some teams achieving over 90% accuracy
using the most recent commercial VLMs. Team MSA performed exceptionally well across the board,
securing first place in 11 of the 13 leaderboards they entered.</p>
        <p>Participating teams utilized a combination of proprietary and open-source large VLMs, including
Qwen2.5-VL, Gemini, SmolLM, and DeepSeek. The majority of approaches employed zero-shot or
few-shot techniques, leveraging metainformation about visual elements. The most common and successful
prompt strategies consisted of few-shot prompts that leverage image descriptions generated by different
VLMs. Most notably, MSA used Gemini 2.5 Flash to generate captions of the input image, which
were then validated and refined before being passed in a zero-shot prompt to Gemini 2.5 Pro.
ContextDrift employed a different prompting strategy, which relied on a sophisticated pipeline for
1-shot inference using Gemini 2.5 Flash, complemented by OCR-extracted textual content. Team
ymgclef applied a multi-prompt ensemble that combined a base prompt, a Chain-of-Thought prompt, and
a Role-Playing prompt, and achieved competitive results with this approach despite using Qwen-VL,
which underperformed compared to the Gemini models.</p>
        <p>A few teams experimented with model fine-tuning; however, they achieved lower rankings, primarily
due to the use of smaller models constrained by limited computational resources. lekshmiscopevit
opted for parameter-efficient fine-tuning of Qwen2.5-VL with LoRA, achieving competitive results in
the Multilingual leaderboard while reducing memory requirements by 75%. Team plutohbj also chose
a fine-tuning approach, using LoRA for efficient tuning. They added cross-modal attention to enhance
multimodal performance and applied stable optimization methods.</p>
        <p>Overall, the top-performing systems, utilized by the leading two to three teams in most leaderboards,
were from the Gemini family, specifically Gemini 1.5 Pro and Gemini 2.5 Flash. Following in rankings
was Qwen2.5-VL, which was the only model used for fine-tuning. Many participants combined different
models or prompt strategies in ensembles to further boost their scores. An interesting observation is
that participant models show comparable performance on parallel questions in Italian and Croatian,
likely due to their shared use of the Latin script. In contrast, performance on Serbian is significantly
lower (by approximately 20%), suggesting that models face greater difficulty with languages written in
Cyrillic script. This performance gap is likely attributable to the lower representation of such languages
in the training data.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Detailed System Descriptions</title>
        <p>(Keywords: Gemini-1.5, DeepSeek-R1-Distill-LLaMA, Structured Parsing, Modular Pipeline, Zero-shot)</p>
        <p>The authors propose a modular two-stage pipeline combining structured visual parsing with
language-based reasoning. First, Gemini-1.5 Flash is used to decompose images into structured JSON outputs
containing fields like question text, options, diagrams, labels, and tables. This is achieved through a
carefully crafted zero-shot prompt that suppresses reasoning and ensures accurate multilingual layout
parsing. The structured outputs are then passed to DeepSeek-R1-Distill-LLaMA, which uses a strict
answer-only prompt to select the correct option. The entire pipeline is zero-shot, language-agnostic,
and operates without fine-tuning, relying on prompt design for robustness and consistency.</p>
        <p>bingezzzleep [42]
(Keywords: Prompt-Enhancement, Feature Alignment, Qwen-VL)</p>
        <p>The authors propose a three-part system that constructs textual and visual features from the input
image and fuses them into the prompt of the LLM. The prompt consists of three parts. The first part
is a prompt-tuned standardized prompt, explaining the task. The second part is an encoding of the
input image via Vision Transformer that is compressed into a smaller feature set by the cross-attention
module. The third part is an LLM-processed description of the image, produced by a VLM. The three-part
prompt is given to Qwen-VL-Max for question answering. They apply a prompt-tuning strategy for
optimization. The results show that this prompt-enhancing strategy performs better than the direct
model use.</p>
        <p>ContextDrift [49]
(Keywords: Gemini-2.5, LRM, LMM, Thinking Budget, Few-Shot, Zero-Shot)</p>
        <p>The authors propose a pipeline that performs few-shot inference using Gemini-2.5-Flash (Thinking)
with extracted textual content. Textual content is extracted using OCRSpace, a cloud-based OCR service,
and is passed along with the image to the Gemini 2.5 Flash (Thinking) model for answer classification.
Their extensive experiments on the validation set reveal two critical findings. First, the impact of
performing an OCR augmentation is marginal and varies depending on the language. Although the
augmentation marginally improves model performance in the English dataset, it slightly decreases
performance on the Bulgarian dataset. Second, increasing the number of thinking tokens generally improves
the overall performance of the model. However, too many thinking tokens can have adverse effects,
particularly on Biology and Chemistry questions containing graphical elements.</p>
        <p>deng113abc [43]
(Keywords: Qwen-VL-2.5, Prompt engineering, Chain-of-thought prompting)</p>
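        <p>The authors propose a two-step prompting strategy called "Question Reconstruction before
Answering" (QRA) for multimodal question answering. In the first step, the model uses image features
and metadata such as language and subject to complete missing parts of the question. This helps the
model better understand the problem. In the second step, the completed question is passed through a
Chain-of-Thought prompting format, guiding the model to reason step-by-step and give an answer.
The method works in a zero-shot setting, without OCR or fine-tuning. The team achieved 6th place in
both multilingual and English tracks, showing strong performance across languages and improving
significantly over the official baseline.</p>
        <table-wrap>
          <label>Table 4</label>
          <caption>
            <p>Participant results on the test set (rank and team for five leaderboards; the values of the Acc column are missing from the source).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Leaderboard</th><th>Rank</th><th>Team</th></tr>
            </thead>
            <tbody>
              <tr><td>Bulgarian</td><td>1</td><td>ContextDrift †</td></tr>
              <tr><td>Bulgarian</td><td>1</td><td>ContextDrift †</td></tr>
              <tr><td>Bulgarian</td><td>2</td><td>ymgclef</td></tr>
              <tr><td>Bulgarian</td><td>3</td><td>bingezzzleep</td></tr>
              <tr><td>Bulgarian</td><td>3</td><td>MSA</td></tr>
              <tr><td>Bulgarian</td><td>4</td><td>plutohbj</td></tr>
              <tr><td>Bulgarian</td><td>5</td><td>baseline</td></tr>
              <tr><td>Bulgarian</td><td>6</td><td>elenat</td></tr>
              <tr><td>German</td><td>1</td><td>MSA</td></tr>
              <tr><td>German</td><td>2</td><td>ymgclef</td></tr>
              <tr><td>German</td><td>3</td><td>bingezzzleep</td></tr>
              <tr><td>German</td><td>4</td><td>plutohbj</td></tr>
              <tr><td>German</td><td>5</td><td>yaozihang</td></tr>
              <tr><td>German</td><td>6</td><td>mhl2001</td></tr>
              <tr><td>German</td><td>7</td><td>baseline</td></tr>
              <tr><td>Urdu</td><td>1</td><td>MSA</td></tr>
              <tr><td>Urdu</td><td>2</td><td>ymgclef</td></tr>
              <tr><td>Urdu</td><td>3</td><td>bingezzzleep</td></tr>
              <tr><td>Urdu</td><td>3</td><td>yaozihang</td></tr>
              <tr><td>Urdu</td><td>4</td><td>baseline</td></tr>
              <tr><td>Croatian</td><td>1</td><td>MSA</td></tr>
              <tr><td>Croatian</td><td>2</td><td>bingezzzleep</td></tr>
              <tr><td>Croatian</td><td>3</td><td>ymgclef</td></tr>
              <tr><td>Croatian</td><td>4</td><td>plutohbj</td></tr>
              <tr><td>Croatian</td><td>5</td><td>baseline</td></tr>
              <tr><td>Serbian</td><td>1</td><td>MSA</td></tr>
              <tr><td>Serbian</td><td>2</td><td>bingezzzleep</td></tr>
              <tr><td>Serbian</td><td>3</td><td>ymgclef</td></tr>
              <tr><td>Serbian</td><td>4</td><td>plutohbj</td></tr>
              <tr><td>Serbian</td><td>5</td><td>baseline</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>elenat [44]
(Keywords: BLIP, Prompt Ablation, SmolLM-360M, Zero-shot)</p>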
        <p>The authors propose a zero-shot pipeline that integrates image captioning with compact language
model reasoning. First, images are processed using BLIP (Base or Large) to generate captions under
three different prompt conditions: no prompt, “A photo of”, and “Describe what you see”. These prompts
are designed to influence the verbosity and descriptiveness of the generated captions. The resulting
caption is then inserted into a fixed-format prompt and passed to SmolLM-360M, a 360M-parameter
transformer optimized for low-resource inference, which selects the correct answer from the multiple
choices. The model is used without fine-tuning. This setup enables efficient multimodal reasoning
under minimal compute, and allows the authors to assess how prompt formulation affects downstream
answer prediction accuracy.</p>
        <p>lekshmiscopevit [45]
(Keywords: Qwen2.5-VL, LoRA, Model optimization, Efficient inference, Quantization)</p>
        <p>The authors propose an approach that utilizes the Qwen2.5-VL-72B Instruct model. They apply
parameter-eficient fine-tuning technique, such as LoRA or QLoRA, explicitly optimized for the
EXAMSV dataset. Initial experiments suggest that fine-tuning with as few as 0.1% of parameters could yield
accuracy improvements of 5–8% while maintaining generalization capabilities. To address performance
disparities across languages, the authors used language-specific adapter modules that are dynamically
integrated with the base model, with a focus on enhancing performance for underrepresented languages.
Through 4-bit quantization, specialized prompting techniques, and robust answer extraction methods,
the team achieved comparable performance across all languages and subjects while reducing memory
requirements by up to 75%.
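</p>
        <p>The reported reduction of up to 75% is consistent with simple weight-storage arithmetic. The sketch below assumes, which the source does not state, that the unquantized weights are stored in 16-bit precision. The function names are illustrative, the 72B parameter count is read off the model name, and the estimate ignores activations, the KV cache, and quantization metadata such as scaling factors.</p>
        <preformat>
```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def memory_reduction(bits_before: int, bits_after: int) -> float:
    """Fraction of weight memory saved when moving to a lower bit width."""
    return 1.0 - bits_after / bits_before


def trainable_params(num_params: float, fraction: float) -> float:
    """Number of parameters updated when only a fraction of them is fine-tuned."""
    return num_params * fraction


if __name__ == "__main__":
    params = 72e9  # assumption: parameter count taken from the "72B" model name
    print(f"16-bit weights: {weight_memory_gb(params, 16):.0f} GB")
    print(f"4-bit weights:  {weight_memory_gb(params, 4):.0f} GB")
    print(f"reduction:      {memory_reduction(16, 4):.0%}")
    print(f"0.1% trainable: {trainable_params(params, 0.001) / 1e6:.0f}M parameters")
```
        </preformat>
        <p>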
(Keywords: Qwen2.5-VL, Few-shot, Prompt engineering)</p>
        <p>The authors present a prompt-tuned pipeline leveraging Qwen2.5-VL-Plus to address the ImageCLEF
2025 Multimodal Reasoning task. Their system upgrades the baseline SmolVLM-2.5-1.7B model with
Qwen2.5-VL-Plus and introduces a hybrid prompt design composed of multilingual system instructions
and exemplar-based few-shot samples. The final prompt includes both macro-level role definition and
task-specific examples, aiming to improve reasoning and answer classification. Their pipeline processes
image-question pairs and generates structured responses using regex-based extraction. Evaluation on
the EXAMS-V benchmark shows a relative gain of about 63% over the baseline (from 0.2701 to 0.4418)
on the Multilingual dataset, with further notable improvements in Chinese and German. Their method
demonstrates robust multilingual generalization without task-specific fine-tuning.</p>
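        <p>As a rough illustration of regex-based answer extraction and of how the reported gain follows from the two scores, consider the minimal sketch below. The pattern, the function names, and the example responses are hypothetical and are not taken from the authors' system, which may parse a different output format.</p>
        <preformat>
```python
import re
from typing import Optional

# Hypothetical pattern: matches phrases such as "Answer: B" or "The answer is (C)".
ANSWER_PATTERN = re.compile(
    r"answer\s*(?:is)?\s*[:\-]?\s*\(?([A-E])\b", re.IGNORECASE
)


def extract_choice(response: str, valid=("A", "B", "C", "D", "E")) -> Optional[str]:
    """Return the option letter found in a model response, or None if absent."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return None
    letter = match.group(1).upper()
    return letter if letter in valid else None


def relative_gain(baseline: float, score: float) -> float:
    """Relative improvement of score over baseline."""
    return (score - baseline) / baseline


if __name__ == "__main__":
    print(extract_choice("Step 1 ... Therefore, the answer is B."))
    print(extract_choice("Answer: (c)"))
    print(f"{relative_gain(0.2701, 0.4418):.1%}")
```
        </preformat>
        <p>With the scores quoted above, the relative gain evaluates to roughly 63.6%, in line with the reported figure.</p>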
        <p>MSA [47]
(Keywords: Gemini 2.5, Gemini 1.5, Ensemble, Zero-shot)</p>
        <p>The authors propose a two-stage ensemble pipeline based entirely on the proprietary Gemini
model family. In the first stage, the authors use Gemini 2.5 Flash to generate detailed descriptions
of the input image, with a 1-shot prompt that encourages the model to keep the descriptions uniform. The
generated caption, together with the input image, is then passed to Gemini 1.5 Pro, which verifies that
the caption follows the expected format and is in the correct language, inferred from the input. It also
verifies the correctness of the labels and translates any stray text into the declared language. In the
second stage, the authors employ Gemini 2.5 Pro with a zero-shot prompt to solve the question and
output the correct answer. The system achieves very strong results, placing 1st in 11 of 13 tracks.
plutohbj [48]
(Keywords: LoRA, Meta learning, Qwen2.5-VL)</p>
        <p>The authors propose the Meta-LoRa framework, which enhances Qwen2.5-VL through three training
techniques: (1) dynamic parameter adaptation (LoRA) to improve performance and efficiency during
training; (2) multimodal feature fusion, which computes cross-modal attention to enhance
understanding across the visual and textual modalities; and (3) stable optimization using cosine annealing,
gradient clipping, and KL regularization. The model is trained on the EXAMS-V dataset and achieves
competitive results on the leaderboard.
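</p>
        <p>Two of the stabilization techniques named here, cosine annealing and gradient clipping, can be sketched in a few lines. This is a generic, framework-free illustration under assumed hyperparameters, not the team's training code; the function names are hypothetical, and KL regularization is not shown.</p>
        <preformat>
```python
import math


def cosine_annealing_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate that decays from lr_max to lr_min along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: list, max_norm: float) -> list:
    """Rescale a flat list of gradient values so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)


if __name__ == "__main__":
    # Assumed values, for illustration only.
    for step in (0, 50, 100):
        print(step, cosine_annealing_lr(step, total_steps=100, lr_max=1e-4))
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```
        </preformat>
        <p>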
yaozihang [50]
(Keywords: Qwen-VL, Prompt Engineering, Zero-shot)</p>
        <p>The authors propose an approach that utilizes the Qwen2.5-VL model in a zero-shot setting. They
craft a prompt that ensures consistent analysis of the input images, which contain the questions, and
yields a concise answer in the format required by the task. To ensure the validity and efficiency of the
experimental evaluation, the authors design and implement a pipeline comprising modules for
data organization and compression, API request construction, and exception handling.
ymgclef [51]
(Keywords: Multi-Prompt Ensemble, Qwen-VL-Plus, GPT-4.1-2025-04-14)</p>
        <p>The authors propose a zero-shot ensemble pipeline in which the final classification is obtained by
running model inference with three different types of prompt templates: Base Prompt,
Chain-of-Thought Prompt, and Role-Playing Prompt. The multi-prompt ensemble strategy enables the model to
focus on informational features at different levels of abstraction, which improves the model’s multimodal
reasoning capabilities. Their experiments on an open-source model, Qwen-VL-Plus, and a proprietary
model, GPT-4.1-2025-04-14, demonstrate the effectiveness of the multi-prompt strategy.</p>
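        <p>One simple way to combine the three per-prompt predictions is a majority vote, sketched below. The source does not specify the aggregation rule, so the voting scheme, the tie-breaking order, and all names here are assumptions made for illustration.</p>
        <preformat>
```python
from collections import Counter

# Hypothetical identifiers for the three prompt templates named in the text.
PROMPT_TYPES = ("base", "chain_of_thought", "role_playing")


def ensemble_answer(predictions: dict) -> str:
    """Majority vote over per-prompt answers.

    Ties are resolved in favor of the earliest prompt in PROMPT_TYPES
    (an assumed rule, chosen only to make the result deterministic).
    """
    votes = Counter(predictions[name] for name in PROMPT_TYPES)
    top_count = max(votes.values())
    return next(
        predictions[name]
        for name in PROMPT_TYPES
        if votes[predictions[name]] == top_count
    )


if __name__ == "__main__":
    print(ensemble_answer({"base": "A", "chain_of_thought": "B", "role_playing": "B"}))
    print(ensemble_answer({"base": "A", "chain_of_thought": "B", "role_playing": "C"}))
```
        </preformat>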
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>We presented an overview of the Multimodal Reasoning task, part of the ImageCLEF lab at CLEF 2025.
This task aimed to assess the reasoning abilities of large vision-language models when applied to
complex visual and textual examination content. Participants were provided with an image containing
a question and 3 to 5 answer choices. The goal was to select the single correct answer from the given
options.</p>
      <p>Most submissions employed zero-shot large multilingual vision-language models, primarily Qwen and
Gemini, often incorporating extensive prompt engineering. A smaller subset of submissions explored
few-shot learning and fine-tuning approaches, emphasizing model optimization. These approaches
demonstrated that models with fewer parameters can achieve performance levels comparable to their
larger counterparts. The top-performing systems were based on the proprietary Gemini model family,
achieving over 80% accuracy overall and surpassing 90% in some languages. Systems using the
open-source Qwen models ranked closely behind, suggesting that open-source approaches have significantly
reduced the performance gap with proprietary models, particularly in low-resource languages. However,
performance differences remain in favor of commercial models for high-resource languages.</p>
      <p>In the future, we plan to expand both the linguistic and visual complexity of the evaluation setting.
We plan to introduce additional languages, incorporate more diverse and challenging examination
materials, and include more complex visual elements. Furthermore, the task will be extended to cover
assessments from university-level examinations, broadening the scope and difficulty of the reasoning
challenges.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>The work of Dimitar Dimitrov and Ivan Koychev is partially funded by the EU NextGenerationEU
project, through the National Recovery and Resilience Plan of the Republic of Bulgaria, project SUMMIT,
No BG-RRP-2.004-0008.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors did not employ any Generative AI tools in the preparation of this manuscript.</p>
      <p>[3] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in:
IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA,
June 7-12, 2015, IEEE Computer Society, 2015, pp. 3156–3164. URL: https://doi.org/10.1109/CVPR.
2015.7298935. doi:10.1109/CVPR.2015.7298935.
[4] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show,
attend and tell: Neural image caption generation with visual attention, in: F. R. Bach, D. M. Blei
(Eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille,
France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015,
pp. 2048–2057. URL: http://proceedings.mlr.press/v37/xuc15.html.
[5] J. Lu, D. Batra, D. Parikh, S. Lee, Vilbert: Pretraining task-agnostic visiolinguistic representations
for vision-and-language tasks, in: H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc,
E. B. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32: Annual
Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019,
Vancouver, BC, Canada, 2019, pp. 13–23. URL: https://proceedings.neurips.cc/paper/2019/hash/
c74d97b01eae257e44aa9d5bade97baf-Abstract.html.
[6] H. Tan, M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers,
in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019,
pp. 5100–5111. URL: https://aclanthology.org/D19-1514. doi:10.18653/v1/D19-1514.
[7] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, Uniter: Universal image-text
representation learning, in: European conference on computer vision, Springer, 2020, pp. 104–120.
[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language
supervision, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference
on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of
Machine Learning Research, PMLR, 2021, pp. 8748–8763. URL: http://proceedings.mlr.press/v139/
radford21a.html.
[9] J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K.
Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M.
Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski,
R. Barreira, O. Vinyals, A. Zisserman, K. Simonyan, Flamingo: a visual language model
for few-shot learning, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh
(Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on
Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA,
November 28 - December 9, 2022, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/
960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html.
[10] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson,
Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke,
K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, P. Florence, Palm-e: An embodied
multimodal language model, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett
(Eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu,
Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 8469–8488.
URL: https://proceedings.mlr.press/v202/driess23a.html.
[11] L. Fan, W. Hua, X. Li, K. Zhu, M. Jin, L. Li, H. Ling, J. Chi, J. Wang, X. Ma, et al., Nphardeval4v: A
dynamic reasoning benchmark of multimodal large language models, ArXiv preprint abs/2403.01777
(2024). URL: https://arxiv.org/abs/2403.01777.
[12] K. Wang, J. Pan, W. Shi, Z. Lu, H. Ren, A. Zhou, M. Zhan, H. Li, Measuring multimodal
mathematical reasoning with math-vision dataset, in: A. Globersons, L. Mackey, D. Belgrave, A. Fan,
U. Paquet, J. M. Tomczak, C. Zhang (Eds.), Advances in Neural Information Processing Systems 38:
Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver,
BC, Canada, December 10 - 15, 2024, 2024. URL: http://papers.nips.cc/paper_files/paper/2024/hash/
[36] R. Ng, T. N. Nguyen, Y. Huang, N. C. Tai, W. Y. Leong, W. Q. Leong, X. Yong, J. G. Ngui, Y. Susanto,
N. Cheng, et al., Sea-lion: Southeast asian languages in one network, ArXiv preprint abs/2504.05747
(2025). URL: https://arxiv.org/abs/2504.05747.
[37] M. Togmanov, N. Mukhituly, D. Turmakhan, J. Mansurov, M. Goloburda, A. Sakip, Z. Xie, Y. Wang,
B. Syzdykov, N. Laiyk, et al., Kazmmlu: Evaluating language models on kazakh, russian, and
regional knowledge of kazakhstan, ArXiv preprint abs/2502.12829 (2025). URL: https://arxiv.org/
abs/2502.12829.
[38] R. C. Ng, N. Prakash, M. S. Hee, K. T. W. Choo, R. K.-w. Lee, SGHateCheck: Functional tests for
detecting hate speech in low-resource languages of Singapore, in: Y.-L. Chung, Z. Talat, D. Nozza,
F. M. Plaza-del Arco, P. Röttger, A. Mostafazadeh Davani, A. Calabrese (Eds.), Proceedings of
the 8th Workshop on Online Abuse and Harms (WOAH 2024), Association for Computational
Linguistics, Mexico City, Mexico, 2024, pp. 312–327. URL: https://aclanthology.org/2024.woah-1.24.
[39] S. Cahyawijaya, H. Lovenia, J. R. A. Moniz, T. H. Wong, M. R. Farhansyah, T. T. Maung, F. Hudi,
D. Anugraha, M. R. S. Habibi, M. R. Qorib, et al., Crowdsource, crawl, or generate? creating sea-vl,
a multicultural vision-language dataset for southeast asia, ArXiv preprint abs/2503.07920 (2025).
URL: https://arxiv.org/abs/2503.07920.
[40] R. Das, S. Hristov, H. Li, D. Dimitrov, I. Koychev, P. Nakov, EXAMS-V: A multi-discipline
multilingual multimodal exam benchmark for evaluating vision language models, in: L.-W. Ku,
A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Association for Computational
Linguistics, Bangkok, Thailand, 2024, pp. 7768–7791. URL: https://aclanthology.org/2024.acl-long.420/.
doi:10.18653/v1/2024.acl-long.420.
[41] A. Amjad, F. Seemab, S. Kausar, S. Latif, M. Fatima, ayeshaamjad at ImageCLEF 2025 Multimodal
Reasoning: visual question answering with structured data extraction and robust reasoning, in:
[52], 2025.
[42] Q. Wu, L. Kong, J. Yan, J. Li, Team bingezzzleep at ImageCLEF 2025 Multimodal Reasoning: a
multimodal feature alignment prompt-enhanced method for multimodal reasoning, in: [52], 2025.
[43] S. Deng, G. Niu, X. Yao, H. Mo, T. Li, S. Jiao, Bridging the modality gap through cot-enhanced
multimodal reasoning, in: [52], 2025.
[44] E. Tosheva, D. Dimitrov, I. Koychev, P. Nakov, Elenat at ImageCLEF 2025 Multimodal Reasoning:
Zero-shot reasoning with blip and smollm, in: [52], 2025.
[45] T. Srikumar, S. Kesavan, A. M B, D. Samuel, K. Maneesh Ram, G. E, V. K. Singh, L. Kalinathan,
Leveraging qwen2.5-vl-72b-instruct for visual question answering: A Study on the EXAMS-V
Benchmark in ImageCLEF 2025, in: [52], 2025.
[46] H. Mo, G. Niu, S. Deng, X. Yao, T. Li, S. Jiao, Multimodal Reasoning in Multilingual Visual Question
Answering: A prompt-tuned qwen2.5-vl-plus approach, in: [52], 2025.
[47] A. Seif, M. Younes, A. Moustafa, A. Allam, H. Moustafa, MSA at ImageCLEF 2025 Multimodal
Reasoning: multilingual multimodal reasoning with ensemble vision-language models, in: [52],
2025.
[48] B. Huang, C. Zhong, K. Yan, Team plutohbj at ImageCLEF 2025 Multimodal Reasoning:
metalearning lora fine-tuning for multimodal reasoning, in: [52], 2025.
[49] V. Krazheva, D. Markova, D. Dimitrov, I. Koychev, P. Nakov, ContextDrift at ImageCLEF 2025
Multimodal Reasoning: Evaluating vlms’ multimodal, multilingual and multidomain reasoning
capabilities via thinking budget variations and textual augmentation, in: [52], 2025.
[50] X. Yao, G. Niu, T. Li, H. Mo, S. Deng, S. Jiao, Enhancing Multilingual VQA with structured prompts
and vision-language alignment, in: [52], 2025.
[51] J. Yan, L. Kong, Q. Wu, J. Li, Multi-prompt ensemble reasoning for multimodal reasoning, in: [52],
2025.
[52] G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs
of the Evaluation Forum, CLEF 2025, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>VQA: visual question answering</article-title>
          ,
          <source>in: 2015 IEEE International Conference on Computer Vision</source>
          , ICCV 2015, Santiago, Chile, December 7-
          <issue>13</issue>
          ,
          <year>2015</year>
          , IEEE Computer Society,
          <year>2015</year>
          , pp.
          <fpage>2425</fpage>
          -
          <lpage>2433</lpage>
          . URL: https://doi.org/10.1109/ICCV.
          <year>2015</year>
          .
          <volume>279</volume>
          . doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2015</year>
          .
          <volume>279</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          , L. van der Maaten, L.
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>R. B.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning</article-title>
          ,
          <source>in: 2017 IEEE Conference on Computer Vision</source>
          and Pattern Recognition,
          <string-name>
            <surname>CVPR</surname>
          </string-name>
          <year>2017</year>
          ,
          <article-title>Honolulu</article-title>
          ,
          <string-name>
            <surname>HI</surname>
          </string-name>
          , USA, July
          <volume>21</volume>
          -
          <issue>26</issue>
          ,
          <year>2017</year>
          , IEEE Computer Society,
          <year>2017</year>
          , pp.
          <fpage>1988</fpage>
          -
          <lpage>1997</lpage>
          . URL: https://doi.org/10.1109/CVPR.
          <year>2017</year>
          .
          <volume>215</volume>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2017</year>
          .
          <volume>215</volume>
          .
          <string-name>
            <surname>ad0edc7d5fa1a783f063646968b7315b-Abstract-Datasets</surname>
          </string-name>
          _and_Benchmarks_Track.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          , K. Cheng, L.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
          </string-name>
          , W. Chao,
          <article-title>Mllm-compbench: A comparative reasoning benchmark for multimodal llms</article-title>
          , in: A.
          <string-name>
            <surname>Globersons</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Mackey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Belgrave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Paquet</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <string-name>
            <surname>Tomczak</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Zhang (Eds.),
          <source>Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems</source>
          <year>2024</year>
          , NeurIPS
          <year>2024</year>
          , Vancouver, BC, Canada,
          <source>December 10 - 15</source>
          ,
          <year>2024</year>
          ,
          <year>2024</year>
          . URL: http://papers.nips.cc/paper_files/paper/2024/hash/32923df09f75cf1974c145764a523e2-Abstract-Datasets_and_Benchmarks_Track.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dasigi</surname>
          </string-name>
          , G. Stanovsky,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Gardner, DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>2368</fpage>
          -
          <lpage>2378</lpage>
          . URL: https://aclanthology.org/N19-1246. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N19</fpage>
          -1246.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>HotpotQA: A dataset for diverse, explainable multi-hop question answering</article-title>
          , in: E.
          <string-name>
            <surname>Riloff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hockenmaier</surname>
          </string-name>
          , J. Tsujii (Eds.),
          <source>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>2369</fpage>
          -
          <lpage>2380</lpage>
          . URL: https://aclanthology.org/D18-1259. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D18</fpage>
          -1259.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Geva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          , E. Segal,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <article-title>Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies, Transactions of the Association for Computational Linguistics 9 (</article-title>
          <year>2021</year>
          )
          <fpage>346</fpage>
          -
          <lpage>361</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .tacl-
          <volume>1</volume>
          .21. doi:
          <volume>10</volume>
          .1162/tacl_a_
          <fpage>00370</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>Reclor: A reading comprehension dataset requiring logical reasoning</article-title>
          ,
          <source>in: 8th International Conference on Learning Representations, ICLR</source>
          <year>2020</year>
          ,
          <string-name>
            <given-names>Addis</given-names>
            <surname>Ababa</surname>
          </string-name>
          , Ethiopia,
          <source>April 26-30</source>
          ,
          <year>2020</year>
          , OpenReview.net,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=
          <fpage>HJgJtT4tvB</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>Logiqa: A challenge dataset for machine reading comprehension with logical reasoning</article-title>
          , in: C.
          <string-name>
            <surname>Bessiere</surname>
          </string-name>
          (Ed.),
          <source>Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI</source>
          <year>2020</year>
          ,
          <article-title>ijcai</article-title>
          .org,
          <year>2020</year>
          , pp.
          <fpage>3622</fpage>
          -
          <lpage>3628</lpage>
          . URL: https://doi.org/10.24963/ijcai.
          <year>2020</year>
          /501. doi:
          <volume>10</volume>
          .24963/ijcai.
          <year>2020</year>
          /501.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bavarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tworek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          , et al.,
          <article-title>Training verifiers to solve math word problems</article-title>
          ,
          <source>ArXiv preprint abs/2110</source>
          .14168 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2110.14168.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kadavath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Measuring mathematical problem solving with the MATH dataset</article-title>
          ,
          <source>ArXiv preprint abs/2103.03874</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2103.03874.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>O.</given-names>
            <surname>Choukrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Malek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Orel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Iklassov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Takáč</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lahlou</surname>
          </string-name>
          ,
          <article-title>LLM-BabyBench: Understanding and evaluating grounded planning and reasoning in LLMs</article-title>
          ,
          <source>ArXiv preprint abs/2505.12135</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2505.12135.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Smiley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Borova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Langdon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Moussa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Routledge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>FinQA: A dataset of numerical reasoning over financial data</article-title>
          , in:
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>3697</fpage>
          -
          <lpage>3711</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.300. doi:10.18653/v1/2021.emnlp-main.300.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sahnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thareja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madmoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          , et al.,
          <article-title>FinChain: A symbolic benchmark for verifiable chain-of-thought financial reasoning</article-title>
          ,
          <source>ArXiv preprint abs/2506.02515</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2506.02515.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
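          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,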
          <string-name>
            <given-names>S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI</article-title>
          , in:
          <source>IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>9556</fpage>
          -
          <lpage>9567</lpage>
          . URL: https://doi.org/10.1109/CVPR52733.2024.00913. doi:10.1109/CVPR52733.2024.00913.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalyan</surname>
          </string-name>
          ,
          <article-title>Learn to explain: Multimodal reasoning via thought chains for science question answering</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Koyejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022</source>
          ,
          <year>2022</year>
          . URL: http://papers.nips.cc/paper_files/paper/2022/hash/11332b6b6cf4485b84afadb1352d3a9a-Abstract-Conference.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
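          ,
          <string-name>
            <given-names>L.-D.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,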
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025: Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025)</source>
          , Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hristov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models</article-title>
          , in:
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7768</fpage>
          -
          <lpage>7791</lpage>
          . URL: https://aclanthology.org/2024.acl-long.420. doi:10.18653/v1/2024.acl-long.420.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>I.</given-names>
            <surname>Salazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Burda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Moakhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Farestam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Boiko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khullar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Kaleidoscope: In-language exams for massively multilingual vision evaluation</article-title>
          ,
          <source>ArXiv preprint abs/2504.07072</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2504.07072.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Evaluating multimodal generative AI with Korean educational standards</article-title>
          ,
          <source>ArXiv preprint abs/2502.15422</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2502.15422.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          , et al.,
          <article-title>MDK12-Bench: A multi-discipline benchmark for evaluating reasoning in multimodal large language models</article-title>
          ,
          <source>ArXiv preprint abs/2504.05782</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2504.05782.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Decoding the underlying meaning of multimodal hateful memes</article-title>
          , in:
          <source>Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China</source>
          , ijcai.org,
          <year>2023</year>
          , pp.
          <fpage>5995</fpage>
          -
          <lpage>6003</lpage>
          . URL: https://doi.org/10.24963/ijcai.2023/665. doi:10.24963/IJCAI.2023/665.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Recent advances in online hate speech moderation: Multimodality and the role of large models</article-title>
          ,
          <source>Findings of the Association for Computational Linguistics: EMNLP 2024</source>
          (
          <year>2024</year>
          )
          <fpage>4407</fpage>
          -
          <lpage>4419</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hasnat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Da San Martino</surname>
          </string-name>
          ,
          <article-title>SemEval-2024 task 4: Multilingual detection of persuasion techniques in memes</article-title>
          , in:
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Doğruöz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tayyar Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Da San Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosá</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)</source>
          , Association for Computational Linguistics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>2009</fpage>
          -
          <lpage>2026</lpage>
          . URL: https://aclanthology.org/2024.semeval-1.275.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Song</surname>
          </string-name>
          , et al.,
          <article-title>Sci-reason: A dataset with chain-of-thought rationales for complex multimodal reasoning in academic areas</article-title>
          ,
          <source>ArXiv preprint abs/2504.06637</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2504.06637.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Susanto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Hulagadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Montalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Ngui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Yong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rengarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Limkonchotiwat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tjhi</surname>
          </string-name>
          ,
          <article-title>SEA-HELM: Southeast Asian holistic evaluation of language models</article-title>
          ,
          <source>ArXiv preprint abs/2502.14301</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2502.14301.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>