<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ayeshaamjad at ImageCLEF 2025 Multimodal Reasoning: Visual Question Answering with Structured Data Extraction and Robust Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ayesha Amjad</string-name>
          <email>aamjad.msai24seecs@seecs.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatima Seemab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saima Kausar</string-name>
          <email>skausar.msai24seecs@seecs.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seemab Latif</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehwish Fatima</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University of Sciences and Technology</institution>
          ,
          <addr-line>Islamabad</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Visual Question Answering (VQA) is a powerful tool for evaluating the generalization and reasoning capabilities of artificial intelligence (AI) models in educational contexts. However, VQA poses a variety of challenges, including diverse visual elements and a wide range of question types. This work proposes a multimodal visual question answering framework (MVQF) for exam-style VQA that combines Gemini's structured data extraction with DeepSeek's robust reasoning capabilities, aiming to overcome the persistent challenges in multilingual, multimodal question answering that prior studies have collectively identified. The framework targets the EXAMS-V 2025 challenge in English, Arabic, and Chinese. Our model handles the dataset's diverse visuals and multimodal demands; we compare qualitative results with alternative models and provide an in-depth analysis of performance by subject and visual element.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Visual Question Answering (VQA) has emerged as a critical multimodal task that combines visual and
textual information to generate precise answers to queries grounded in images. This capability mimics
human-like cognitive reasoning and has transformative potential in domains including autonomous
systems, medical diagnostics, and educational tools. This is demonstrated by the ImageCLEF Multimodal
Visual Question Answering task, which tests systems’ ability to analyze exam-style images with text,
diagrams, or tables and provide precise answers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Its impact lies in advancing AI’s ability to handle
real-world scenarios, such as analyzing medical images or educational content, enhancing
decision-making and accessibility.
      </p>
      <p>
        The intricacy of merging visual and linguistic data makes VQA assignments extremely difficult. Robust
visual processing is necessary to extract pertinent features from a variety of image elements, such as
schematics, labeled arrows, or scientific notations. Current systems often have trouble processing noisy
or insufficient visual input, generalizing across a variety of domains, and making appropriate decisions
without overfitting to particular datasets. In high-stakes real-world applications, these difficulties
restrict the dependability of VQA systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Recent VQA systems leverage deep learning architectures such as transformer-based models for
language understanding and convolutional or vision transformer (ViT) backbones for image feature
extraction. Architectures such as ViLBERT, LXMERT, and CLIP have demonstrated efficacy on
general-purpose datasets by aligning visual and textual embeddings in a shared space. However, these models
often struggle with domain-specific challenges like those presented in exam-style VQA. Limitations
include poor generalization to structured visuals (e.g., tables, schematics), inadequate handling of sparse
or noisy input, and overreliance on large-scale pretraining that fails to transfer well to specialized
datasets such as MBZUAI/EXAMS-V [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
      </p>
      <p>To address these deficiencies, we propose a modular VQA pipeline tailored for the ImageCLEF
2025 challenge. Our approach integrates Gemini, a state-of-the-art visual extraction model capable
of structured parsing of complex image elements, with DeepSeek, a reasoning engine designed for
logical inference over structured data. Gemini processes each image to produce a JSON representation
encapsulating textual elements, labels, arrows, diagrams, and tabular data. This structured intermediate
representation enables DeepSeek to perform targeted reasoning and generate well-informed responses
with minimal hallucination or contextual drift.</p>
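<p>To make the handoff between the two stages concrete, the following is a minimal, hypothetical sketch of such a JSON intermediate representation together with a sanity check; field names other than question_text and diagram_caption (which the paper names) are illustrative assumptions, not our exact schema.</p>

```python
import json

# Hypothetical example of the JSON intermediate representation produced by
# the visual-extraction stage for one exam image. Fields other than
# question_text and diagram_caption are illustrative assumptions.
extracted = {
    "question_text": "Which gas is released during photosynthesis?",
    "options": {"A": "Carbon dioxide", "B": "Oxygen",
                "C": "Nitrogen", "D": "Hydrogen"},
    "diagram_caption": "Leaf diagram with labeled arrows showing gas exchange",
    "table_data": [],   # rows of any table detected in the image
    "language": "en",   # detected language of the question
}

def validate_extraction(record: dict) -> bool:
    """Check that a record carries the fields the reasoning stage needs
    before it is serialized and handed over."""
    return bool(record.get("question_text")) and len(record.get("options", {})) >= 2

serialized = json.dumps(extracted, ensure_ascii=False)
```

<p>Passing a structured record like this, rather than raw OCR text, is what lets the reasoning model operate on clearly delimited fields.</p>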
      <sec id="sec-1-1">
        <title>Contributions</title>
        <p>
          This work makes the following contributions to the field of multimodal question answering:
• Development of a novel two-stage VQA pipeline that combines Gemini for structured visual
data extraction with DeepSeek for reasoning, significantly improving performance on exam-style
images.
• Integration of a language-filtering preprocessing module to isolate English-language
samples in the MBZUAI/EXAMS-V dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], enhancing the precision of multilingual benchmarking.
• Robust handling of complex image structures such as tables, labels, and multilingual
annotations through structured JSON outputs, facilitating more accurate downstream reasoning.
• Demonstration of domain generalizability and scalability in educational VQA applications,
addressing key limitations of prior VQA systems in structured visual environments.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Multimodal question answering (QA) has progressed from basic image-question pairs toward more
advanced tasks involving complex visual structures, diverse subject matter, and multilingual text. Early
benchmarks such as the VQA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] laid the foundation for this field by introducing image-question-answer
triplets for natural scene understanding. While influential, VQA lacked structural diversity and was not
suited for reasoning tasks involving diagrams, charts, or domain-specific educational content.
      </p>
      <p>
        To address this limitation, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced annotated science diagrams that required models to
reason over labeled structures. The initial models used rule-based parsing and handcrafted matching
techniques, but these approaches were noisy and lacked scalability. This shift toward layout-sensitive
visual tasks was expanded by Infographic VQA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which introduced visually complex content like charts, posters, and data tables. Pipelines for this
dataset often combined OCR tools (such as Tesseract) with transformer-based models like BERT and T5.
However, these hybrid models, while slightly better, still struggled with numeric reasoning and aligning
textual elements accurately.
      </p>
      <p>
        Lu et al. introduced ScienceQA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which combined images, questions, and answers with supporting
explanations. They used chain-of-thought (CoT) prompting, leveraging models like UnifiedQA and
GPT-3. This led to a noticeable performance boost in few-shot and zero-shot tasks. However, ScienceQA’s
reliance on clean visual input meant that performance dropped sharply in the presence of noisy diagrams
or poor OCR, emphasizing the fragility of current approaches.
      </p>
      <p>
        While ScienceQA emphasized reasoning through CoT in English science education, M3Exam expanded
the scope toward multilinguality and script diversity. In an attempt to broaden the multilingual scope
of multimodal QA, Liang et al. presented M3Exam [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a benchmark that spanned nine languages. This
paper tested popular models like GPT-4, ChatGPT-3.5, Claude, and Google Bard. Results revealed that
these models significantly underperformed on scripts like Arabic, Thai, and Hindi due to unreliable OCR
outputs from tools like Google Vision and PaddleOCR, leading to reasoning failures and hallucinations.
      </p>
      <p>
        Zhang et al. proposed MMMU [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], targeting domain-diverse QA spanning 30 academic subjects. They
evaluated 14 vision-language models, including BLIP-2, LLaVA, and GPT-4V. Despite GPT-4V leading
with a 55.7% score, the models underperformed on structured visual inputs like diagrams or flowcharts,
underscoring limitations in spatial reasoning.
      </p>
      <p>
        Yue et al. addressed the issue of shortcut learning in MMMU-Pro [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where all content was
embedded into images and answer locations were randomized. Models were forced to rely on visual
layout understanding rather than text matching. Despite this, top models like GPT-4V and Claude
achieved less than 30% accuracy, revealing a clear gap in layout parsing and robust visual reasoning. Li
et al. tackled math-specific challenges in MathVista [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a dataset with over 6,000 visual math questions.
Despite integrating symbolic solvers like Mathpix and SymPy with GPT-4 and CoT prompting, models
only achieved approximately 34% accuracy. Errors often arose from misreading axes, legends, or interpreting
geometric relationships, again pointing to weak visual grounding.
      </p>
      <p>
        Das et al. culminated these efforts with EXAMS-V [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a multilingual, multimodal benchmark featuring
24,000 questions across 13 languages and domains. The models evaluated, including GPT-4V, Claude,
Gemini, and Bard, performed poorly on layout-heavy and noisy visual inputs. Despite leveraging
large-scale models and prompt engineering, they struggled with low-resource languages, diagram
parsing, and multi-hop reasoning.
      </p>
      <p>These findings highlight that despite architectural scaling, vision-language models fall short in
layout reasoning, multilingual robustness, and structural interpretation. Table 1 provides a
comparative overview of major multimodal QA benchmarks, the nature of QA tasks, modeling approaches,
performance trends, and persistent limitations in layout reasoning and multilingual robustness.</p>
      <p>In conclusion, while recent models and prompting methods have improved, current vision-language
systems still face major challenges in understanding structured visuals and supporting multiple
languages. These weaknesses affect their ability to answer questions accurately in real-world educational
settings. To overcome these issues, we propose a modular architecture that separates visual parsing
from multilingual reasoning. This approach improves the understanding of the layout, makes error
diagnosis easier, and leads to more reliable and transparent question answering in diverse visual and
language formats.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        The EXAMS-V dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], hosted on Hugging Face, is a comprehensive multimodal and multilingual
benchmark designed for the ImageCLEF Visual Question Answering task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is based on actual
high school exam questions from several nations and represents a range of curricula and educational
systems. The EXAMS-V dataset comprises 20,932 samples across 20 subjects, covering grades 4–12,
and includes 11 languages from 7 language families, making it a diverse resource for multimodal and
multilingual assessment of large language models (LLMs) and vision-language models (VLMs). Key
highlights include:
      </p>
      <p>- Language Diversity: The dataset features high-resource languages (e.g., English: 724 questions,
Chinese: 2,635 questions) and low-resource languages (e.g., Bulgarian: 2,132 questions, Croatian: 3,969
questions, Serbian: 1,434 questions). It spans Germanic, Slavic, Romance, Sino-Tibetan, Semitic, and
Finno-Ugric language families, with Arabic introducing right-to-left script. This diversity supports
evaluating closely related languages and multilingual capabilities.</p>
      <p>- Parallel Questions: The dataset includes parallel question sets for Croatian exams in Serbian
(1,207 questions) and Italian (1,147 questions), and for Arabic exams in English (262 questions across
Science, Physics, Chemistry, and Biology), enabling cross-lingual analysis.</p>
      <p>- Subject Diversity: Initially, 83 subjects were collected, but after aggregation to address naming
inconsistencies, 20 subjects were grouped into three categories: Natural Sciences (53.02%), Social
Sciences (27.15%), and Others (Applied Studies, Arts, Religion, etc., 19.82%).</p>
      <p>- Question Complexity: Questions, primarily from high school exams, vary by subject. Natural
Sciences (e.g., Physics, Chemistry, Biology, Mathematics) require foundational knowledge and complex
reasoning. Geography and History demand region-specific knowledge. The Polish section includes
55 professional exam questions across fields like accounting and motor vehicle services, requiring
precise professional understanding.</p>
      <p>- Question Types: The dataset includes both multimodal (visual)
questions (e.g., 700 in Croatian, 1,991 in Chinese) and text-only questions (e.g., 3,269 in Croatian, 644 in
Chinese), with varying distributions across languages.</p>
      <p>The dataset was divided into training (66%), validation (20%), and test (14%) sets to facilitate
model learning, hyperparameter tuning, and objective evaluation. The dataset contains no duplicates
and no missing details, as it was pre-processed by the organizers. The dataset was filtered to separate
samples in the English, Arabic, and Chinese languages. The multilingual, multimodal, and diverse subject
coverage of this dataset makes it ideal for testing the reasoning and generalizability of AI models in
educational contexts.</p>
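<p>As a rough, illustrative sketch of this preprocessing step (language filtering followed by a 66/20/14 split), assuming each sample carries a language column; the column names and split mechanics below are our assumptions, not the organizers' code.</p>

```python
import pandas as pd

def filter_and_split(df: pd.DataFrame, seed: int = 0):
    """Keep only English, Arabic, and Chinese samples, then split them
    66% / 20% / 14% into train / validation / test."""
    kept = df[df["language"].isin(["English", "Arabic", "Chinese"])]
    kept = kept.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train, n_val = int(0.66 * len(kept)), int(0.20 * len(kept))
    return (kept.iloc[:n_train],
            kept.iloc[n_train:n_train + n_val],
            kept.iloc[n_train + n_val:])

# Toy frame standing in for the EXAMS-V metadata table.
toy = pd.DataFrame({
    "language": ["English", "Arabic", "Chinese", "Croatian", "English"] * 20,
    "subject": ["Physics"] * 100,
})
train, val, test = filter_and_split(toy)  # 52 / 16 / 12 rows
```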
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Methodology</title>
      <p>We propose a Multimodal Visual Question Answering Framework (MVQF) for multilingual exam-style
VQA. The framework targets the English, Arabic, and Chinese subsets of the ImageCLEF 2025 challenge.
At a high level, our framework operates in two main stages.</p>
      <p>1. An Image Description Module using Gemini-1.5 Flash for structured content extraction from
exam-style question images.
2. An Answer Generation Module powered by DeepSeek-R1-Distill-LLaMA for reasoning and answer
selection.</p>
      <sec id="sec-4-1">
        <title>4.1. Prompt-Guided Visual Decomposition</title>
        <p>This module uses a zero-shot high-precision instructional prompt which directs Gemini-1.5 Flash to
decompose images into structured output fields (e.g., question_text, diagram_caption) while suppressing
reasoning. This separation ensures modularity and avoids any bias from early reasoning. We design
the prompt to handle multilingual inputs. Gemini correctly identifies the language of the question
and processes it accordingly. The output is returned in a JSON format that encodes all relevant visual
information in structured form.</p>
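<p>A minimal sketch of this extraction step follows; the prompt wording and the fence-stripping heuristic are illustrative assumptions (the live google.generativeai call is shown only as a comment), since models frequently wrap JSON replies in Markdown code fences.</p>

```python
import json

# Illustrative instructional prompt; the production prompt is longer and
# explicitly suppresses reasoning.
EXTRACTION_PROMPT = (
    "Describe this exam image as JSON with the fields question_text, "
    "options, and diagram_caption. Do not answer or explain the question."
)

FENCE = "`" * 3  # Markdown code-fence marker the model may wrap replies in

def parse_gemini_json(reply: str) -> dict:
    """Strip an optional Markdown code fence before decoding the JSON."""
    text = reply.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1].rsplit(FENCE, 1)[0]
    return json.loads(text)

# The live call would look roughly like (not executed here):
#   model = google.generativeai.GenerativeModel("gemini-1.5-flash")
#   reply = model.generate_content([EXTRACTION_PROMPT, image]).text
canned = (FENCE + 'json\n{"question_text": "2 + 2 = ?", '
          '"options": {"A": "3", "B": "4"}, "diagram_caption": ""}\n' + FENCE)
record = parse_gemini_json(canned)
```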
        <p>[Figure: MVQF pipeline — EXAMS-V dataset (multimodal questions) → preprocessing (filtering English, Arabic, and Chinese samples) → image understanding → structured extraction → reasoning model.]</p>
        <p>Prompt: “You are given a multiple-choice question extracted from an exam. The question is: {caption}. Identify the question and all answer options (even if there are more than four), and any relevant data related to graphs or tables. Choose the correct answer and reply with just the letter of the correct option, no explanation.”</p>
        <p>Our prompt: “You are answering a multiple-choice question. Return **only** the correct uppercase letter (A, B, C, D, etc). Do not explain. Do not write any reasoning. Do not add punctuation or extra text. Respond with only one letter. Example 1: Question: Which number is even? Options: A. 3 B. 5 C. 8 D. 7. Correct Answer: C. Example 2: Question: What is the capital of Japan? Options: A. Seoul B. Beijing C. Tokyo D. Bangkok. Correct Answer: C. Now answer this: {extracted_text} Correct Answer:”</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Answer Generation Module</title>
        <p>The Answer Generation Module takes the extracted content and selects the correct option. We use the
DS-R1-Distill-LLaMA 70B model for this task. We construct a Strict Single-Letter Extraction (SSLE)
Prompt that embeds the question and its options. The model is instructed to return only the correct
option letter. We prevent it from generating explanations or additional text.</p>
        <p>This minimalistic format ensures consistency across languages. The model reads the question context
and reasons over the textual description, producing a single-letter answer.</p>
        <p>A key strength of MVQF is its language independence. No separate models or prompt translations
were needed for:
• English
• Arabic
• Chinese
The same pipeline and prompts were used across all languages, and Gemini and DeepSeek handled
multilingual input natively. This makes our solution scalable and easily extensible to additional languages
without retraining or fine-tuning.</p>
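<p>The SSLE prompt construction and the strict single-letter post-check described above can be sketched as follows; build_ssle_prompt and extract_letter are illustrative helper names, not part of our released code.</p>

```python
import re
from typing import Optional

def build_ssle_prompt(question: str, options: dict) -> str:
    """Assemble a Strict Single-Letter Extraction (SSLE) style prompt:
    the question, its options, and an answer-only instruction."""
    opts = " ".join(f"{k}. {v}" for k, v in sorted(options.items()))
    return (
        "You are answering a multiple-choice question.\n"
        "Return only the correct uppercase letter. Do not explain.\n"
        f"Question: {question}\nOptions: {opts}\nCorrect Answer:"
    )

def extract_letter(reply: str, valid: str = "ABCDEFGH") -> Optional[str]:
    """Even with answer-only prompting, models occasionally emit stray
    tokens; keep only the first standalone valid option letter."""
    match = re.search(rf"\b([{valid}])\b", reply.strip())
    return match.group(1) if match else None

prompt = build_ssle_prompt("Which number is even?",
                           {"A": "3", "B": "5", "C": "8", "D": "7"})
```

<p>Because the post-check accepts only a standalone option letter, the same validation works unchanged for English, Arabic, and Chinese questions, whose answer labels are Latin letters throughout the dataset.</p>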
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Design</title>
      <p>
        This section presents our evaluation strategy, model configurations, and experimental setup for the
EXAMS-V benchmark [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We broke the reasoning pipeline into intermediate steps and performed
experiments separately for each stage. This helped us pinpoint performance bottlenecks, whether they
originated from visual parsing (OCR or captioning) or from the reasoning model itself. Due to hardware
limitations, all evaluations were conducted on curated subsets of the dataset.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Models</title>
        <p>5.1.1. English
We explored two system types: multimodal models and modular pipelines. Multimodal models attempt
end-to-end question answering directly from images. Modular pipelines, on the other hand, separate
the process into two stages: visual content extraction (via OCR or captioning), followed by text-based
reasoning. This setup allowed us to diagnose weaknesses at each stage more effectively. Our experiments
span three EXAMS-V languages: English, Arabic, and Chinese.</p>
        <p>The English subset of EXAMS-V included a variety of question formats such as tables, graphs, labeled
diagrams, and multi-step reasoning questions. The questions were chosen to test both visual perception
and logical reasoning. We tested several combinations:
• Multimodal Models: Mistral LLaVA was tested on 50 English MCQs to assess reasoning directly
from images. DeepSeek-VL was also considered, but could not be executed due to its extremely
high memory requirements (&gt;80GB VRAM).
• Visual Parsing: We evaluated three tools. Tesseract OCR was used for basic text extraction
on clean layouts but performed poorly on rotated, dense, or complex visuals. BLIP, used for
generating image captions, struggled with detailed or scientific content and often missed key
elements. In contrast, Gemini 1.5 Flash provided layout-aware OCR and structured parsing, and
was ultimately selected as the preferred parser for its ability to accurately retain spatial structure
and handle diverse visual formats across languages.
• Modular Pipelines:
– Gemini + DeepSeek-R1 Distill LLaMA (Proposed): Gemini extracted structured content
from images; DeepSeek-R1 handled reasoning.
– Gemini + Mistral-7B: Used for a simpler pipeline pairing Gemini with Mistral-7B.
– BLIP + Mistral-7B: Used BLIP captions as input to the reasoning model.</p>
        <p>Gemini + DeepSeek-R1 was selected as the final approach for English due to its consistent performance
and better structural handling.
5.1.2. Arabic
The Arabic subset of EXAMS-V posed unique challenges such as right-to-left formatting, variable fonts,
and missing answer labels. These issues made parsing and reasoning more difficult.</p>
        <p>Experiments performed:
• Multimodal Models: Qwen-VL was tested on a small number of Arabic MCQs. It supports
Arabic input and multilingual reasoning but could only be evaluated in small batches due to high
memory demands.
• Visual Parsing: We evaluated three tools. Tesseract OCR was applied for Arabic text extraction
but showed poor support for right-to-left formatting and frequently reversed question structure
or answer order. BLIP, used in Arabic captioning mode, often dropped key scientific terms and
failed to generate coherent sentence structure. Gemini 1.5 Flash demonstrated stronger right-to-left
alignment, more accurate sentence preservation, and domain-aware parsing, making it
more reliable for downstream reasoning.
• Modular Pipeline:
– Gemini + DeepSeek-R1 Distill LLaMA (Proposed): Gemini was used to extract right-to-left-aware
structured text, while DeepSeek-R1 handled reasoning with prompt adjustments
to maintain sentence clarity.</p>
        <p>Gemini + DeepSeek-R1 was selected as the final approach for its improved layout parsing and consistent
reasoning performance.
5.1.3. Chinese
The Chinese subset of EXAMS-V presented the greatest difficulty across all languages. These MCQs are
based on Gaokao exams and feature complex diagrams, scientific plots, mathematical tables, and
domain-specific terminology. The visual complexity and abstract reasoning required make them particularly
challenging for vision-language models. We tested several configurations:
• Multimodal Models: Qwen-VL was evaluated on a small subset due to high computational
requirements. While it supports Chinese input and demonstrated strong multilingual capabilities,
we could not scale its evaluation across the full test set.
• Visual Parsing: Gemini 1.5 Flash preserved structural layout more effectively, extracted
numerical information reliably, and was better suited to the visual density of Chinese MCQs.
• Modular Pipeline:
– Gemini + DeepSeek-R1 Distill LLaMA (Proposed): Gemini was used for structured
OCR and layout parsing, while DeepSeek-R1 served as the reasoning engine for handling
scientific and numeric logic.</p>
        <p>Gemini + DeepSeek-R1 was selected as the final approach for Chinese due to its stability, structural
accuracy, and robustness in handling visually complex, domain-specific content.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experimental Setup</title>
        <p>We run our experiments using a combination of APIs, inference servers, and cloud-based notebooks.
Our final approach integrates two independent modules for parsing and reasoning.
• Visual Parsing: We use Gemini 1.5 Flash (https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/1-5-flash)
via Google Cloud Vertex AI for OCR and layout-aware captioning. Gemini receives image inputs
and extracts structured textual content from visual questions. The model was used with default
generation parameters: temperature = 1.0, top_k = 40, top_p = 0.95,
max_output_tokens = 1024, and no specified seed (non-deterministic behavior).
• Reasoning: The extracted text is passed to DeepSeek-R1 Distill (LLaMA-70B)
(https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B), which serves as
our main reasoning engine. We use Groq Inference Servers (https://groq.com/) to execute this model with
low-latency inference. The prompt strategy follows a few-shot, answer-only format for consistency
and control. We explicitly set temperature = 0 to ensure deterministic answers. Other
hyperparameters were left at default.
• Environment: All processing was conducted in Google Colab (https://colab.research.google.com/).
Data handling and batch operations were implemented using pandas, tqdm, and Python’s built-in
json module. For image loading and preprocessing, we used the PIL.Image module.
Prompt formatting and Gemini interaction were handled using the official google.generativeai SDK
(https://pypi.org/project/google-generativeai/). DeepSeek-R1 was accessed via Groq Inference Servers.
Additional runtime utilities such as inspect.signature were used to validate prompt structure
and automate input formatting.</p>
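<p>The batch loop over extracted questions can be sketched as below; the DeepSeek call is stubbed out so the sketch runs offline (the real request would go through the Groq client with temperature=0, as described above), and the column names are illustrative.</p>

```python
import pandas as pd
from tqdm import tqdm

def answer_with_deepseek(prompt: str) -> str:
    """Stub for the Groq-hosted DeepSeek-R1 Distill call. The real request
    would set temperature=0 for deterministic answers; here we return a
    fixed letter so the sketch runs offline."""
    return "A"

def run_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Answer every extracted question in the frame, with a progress bar."""
    rows = []
    for _, row in tqdm(df.iterrows(), total=len(df), desc="answering"):
        prompt = f"{row['extracted_text']}\nCorrect Answer:"
        rows.append({"id": row["id"], "prediction": answer_with_deepseek(prompt)})
    return pd.DataFrame(rows)

batch = pd.DataFrame({"id": [1, 2], "extracted_text": ["Q1 ...", "Q2 ..."]})
preds = run_batch(batch)
```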
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation Criteria</title>
        <p>
          We use accuracy as the primary metric, following the official ImageCLEF 2025 evaluation protocol
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Our main pipeline, Gemini (visual parsing) plus DeepSeek-R1 Distill (reasoning), is submitted to
the full multilingual test set. Accuracy is computed using the official leaderboard, which includes 14
sub-leaderboards: one per language and one for overall multilingual performance.
        </p>
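<p>Accuracy and the per-language sub-leaderboard scores reduce to simple exact-match counting over option letters; a minimal sketch (the helper names are illustrative):</p>

```python
from collections import defaultdict

def accuracy(preds, golds):
    """Fraction of exact matches between predicted and gold option letters."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def per_language_accuracy(records):
    """records: list of (language, prediction, gold) triples. Returns one
    accuracy per language plus an overall multilingual score, mirroring
    the per-language sub-leaderboards plus the overall one."""
    by_lang = defaultdict(list)
    for lang, pred, gold in records:
        by_lang[lang].append((pred, gold))
    scores = {lang: accuracy(*zip(*pairs)) for lang, pairs in by_lang.items()}
    scores["overall"] = accuracy([p for _, p, _ in records],
                                 [g for _, _, g in records])
    return scores

demo = [("English", "A", "A"), ("English", "B", "C"), ("Arabic", "D", "D")]
scores = per_language_accuracy(demo)
```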
        <p>All other models and pipeline variants are evaluated on small, curated subsets of 20–50 MCQs per
language. These tests are exploratory, aimed at understanding model behavior rather than producing
benchmark scores. We focus on specific challenges such as layout parsing, symbolic reasoning, and
cross-language variability.</p>
        <p>While only one system is evaluated at scale, these targeted experiments provide valuable insights
into where models succeed or fail—and why. Together, they offer a broader perspective on current
limitations in multimodal reasoning across diverse visual and linguistic formats.</p>
        <p>All models are tested in zero-shot or few-shot settings, without any fine-tuning.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Analysis</title>
      <p>We evaluate the Gemini + DeepSeek-R1-Distill-Llama-70B model on the MBZUAI/EXAMS-V dataset,
tackling English, Chinese, and Arabic MCQs. With overall test accuracies of 0.8125 (English), 0.6560
(Chinese), and 0.4775 (Arabic), our model excels in structured, science-based tasks but faces hurdles with
complex visuals and cultural nuances. We begin with qualitative insights into our model’s performance,
followed by language-specific results, a detailed comparison with alternative models, and an in-depth
analysis of subject and visual element performance. Tables 3, 7, and 8 summarize findings, while Figures
2 and 3 visualize trends.</p>
      <sec id="sec-6-1">
        <sec id="sec-6-1-1">
          <title>6.1. Qualitative Insights into Our Model</title>
          <p>Our model, blending Gemini’s multimodal extraction with DeepSeek’s robust reasoning, navigates the
EXAMS-V dataset’s diverse visuals and multilingual demands like a seasoned scholar. Table 3 captures
its strengths, weaknesses, error patterns, and vivid scenarios, framing its performance across languages.</p>
          <p>English thrives on science-heavy questions, leveraging robust training. Chinese excels in logical
tasks like math but stumbles on Chemistry notation. Arabic’s prowess in tables contrasts with graph
struggles, likely from script challenges. These insights guide our comparison with other models and
our detailed analysis.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.2. Results by Language</title>
          <p>6.2.1. English
We analyze performance by language, integrating overall accuracies and baseline comparisons.
Performance Overview: Our model achieves 0.8125, tripling the baseline 0.2701. Physics scores 0.8125,
with visual elements at 0.8125 (text), 0.625 (graphs), and 0.4468 (figures). (chemical structures — not in
English test dataset).</p>
          <p>Qualitative Insights: Gemini’s extraction excels in text and graphs, likely answering a Physics
question on projectile motion. Figures (0.4468) pose challenges, possibly misreading dense biology
annotations. The weak visual processing of the baseline fails on the graphs, unlike the precision of our
model. Errors include overcomplicating simple MCQs, but structured tasks shine.
6.2.2. Chinese
Performance Overview: Chinese scores 0.6560, doubling the baseline’s 0.2678. Subject-wise results
show 0.7714 (Math), 0.73 (Biology), 0.65 (Physics), and 0.4706 (Chemistry). Visual elements are 0.6560
(text), 0.5397 (figures), 0.5122 (graphs), 0.5 (tables), and 0.4348 (chemical structures).</p>
          <p>Qualitative Insights: DeepSeek’s reasoning drives Math and Biology success, solving table-based
algebra or figure-based questions. Chemistry (0.4706) and chemical structures (0.4348) lag, likely from
Gemini’s notation misreads. The baseline’s visual limitations contrast with our model’s versatility.
Errors stem from character recognition issues, but logical tasks excel.
6.2.3. Arabic
Performance Overview: Arabic achieves 0.4775, nearly doubling the baseline’s 0.2703. Subjects
include 0.65 (Chemistry), 0.5 (Math), 0.4868 (Physics), and 0.4028 (Biology). Visual elements show 0.8889
(tables), 0.875 (chemical structures), 0.4775 (text), 0.4567 (figures), and 0.2703 (graphs).</p>
          <p>Qualitative Insights: Gemini’s extraction excels in tables and chemical structures, mastering
Chemistry questions. Graphs (0.2703) and Biology (0.4028) struggle, likely from script misalignment or
cultural gaps. The baseline’s poor visual handling falls short. Errors arise from graph misreads, but
structured formats thrive.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>6.3. Qualitative Comparison with Alternative Models</title>
          <p>To underscore why Gemini + DeepSeek outperforms alternatives, we compare qualitative performance
on EXAMS-V’s multilingual, visually complex MCQs.</p>
          <p>Table 7 details the strengths, limitations and scenarios of the few-shot and small-batch testing,
revealing the edge of our model.</p>
          <p>Our model’s synergy tackles EXAMS-V’s challenges—multilingual text, graphs, tables, and chemical
structures—more effectively than alternatives. Mistral’s spatial reasoning falters, misinterpreting graph
layouts (e.g., a Physics slope), while our model excels (English graphs: 0.625). LLaVA struggles with
dense figures, like biology diagrams, unlike our model’s moderate success (Chinese figures: 0.5397).
Qwen-VL’s resource demands slow it on structured visuals, such as Arabic tables, where our model shines
(0.8889). DeepSeek-VL’s hardware requirements make it impractical, unlike our model’s accessibility.
Gemini + Mistral-7B’s shallow reasoning fails complex MCQs, whereas DeepSeek’s logic drives success
(Chinese Math: 0.7714). These contrasts highlight our model’s adaptability to EXAMS-V’s visual and
linguistic complexity, though Arabic graphs and Chinese notations need refinement.</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>6.4. Performance by Subjects and Visual Elements</title>
          <p>Qualitative Insights: English’s Physics dominance (0.8125) stems from Gemini’s graph extraction,
likely acing questions on projectile motion or energy plots, where clear axes and labels align with
training data. The low figure score (0.4468) reflects struggles with dense annotations, as in biology
diagrams with overlapping labels, possibly due to Gemini’s OCR limitations. Chinese’s Math (0.7714)
and Biology (0.73) strengths showcase DeepSeek’s logical reasoning, excelling at table-based algebra or
figure-based genetics questions. However, Chemistry (0.4706) and chemical structures (0.4348) suffer
from Gemini’s misreads of complex notations, like mistaking a benzene ring’s bonds, reflecting limited
training on chemical symbols. Arabic’s table (0.8889) and chemical structure (0.875) success highlights
Gemini’s structured data handling, mastering reaction tables in Chemistry. Graphs (0.2703) and Biology
(0.4028) lag, likely from right-to-left script misalignment (e.g., misreading a mechanics graph’s axes)
or cultural gaps in Biology (e.g., unfamiliar terminology). These patterns contrast with the baseline’s
uniform visual struggles and the alternatives’ limitations (e.g., Qwen-VL’s slow table processing). Figures
2 and 3 (subject performance across languages) vividly illustrate these trends, guiding future improvements
like script handling for Arabic graphs.</p>
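The per-subject and per-visual-element scores discussed above are plain grouped accuracies over the prediction log. A minimal sketch of that aggregation, assuming hypothetical record fields (`subject`, `pred`, `gold`) rather than the paper's actual data schema:

```python
from collections import defaultdict

def grouped_accuracy(records, key):
    """Compute accuracy per group (e.g., per subject or per visual element).

    Each record is a dict with a grouping field plus the predicted and
    gold answer letters; field names here are illustrative.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        if r["pred"] == r["gold"]:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Toy log: two Physics questions (one correct), one Math question (correct).
records = [
    {"subject": "Physics", "pred": "A", "gold": "A"},
    {"subject": "Physics", "pred": "B", "gold": "C"},
    {"subject": "Math",    "pred": "D", "gold": "D"},
]
print(grouped_accuracy(records, "subject"))  # {'Physics': 0.5, 'Math': 1.0}
```

The same pass with `key="visual_element"` would yield the table/figure/graph breakdowns reported in this section.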
        </sec>
        <sec id="sec-6-1-5">
          <title>6.5. Why Our Model Outperforms the Baseline and Alternatives</title>
          <p>Our model’s superiority arises from Gemini’s multimodal extraction and DeepSeek’s reasoning, tailored
to EXAMS-V’s challenges:
• Baseline: Its low accuracies (0.2701–0.2703) reflect basic text processing, failing on visuals like
graphs. Our model triples English accuracy (0.8125) and doubles Chinese (0.6560) and Arabic
(0.4775), excelling in Physics (English: 0.8125) and Arabic tables (0.8889), like decoding a motion
graph.
• Alternatives: Table 7 shows Mistral/LLaVA’s spatial weaknesses, Qwen-VL’s inefficiency,
DeepSeek R1’s hallucination, DeepSeek-VL’s impracticality, and Gemini + Mistral-7B’s
shallow reasoning. Our model’s efficient pipeline and robust logic handle Chinese Math (0.7714) and
Arabic chemical structures (0.875) effectively.</p>
          <p>This synergy minimizes errors, unlike the baseline’s broad failures or alternatives’ specific limitations.</p>
        </sec>
        <sec id="sec-6-1-6">
          <title>6.6. Overarching Insights</title>
          <p>Our model, like a multilingual science scholar, excels in structured tasks but needs coaching for complex
visuals and cultural nuances. Key takeaways from Tables 3, 7, and 8, and Figures 2 and 3:
• Strengths: Dominates Physics (English: 0.8125), Math (Chinese: 0.7714), and Arabic tables
(0.8889). Gemini’s extraction and DeepSeek’s reasoning outperform alternatives.
• Weaknesses: Struggles with Arabic graphs (0.2703), Chinese Chemistry (0.4706), and figures
(e.g., English: 0.4468). Script and cultural gaps (e.g., Arabic Biology: 0.4028) pose challenges.
• Error Patterns: OCR misreads (e.g., Arabic script, Chinese notations) and reasoning gaps (e.g.,
cultural knowledge) drive errors, like misreading a molecular structure.
• Improvements: Fine-tune Gemini for script and notation handling; expand DeepSeek’s chemistry
and humanities training.</p>
          <p>These insights highlight our model’s adaptability and growth areas.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <sec id="sec-6-2-1">
          <title>6.7. Example Output</title>
          <p>Figure 4 shows an example of the model’s output for a sample exam-style image, demonstrating the
text extracted by Gemini and the answer predicted by DeepSeek.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this work, we introduced a modular Visual Question Answering (VQA) system designed for
exam-style questions with complex visuals and multiple languages. Our approach uses Gemini to extract
structured information from images and DeepSeek-R1-Distill-LLaMA to reason over that data and
select the correct answer. This setup worked well on the EXAMS-V 2025 dataset, especially in subjects
like Physics and Math, and handled English, Arabic, and Chinese without needing language-specific
changes.</p>
      <p>Our results show that the system performs better than existing models in understanding tables,
diagrams, and multilingual questions. However, it still faces challenges with dense visuals (like Biology
diagrams), right-to-left scripts in Arabic graphs, and special symbols in Chemistry.</p>
      <p>Overall, this work shows that using a modular setup for visual question answering can be more
flexible, accurate, and easier to improve. Future work can focus on improving image extraction for
tricky visuals, adding more support for different languages, and making the system faster and lighter
for real-world use.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors utilized tools such as ChatGPT and Grammarly to
assist with grammar and spelling checks, as well as paraphrasing and rewording. All content generated
or modified using these tools was subsequently reviewed and edited by the authors, who accept full
responsibility for the final content of the publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.-M. Drăgulinescu</surname>
            ,
            <given-names>W.-W.</given-names>
          </string-name>
          <string-name>
            <surname>Yim</surname>
            ,
            <given-names>A. Ben</given-names>
          </string-name>
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Snider</surname>
            , G. Adams,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Yetisgen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Rückert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Bloch</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Brüngel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Idrissi-Yaghir</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schäfer</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Storås</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Papachrysos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schöler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Jha</surname>
            ,
            <given-names>A.-G.</given-names>
          </string-name>
          <string-name>
            <surname>Andrei</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Coman</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radzhabov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Prokopchuk</surname>
          </string-name>
          , L.
          <string-name>
            <surname>-D. Ştefan</surname>
            ,
            <given-names>M.</given-names>
            -G. Constantin, M.
          </string-name>
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Deshayes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Popescu</surname>
          </string-name>
          ,
          <article-title>Overview of the imageclef 2023: Multimedia retrieval in medical, social media and internet applications</article-title>
          , in: A.
          <string-name>
            <surname>Arampatzis</surname>
            , E. Kanoulas,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Tsikrika</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Vrochidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Giachanou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Aliannejadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Vlachos</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>370</fpage>
          -
          <lpage>396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kembhavi</surname>
          </string-name>
          ,
          <article-title>Don't just assume; look and answer: Overcoming priors for visual question answering</article-title>
          ,
          <source>CoRR abs/1712.00377</source>
          (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1712.00377. arXiv:1712.00377.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>I. Koychev</given-names>
          </string-name>
          , P. Nakov,
          <string-name>
            <surname>EXAMS-V:</surname>
          </string-name>
          <article-title>A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7768</fpage>
          -
          <lpage>7791</lpage>
          . URL: https://aclanthology.org/2024.acl-long.420. doi:10.18653/v1/2024.acl-long.420.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>VQA: Visual question answering</article-title>
          , ICCV (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kembhavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salvato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>A diagram is worth a dozen images</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bagal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Valveny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Jawahar</surname>
          </string-name>
          , Infographicvqa, in: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV),
          <year>2022</year>
          , pp.
          <fpage>2582</fpage>
          -
          <lpage>2591</lpage>
          . doi:10.1109/WACV51458.2022.00264.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalyan</surname>
          </string-name>
          ,
          <article-title>Learn to explain: Multimodal reasoning via thought chains for science question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2209.09513</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Aljunied</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Chia</surname>
          </string-name>
          , L. Bing,
          <article-title>M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2306.05179</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , T. Zheng, R. Liu, G. Zhang,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          , W. Chen,
          <article-title>Mmmu: a massive multi-discipline multimodal understanding and reasoning benchmark for expert agi</article-title>
          ,
          <source>arXiv preprint arXiv:2311.16502</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Tong,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          , G. Zhang, H. Sun,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Neubig,
          <article-title>Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark</article-title>
          ,
          <source>arXiv preprint arXiv:2409.02813</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xia</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          , H. Cheng, K.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts</article-title>
          , https://arxiv.org/abs/2310.02255 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>I. Koychev</given-names>
          </string-name>
          , P. Nakov,
          <string-name>
            <surname>EXAMS-V:</surname>
          </string-name>
          <article-title>A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models</article-title>
          ,
          <source>ACL</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          , M. S. Hee, Z. Xie, R. Jyoti Das, M. Ahsan, S. Ahmad, N. Paev, I. Koychev, P. Nakov,
          <article-title>Overview of ImageCLEF 2025 - multimodal reasoning</article-title>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>