<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Variations and Textual Augmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vasilena T. Krazheva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diana Markova</string-name>
          <email>dianamarkovakn@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitar I. Dimitrov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Koychev</string-name>
          <email>koychev@fmi.uni-so</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Preslav Nakov</string-name>
          <email>preslav.nakov@mbzuai.ac.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”</institution>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mohamed Bin Zayed University of Artificial Intelligence</institution>
          ,
          <addr-line>UAE</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the growing capabilities of vision-language models (VLMs), current systems achieve impressive performance on tasks requiring the integration of vision and language, such as image captioning, simple visual question answering, and visual dialogue. However, it is often claimed that these models fall short when deeper reasoning is required. In this paper, we investigate this claim through the ImageCLEF 2025 MultimodalReasoning task, which challenges models to solve multiple-choice questions in image format across a number of subjects and languages. Using Gemini 2.0 Flash and 2.5 Flash, we study the effect of reasoning capacity and budget, external textual transcription, and prompt design on the EXAMS-V benchmark for Bulgarian and English. Our results indicate that, contrary to expectation, VLMs can perform remarkably well on multimodal reasoning tasks in both languages. In particular, they are able to solve tasks in Physics and Science with an accuracy of over 80%. We identify thinking budget as the main contributing factor. Additionally, we demonstrate a setting where an unconstrained thinking budget might deteriorate performance in Biology and Chemistry. The submitted system ranked first on the English and Bulgarian leaderboards with respective accuracy scores of 89.65% and 90.50%.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Reasoning</kwd>
        <kwd>Vision-Language Model</kwd>
        <kwd>Gemini</kwd>
        <kwd>Optical Character Recognition</kwd>
        <kwd>Visual Question Answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent advances in VLMs have enabled new capabilities for solving problems that span both visual and
textual modalities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This multimodal reasoning ability is essential for a diverse set of applications such
as document question answering, educational tutoring systems, and embodied intelligence. However,
evaluating these systems remains challenging — especially when reasoning must occur across images,
diagrams, and multiple languages.
      </p>
      <p>
        Existing benchmarks for evaluating multimodal reasoning have provided valuable insights into the
capabilities of VLMs for a range of tasks. One such benchmark is Massive Multi-discipline Multimodal
Understanding and Reasoning (MMMU) [
        <xref ref-type="bibr" rid="ref15 ref2">2</xref>
        ], which consists of college-level exam questions, spanning a
wide range of academic subjects and fields including mathematics, science, the humanities, and the arts.
It has become the de facto benchmark for measuring the multimodal reasoning capabilities of VLMs.
      </p>
      <p>
        The ImageCLEF 2025 MultimodalReasoning lab [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] at the Conference and Labs of the Evaluation
Forum (CLEF) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] addresses VLM evaluation in a broader and more language-inclusive manner by
selecting the EXAMS-V dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for training and validation. It consists of 20,932 multiple-choice
questions covering 20 school subjects across 11 languages and incorporates multimodal features such
as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The competition also
includes a held-out test set of 3,565 new questions in three additional languages (Urdu, Kazakh, and
Spanish), introduced for the 2025 edition. Unlike other benchmarks, it provides a broader linguistic
scope and places a particular emphasis on lower-resource languages [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The task itself is defined
as follows: given an image containing a multiple-choice question with three to five answer options and
associated metadata, the objective is to identify the single correct answer.
      </p>
      <p>In this working notes paper, we investigate the performance of selected free-tier proprietary Gemini
vision-language models on the ImageCLEF MultimodalReasoning task. We explore the impact of
reasoning budget constraints, prompt design, and external Optical Character Recognition (OCR) textual
transcription. Focusing on English and Bulgarian, we analyze performance trends across subjects and
modalities. Our system ranked first on both English and Bulgarian leaderboards.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Early approaches to Visual Question Answering (VQA) typically relied on heavily engineered modular
architectures, where separate components handled image encoding, question interpretation, and answer
classication. For tasks involving text within images, systems incorporated OCR pipelines to extract
textual content, which was then fused with visual features for downstream reasoning [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Recent
advances in vision-language models have replaced such hand-crafted systems with unified
transformer-based architectures that jointly model vision and language.
      </p>
      <p>
        In general, multimodal models have multiple encoders (one per modality) and then fuse the embeddings
together to create a shared representation space; decoders operate over the shared latent space to
produce output in the desired modality [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Examples of such models include GPT-4o [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Claude 3.5
Sonnet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Gemini [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Qwen2-VL-7B [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], LLaVA [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and Gemma 3 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Large Language Models (LLMs), such as OpenAI’s o1 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], have dramatically improved performance on
increasingly complex tasks by scaling test-time computation during problem-solving. The combination
of multimodal capabilities and extended chain-of-thought fine-tuning and alignment has been recently
implemented in models such as QVQ-72B-Preview [15], Kimi-VL-A3B-Thinking [16], and Gemini 2.5
[17], achieving SOTA results on the MMMU benchmark [18].
      </p>
      <p>Prompt engineering has proven essential for extracting reasoning behavior from foundation models.
As demonstrated in GPT-3 [19], few-shot prompting can enable models to generalize with minimal
supervision in certain contexts. Furthermore, it has been shown that chain-of-thought prompting
improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks [20].</p>
      <p>Our system builds on these advances in the following ways:
We select multimodal Gemini models as the core of our VQA system. Specifically, Gemini 2.5 Flash
(thinking) serves as a primary inference component. Gemini 2.0 Flash (non-thinking) is employed in
two roles: (1) as a baseline and experimental playground, and (2) to assess whether external textual
transcription (via OCR) can enhance performance by better eliciting the model’s textual reasoning
capabilities. Prompt engineering strategies are also considered to optimize performance.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <sec id="sec-3-0">
        <title>3.1. Data</title>
        <p>The dataset provided for the MultimodalReasoning task is an expanded version of EXAMS-V [<xref ref-type="bibr" rid="ref5">5</xref>]. Each question is in image format and has corresponding metadata for language, subject, grade, and the presence of tables, figures, diagrams, and chemical structures. Due to temporal and computational limitations, participation and further analysis are restricted to two language subsets: English and Bulgarian were selected as representatives of a higher-resource and a lower-resource language, respectively.</p>
        <p>Table 1 shows the validation split count distribution of questions, grouped by language. With Figure refers to questions whose associated images contain a graphical element, while Text Only refers to questions whose image representations contain only text (type attribute).</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Methodology</title>
        <p>The main phases in our experimental workflow are data preparation, prompt engineering, model
querying, and output post-processing &amp; evaluation (Figure 1).</p>
        <sec id="sec-3-1-1">
          <title>3.2.1. Data Preparation</title>
          <p>Preprocessing involved merging the dataset parquet files and filtering only the languages of interest.
Additionally, for the Bulgarian validation set, answer keys were mapped to the corresponding unified
English letters. Textual content extraction was performed with two OCR engines, each supporting both
languages. The first, Tesseract OCR [21, 22], is an open-source OCR engine widely used in academic
research. The second, OCRSpace [23], provides a cloud-based OCR service, often used in applied
research.</p>
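<p>As a concrete sketch of this step, OCR extraction with Tesseract via the pytesseract wrapper might look as follows; the language-code mapping and function names are our own illustration, not part of the official pipeline:</p>

```python
def tesseract_lang(dataset_lang: str) -> str:
    """Map a dataset language code to a Tesseract language string.

    Both Bulgarian and English models are requested for Bulgarian
    questions, since exam images may mix Cyrillic text with
    Latin-letter formulas. (Illustrative mapping.)
    """
    mapping = {"en": "eng", "bg": "bul+eng"}
    return mapping.get(dataset_lang, "eng")


def transcribe(image_path: str, dataset_lang: str) -> str:
    """Run OCR over a question image (requires the tesseract binary)."""
    # Imported lazily so the mapping helper above stays usable
    # even where pytesseract / tesseract are not installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), lang=tesseract_lang(dataset_lang)
    )
```

The resulting transcription can then be spliced into the prompt or omitted, depending on the experiment configuration.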
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2.2. Prompt Design</title>
          <p>Prompt formatting can significantly affect the performance of evaluated models [24, 25]. Therefore,
two fundamentally different prompting approaches were undertaken:
• Approach 1: A handcrafted task-specific prompt was designed as recommended in [26],
utilizing the following techniques: role-play, step-by-step instructions (Chain of Thought), and
contextualization.
• Approach 2: A meta-prompting technique [27] was undertaken by instructing GPT-4o to generate
and improve a prompt. The final version adheres to the Structured Prompt Template [28] by
systematically organizing the prompt into distinct components: task introduction, task details,
output format, few-shot examples, and query. The prompt incorporates a structured input in JSON
format.</p>
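<p>The structured JSON input of Approach 2 can be sketched as follows; the field names mirror the Prompt 2 template reproduced in the Appendix, while the helper and the example values are illustrative:</p>

```python
import json


def build_structured_input(raw_ocr_text: str, subject: str, grade: int,
                           has_figure: bool, has_graph: bool,
                           language: str) -> str:
    """Assemble the JSON payload embedded in the Approach 2 prompt."""
    payload = {
        "raw_ocr_text": raw_ocr_text,
        "metadata": {
            "subject": subject,
            "grade": grade,
            "has_figure": has_figure,
            "has_graph": has_graph,
            "language": language,
        },
    }
    return json.dumps(payload, ensure_ascii=False)


# Example: a text-only Physics question (illustrative values).
example = build_structured_input(
    "Which unit measures force? A. joule B. newton C. watt D. pascal",
    subject="Physics", grade=10, has_figure=False, has_graph=False,
    language="en",
)
```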
          <p>Both prompt types integrate the question metadata provided in the dataset.</p>
          <p>We hypothesize that augmenting the prompt with OCR text transcription could better engage the
reasoning capabilities of the models and ensure focused problem understanding. To test this, prompts
with and without external transcription were formed. Also, to measure the effectiveness of in-context
model adaptation through samples, zero-shot and one-shot versions of the prompts were considered.
Prompt versions and templates can be found in the Prompt 1 and Prompt 2 subsections of the Appendix
section.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.2.3. Model Querying</title>
          <p>The specific release versions of Gemini 2.0 Flash and Gemini 2.5 Flash used are gemini-2.0-flash
and gemini-2.5-flash-preview-04-17. All experiments were conducted using the following default
configurations: temperature set to 1 and topP to 0.95. Gemini 2.5 Flash uses a default topK of 64, while
2.0 Flash defaults to a topK of 40.</p>
          <p>Experiments were run with limited and unconstrained thinking budget. The thinking budget
parameter guides the model on the number of thinking tokens it can use when generating a response
[29]. Higher values correspond to more detailed reasoning. By default, it is unconstrained; we denote
this option with the ∞ symbol.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.2.4. Output Post-processing &amp; Evaluation</title>
          <p>Task submission requires all answers to be one of the letters ’A’, ’B’, ’C’, ’D’ or ’E’. To ensure that the
submission files adhered to the requirements, the following post-processing steps were applied to the
models’ responses: (1) extraction of the answer letter if it was not readily provided, and (2) mapping
the answer symbol to one of the letters listed above when the model responded in the alphabet of the
question’s language. The official competition evaluation metric is accuracy.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments &amp; Results</title>
      <sec id="sec-4-1">
        <title>4.1. Pre-submission evaluations</title>
        <p>A limited number of evaluations were performed with and without external OCR textual transcription,
zero-shot and few-shot prompting, and different thinking budgets. Table 2 provides a summary of
pre-submission runs. The best two runs for each language are in bold.</p>
        <p>Since Gemini 2.0 Flash quota limitations were more favorable, and under the assumption that
effects would be sufficiently similar when using Gemini 2.5 Flash, experiments with and without OCR
augmentation were performed. OCR proved beneficial for Bulgarian with 2.0 Flash, contributing to a
10% increase in accuracy. Incorporating OCR data resulted in 95.25% accuracy for Bulgarian when using
2.5 Flash, and was thus used for submission runs.</p>
        <p>However, for English, we observed a slight decrease in performance for Gemini 2.5 Flash runs with a
1024 thinking budget, and therefore chose to refrain from adding external textual transcription for the
first submission run. These experiments were configured with a limited budget, originally motivated by
shorter processing times.</p>
        <p>Each last run per language in Table 2 was carried out with ∞ thinking budget (the Google GenAI
API defines thinkingBudget as an integer in the range 0 to 24576) and OCR augmentation, leading to
longer execution times and substantial performance gains on the English set. Specifically, the
accuracy on the English validation set increased from 57.92% to 78.09%, potentially highlighting the
impact the ∞ budget has on model performance.</p>
        <p>Due to nearing submission deadlines and quota restrictions, the best-performing systems were
selected based on these preliminary validation results. Later experiments were conducted to explore a
broader configuration space.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Official Submission Results</title>
        <p>The official results of the competition are presented in Table 3. Our system achieved first place on both
English and Bulgarian leaderboards with respective 89.65% and 90.50% accuracy scores.
*Since the authors participated with two different team names, which were subsequently united into one,
there are two submissions per subtask.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Post-submission investigation</title>
        <p>To better understand the pronounced difference in accuracy for the English subtask and the unexpected
consistency in accuracy for Bulgarian questions between the two approaches, we perform a series
of ablations and modifications to the experimental settings. Tables 4 and 5 in the Appendix present
validation accuracy for all experimental configurations.</p>
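<p>The per-configuration validation accuracy reported in those tables reduces to a simple grouped aggregate, sketched here with assumed record field names:</p>

```python
from collections import defaultdict


def accuracy_by(records: list[dict], key: str) -> dict:
    """Compute accuracy (in %) grouped by a record field.

    Each record is assumed to carry `key` (e.g., 'subject' or
    'config'), a 'prediction', and a 'gold' answer letter.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["gold"])
    return {k: 100.0 * correct[k] / total[k] for k in total}
```

The same helper, keyed on subject or modality, produces the breakdowns discussed in the following subsections.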
        <sec id="sec-4-3-1">
          <title>4.3.1. External OCR contribution</title>
          <p>OCR augmentation visibly improved accuracy in Gemini 2.0 Flash experiments for both languages.
However, it is unclear whether the addition of external textual transcription contributes to the
performance of 2.5 Flash on the validation sets.</p>
          <p>OCR external transcription boosted accuracy in Gemini 2.0 Flash (non-thinking) experiments for
Bulgarian (Table 4) by over 10% (experiments #15, #11, #7). For English (Table 5), we observe a smaller,
but marked max performance increase of 7.2% (experiments #11, #15, #8).</p>
          <p>Following further experiments for Bulgarian with Gemini 2.5 Flash, the highest accuracy achieved,
97.25%, was in experiment #1 (No OCR) (Table 4). Experiment runs #2 (No OCR) and #13 (OCRSpace),
with otherwise fixed settings, achieved corresponding scores of 97.00% and 96.75%, showing a decrease
of 0.25% when OCR transcription is included. In contrast, when transcription is added to the setup
of experiment run #3 (No OCR), we observe a max accuracy increase of 0.75% and a score of 97%
(experiment run #12 with OCRSpace). All other groups of experiments show a max OCR boost of less
than 0.75%. These fluctuations of 0.25-0.75% could be due to the generative nature of the model.</p>
          <p>Trends on the English validation set are inconclusive (Table 5). On the one hand, experiments #3 (No
OCR), #9 (Tesseract), #12 (OCRSpace) have accuracies of respectively 78.96%, 80.40%, 80.69%, indicating
a max score increase of 1.73%. On the other hand, experiments #1 (No OCR) and #13 (OCRSpace),
with accuracies of respectively 80.12% and 78.09%, show a decrease of 2.03%. This is also the case with
experiments #6 (No OCR) and #10 (Tesseract) – we observe reduction (although smaller) of 0.86%.
However, when using OCRSpace with the same settings (experiment #14), we note an increase of 1.45%.</p>
          <p>Overall, for Gemini 2.5 Flash experiments, OCR external transcription did not result in consistent
performance benefits. This could be attributed to Gemini 2.5 Flash’s superior visual understanding
capabilities, reducing reliance on external textual transcriptions.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Zero-Shot vs. Few-Shot prompting</title>
          <p>Few-shot prompting occasionally improved performance slightly, though as can be seen in Tables 4
and 5, results remain inconclusive.</p>
          <p>Namely, for English (Table 5), experiments #1 (One-Shot) and #2 (Zero-Shot) with respective accuracies
of 80.12% and 79.25% show a slight increase of 0.87%. However, experiments #3 (Zero-Shot) and #5
(One-Shot) with corresponding scores of 78.96% and 78.39%, demonstrate a small decrease of 0.57%.</p>
          <p>For Bulgarian (Table 4), experiments #1 (One-Shot) and #3 (Zero-Shot) with scores 97.25% and 96.25%
indicate an increase of 1%; experiments #2 (One-Shot) and #5 (Zero-Shot) follow the same trend with a
marginal difference of 1%.</p>
          <p>These minor differences may be due to the models already producing well-structured responses: in
all cases, answer key extraction was practically reduced to a simple regular expression over the last five
response characters.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Reasoning Budget Variations</title>
          <p>In light of the stark difference in accuracy for experiments on the English validation set of around 20%,
and the clearly noticeable, though smaller, difference on the test set of less than 10%, we hereby analyze
how thinking budget contributes to performance.</p>
          <p>In order to isolate the effect of other improvements, which might be attributed to adaptation or
external OCR text transcription, we conduct experiments with zero-shot prompting (Approach 1), no
OCR augmentation, and thinking budgets of 1024, 8192 and ∞ (Table 5; experiments #6, #4 and #3,
respectively). Corresponding scores achieved are 57.92%, 78.96% and 78.96%. Although the last two
thinking budget settings yielded higher overall accuracy, Figure 2 reveals a more complex dynamic.</p>
          <p>In the case of Physics and Science questions, the ∞ (unconstrained adaptive) configuration improved
performance in both modalities. Specifically, for problems in Physics, overall accuracy increased from
60% (thinking budget 1024) to 86% (thinking budget ∞). Similarly, overall Science accuracy
increased to 96% (thinking budget ∞), while for the 1024 thinking budget it was 53%. The higher
limited-budget configuration of 8192 resulted in accuracy values between those achieved in the 1024 and ∞
configurations. This also holds modality-wise (Figure 2).</p>
          <p>However, the aforementioned dependency between thinking budget value and accuracy does not
hold for Biology and Chemistry. In particular, for Biology questions, the score dropped from 81% with
1024 budget to 79% with ∞ budget, and reached its highest value of 85% when the parameter was set
to 8192. Interestingly, the reduction in accuracy was for With Figure questions (Figure 2). We observe
a similar trend for Chemistry, where overall accuracy peaked at 59% with 8192 thinking budget and
reduced to 55% when thinking was unconstrained. Moreover, the reduction in accuracy for Chemistry
was observed in both modalities (Figure 2).</p>
          <p>This non-monotonicity is likely related to the underthinking and overthinking phenomena, suggesting
the existence of an optimal reasoning length. While this effect has been previously studied in LLMs
[30, 31, 32, 33], the same pattern might naturally extend to multimodal reasoning tasks in VLMs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this working notes paper, we presented our results and analysis for the ImageCLEF 2025
MultimodalReasoning competition. By examining our pre-submission and post-submission experiments,
we conclude that dataset questions vary in complexity — language-wise, subject-wise, and split-wise.</p>
      <p>Regarding the proprietary models used, we found that reasoning capacity directly affects performance
on both English and Bulgarian subsets. Remarkably, we discovered that Gemini 2.5 Flash performs
better in the visual modality for certain subjects when the thinking budget is limited — potentially
indicating a failure to self-calibrate its chain-of-thought reasoning length relative to problem demands.
In addition, it is worth noting that external textual transcription substantially improved accuracy in 2.0
Flash experiments and occasionally resulted in slight increases in performance in 2.5 Flash experiments.</p>
      <p>Nevertheless, we acknowledge that our analysis did not include experiments across all languages
available in the competition dataset. As a result, it remains uncertain whether our findings generalize
to other languages and problem formulations. Future work could be directed at investigating this
extrapolation explicitly. Furthermore, response token-level metadata—such as prompt, thoughts, output,
and total token counts—can be tracked to examine their relation to correctness. Gemini’s thought
summaries option [34] could also be used to reveal the model’s internal problem-solving pathway for
dataset problems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The work is partially financed by the European Union-NextGenerationEU, through the National
Recovery and Resilience Plan of the Republic of Bulgaria, project SUMMIT, No BG-RRP-2.004-0008.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools in this work.</p>
      <p>[15] Qwen Team, QVQ: To See the World with Wisdom | Qwen, 2024. URL: https://qwenlm.github.io/blog/qvq-72b-preview/.
[16] Kimi Team, Kimi-VL Technical Report (2025). URL: https://arxiv.org/pdf/2504.07491.
[17] Google, Gemini 2.5 Flash Preview - Model Card (2025). URL: https://storage.googleapis.com/model-cards/documents/gemini-2.5-flash-preview.pdf.
[18] Y. Xiang, N. Yuansheng, Z. Kai, Z. Tianyu, MMMU Leaderboard, 2025. URL: https://mmmu-benchmark.github.io/.
[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Advances in Neural Information Processing Systems 35 (2022). URL: https://arxiv.org/pdf/2201.11903.
[21] R. Smith, An overview of the Tesseract OCR engine, Proceedings of the International Conference on Document Analysis and Recognition, ICDAR 2 (2007) 629–633. URL: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf. doi:10.1109/ICDAR.2007.4376991.
[22] GitHub, tesseract-ocr/tesseract: Tesseract open source OCR engine (main repository), 2025. URL: https://github.com/tesseract-ocr/tesseract.
[23] a9t9 software GmbH, Free OCR API V2025, Online OCR, Searchable PDF Creator and OCR Software, 2025. URL: https://ocr.space/.
[24] T. Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate Before Use: Improving Few-Shot Performance of Language Models, Proceedings of Machine Learning Research 139 (2021) 12697–12706. URL: https://arxiv.org/pdf/2102.09690.
[25] J. He, M. Rungta, D. Koleczek, A. Sekhon, F. X. Wang, S. Hasan, Does Prompt Formatting Have Any Impact on LLM Performance? (2024). URL: https://arxiv.org/pdf/2411.10541.
[26] H. He, M. Ye, J. Zhang, X. Cai, J. Liu, B. Du, D. Tao, Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues? (2025). URL: https://arxiv.org/pdf/2505.12766.
[27] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D. Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, P. Resnik, The Prompt Report: A Systematic Survey of Prompt Engineering Techniques (2024). URL: https://arxiv.org/pdf/2406.06608.
[28] Y. Liu, J. Xu, Li, L. Zhang, Q. Chen, X. Feng, Y. Chen, Z. Guo, Y. Yang, P. Cheng, Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization (2025). URL: https://arxiv.org/pdf/2502.04295.
[29] Google, Image understanding | Gemini API | Google AI for Developers, 2025. URL: https://ai.google.dev/gemini-api/docs/image-understanding.
[30] X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, R. Wang, Z. Tu, H. Mi, D. Yu, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs (2024). URL: https://arxiv.org/pdf/2412.21187.
[31] Y. Wu, Y. Wang, M. Csail, T. Du, S. Jegelka, T. U. Munich, Y. Wang, When More is Less: Understanding Chain-of-Thought Length in LLMs (2025). URL: https://arxiv.org/pdf/2502.07266.
[32] Y. Wang, Q. Liu, J. Xu, T. Liang, X. Chen, Z. He, L. Song, D. Yu, J. Li, Z. Zhang, R. Wang, Z. Tu, H. Mi, D. Yu, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs (2025). URL: https://arxiv.org/pdf/2501.18585.
[33] J. Su, J. Healey, P. Nakov, C. Cardie, Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs (2025). URL: https://arxiv.org/pdf/2505.00127. doi:10.48550/arXiv.2505.00127.
[34] Google, Gemini API | Google AI for Developers, 2025. URL: https://ai.google.dev/gemini-api/docs.</p>
      <p>OCR data is conditionally included according to the experiment cases. We note that, for few-shot
experiments with Prompt 1, the same example as in Prompt 2 is concatenated. Also, the content variable
is substituted and formatted based on the question metadata.</p>
      <sec id="sec-7-1">
        <title>Prompt 1 Template, English</title>
        <p>PROMPT_TEMPLATE_STRICTER = (
"You are a sophisticated Vision-Language Model (VLM) capable of analyzing images
containing multiple-choice questions."
" To guide your analysis, you may adopt the following process:\n"
"0. Consider the subject of question is {subject} and image contains {content}.\n"
"1. Image Analysis: Examine the image closely, identifying key elements such as text,
diagrams, and any other relevant features.\n"</p>
        <p>"-{ocr}\n"
"2. Question Text Extraction: Extract the text of the question\n"
"3. Extract Answer Choices: Identify and extract the answer choices provided in the image\n"
" - if the answer options are not enumerated with letters, do enumerate them with
letters (A, B, C, D, ...)\n"
"4. Look for additional visual elements such as tables, diagrams, charts, or graphs.\n"
"5. Ensure to consider any multilingual or multidomain aspects of the image, including text
in different languages or mathematical/physics/scientific notation.\n"
"6. Analyze the complete context and data provided\n"
"7. Select correct answer based solely on analysis.\n"
"8. Respond by only the corresponding letter (single capital letter) without any extra
explanation.\n"
"9. If the answer is not clear, still provide the best guess as single capital letter.\n\n"
"Always respond with a single capital letter (A, B, C, D, E) without any extra explanation."
)</p>
      </sec>
      <sec id="sec-7-2">
        <title>Prompt 1 Template, Bulgarian</title>
        <p>PROMPT_TEMPLATE_BUL_STRICTER = (
"Ти си комплексен Vision-Language модел (VLM) способен да анализира изображения,
съдържащи multiple-choice questions."
" В насочването на анализите си, подходи така:\n"
"0. Вземи предвид, че предметът на въпроса е свързан с {subject} и изображението
съдържа {content}.\n"
"-{ocr}\n"
"1. Анализ на изображение: Изследвай отблизо изображението, идентифицирай
ключови елементи като текст, диаграми, и всякакви други релевантни характеристики.\n"
"2. Извлечи текста, който представлява въпроса\n"
"3. Идентифицирай и извлечи опциите за отговор на въпроса \n"
" - Ако отговорите не са номерирани с букви, номерирай ги с български букви
(А, Б, В, Г, Д)\n"
"4. Потърси допълнителни визуални елементи, като таблици, диаграми, графики
или фигури.\n"
"5. Увери се, че вземаш предвид всички многоезични или многодоменни аспекти
на изображението, включително текст на различни езици или математическа/
физична/научна нотация.\n"
"6. Анализирай целия контекст и предоставените данни\n"
"7. Избери правилния отговор единствено въз основа на анализ.\n"
"8. Отговори само със съответната буква (една главна буква) без допълнителни
обяснения.\n"
"9. Ако отговорът не е ясен, все пак посочи най-доброто предположение с една
българска главна буква.\n\n"
"Винаги отговаряй с една българска буква без никакви допълнителни обяснения."
)</p>
      </sec>
      <sec id="sec-7-3">
        <title>Prompt 2, OCR included</title>
        <p>You are an expert at solving high-school multiple-choice questions.</p>
        <p>Each input will be a JSON object with the following fields:
- image: The question image (base64-encoded or attached).
- raw_ocr_text: OCR output for the full block including choices
- metadata: An object with:
• subject (e.g., "Biology", "Geometry")
• grade (9–12)
• has_figure (boolean)
• has_graph (boolean)
• language (e.g., "en", "bgn")
Your task is to:
1. Parse the "raw_ocr_text" to extract the question and its multiple-choice options.
2. All questions will have exactly 4 or 5 answer choices.
3. If labeled in a non-English alphabet (e.g., а., б., в., г., д. in Bulgarian),
map them to the Latin letters A, B, C, D, E.
4. Select the single best answer choice based on your expert knowledge.
5. Output only the corresponding uppercase Latin letter: A, B, C, D, or E.
Do NOT include any explanation, translation output, punctuation, or additional text.
Only return the final answer as a single uppercase letter.</p>
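<p>Step 3 of the task above (mapping non-English option labels to Latin letters) can be sketched as follows. The а→A … д→E correspondence is the one stated in the prompt; the function name and label-cleaning details are illustrative assumptions.</p>

```python
# Map Cyrillic answer-choice labels (as used in Bulgarian exams) to the
# uppercase Latin letters A-E that the model is instructed to output.
# The correspondence below is the one implied by the prompt above.
CYR_TO_LAT = {"а": "A", "б": "B", "в": "C", "г": "D", "д": "E"}

def normalize_label(label: str) -> str:
    """Return the uppercase Latin letter for a choice label such as 'б.' or 'B'."""
    letter = label.strip().rstrip(".").lower()
    return CYR_TO_LAT.get(letter, letter.upper())
```
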
        <p>Example Input:
{
"raw_ocr_text": "A cyclist pedals with constant power P. Which expression gives her
speed v? A. P/mg B. (P/mg)^{1/3} C. P/(mg)^{1/2} D. (P/mg)^{2} ",
"metadata": {
"subject": "Physics",
"grade": 10,
"has_figure": false,
"has_graph": false,
"language": "en"
}
}</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] M. Noyan, S. Paniego, A. R. Gosthipaty, Vision Language Models (Better, Faster, Stronger), 2025. URL: https://huggingface.co/blog/vlms-2025.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[2] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, W. Chen, MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, 2023. URL: https://arxiv.org/pdf/2311.16502. doi:10.1109/CVPR52733.2024.00913.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>[3] D. Dimitrov, M. S. Hee, Z. Xie, R. Joyti Das, M. Ahsan, S. Ahmad, N. Paev, I. Koychev, P. Nakov, Overview of ImageCLEF 2025 - Multimodal Reasoning, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>[4] B. Ionescu, H. Müller, D.-C. Stanciu, A.-G. Andrei, A. Radzhabov, Y. Prokopchuk, L.-D. Ştefan, M.-G. Constantin, M. Dogariu, V. Kovalev, H. Damm, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, C. M. Friedrich, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. G. Pakull, B. Bracke, O. Pelka, B. Eryilmaz, H. Becker, W.-W. Yim, N. Codella, R. A. Novoa, J. Malvehy, D. Dimitrov, R. J. Das, Z. Xie, H. M. Shan, P. Nakov, I. Koychev, S. A. Hicks, S. Gautam, M. A. Riegler, V. Thambawita, P. Halvorsen, D. Fabre, C. Macaire, B. Lecouteux, D. Schwab, M. Potthast, M. Heinrich, J. Kiesel, M. Wolter, B. Stein, Overview of ImageCLEF 2025: Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025), Springer Lecture Notes in Computer Science LNCS, Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>[5] R. Das, S. Hristov, H. Li, D. Dimitrov, I. Koychev, P. Nakov, EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7768-7791. URL: https://aclanthology.org/2024.acl-long.420. doi:10.18653/v1/2024.acl-long.420.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>[6] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, M. Rohrbach, Towards VQA Models That Can Read, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2019) 8309-8318. URL: https://arxiv.org/pdf/1904.08920. doi:10.1109/CVPR.2019.00851.</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>[7] D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti, ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, 2020. URL: https://arxiv.org/pdf/2001.07966.</mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>[8] OpenAI, GPT-4o System Card, 2024. URL: https://arxiv.org/pdf/2410.21276.</mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>[9] Anthropic, Claude 3.5 Sonnet Model Card Addendum, 2024. URL: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf.</mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>[10] Google, Gemini 2.0 Flash - Model Card, 2025. URL: https://storage.googleapis.com/model-cards/documents/gemini-2-flash.pdf.</mixed-citation>
      </ref>
      <ref id="ref11">
<mixed-citation>[11] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, J. Lin, Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, 2024. URL: https://arxiv.org/pdf/2409.12191.</mixed-citation>
      </ref>
      <ref id="ref12">
<mixed-citation>[12] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual Instruction Tuning, Advances in Neural Information Processing Systems 36 (2023). URL: https://arxiv.org/pdf/2304.08485.</mixed-citation>
      </ref>
      <ref id="ref13">
<mixed-citation>[13] Gemma Team, Gemma 3 Technical Report, 2025. URL: https://arxiv.org/pdf/2503.19786.</mixed-citation>
      </ref>
      <ref id="ref14">
<mixed-citation>[14] OpenAI, OpenAI o1 System Card, 2024. URL: https://cdn.openai.com/o1-system-card-20241205.pdf.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>