<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Zero-Shot Reasoning with BLIP and SmolLM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elena Tosheva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitar Dimitrov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Koychev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Preslav Nakov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mohamed bin Zayed University of Artificial Intelligence</institution>
          ,
          <addr-line>UAE</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sofia University "St. Kliment Ohridski"</institution>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article was developed as part of the ImageCLEF 2025 competition. We adapted the BLIP-Base image-captioning model for the Multimodal Reasoning task, integrating the SmolLM-360M model for question answering and training on the MBZUAI EXAMS-V dataset (16,724 training and 4,208 validation examples). We then conducted a prompt-ablation study using three different templates to evaluate their impact on answer-key accuracy, measured by case-insensitive substring matching against the correct option within the provided set of three to five answers. Finally, we analyzed the distributions of generated caption lengths.</p>
      </abstract>
      <kwd-group>
        <kwd>MultiModal</kwd>
        <kwd>ImageCLEF 2025</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>MultiModal Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Modern vision–language systems combine powerful image encoders with autoregressive text decoders
to perform tasks such as image captioning, visual question answering, and multimodal reasoning. Early
works such as CLIP demonstrated that contrastive pretraining on large-scale image–text pairs yields
embeddings that transfer well to downstream classification and retrieval tasks. Building on this, BLIP
introduced a dual objective of contrastive alignment and generative captioning, producing models such
as Salesforce/blip-image-captioning-base and -large that achieve state-of-the-art results
on COCO and other benchmarks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        At the same time, recent advances in compact causal language models (with less than 500 M
parameters) show that mid-scale Transformers can deliver strong generative performance under tight compute
budgets. SmolLM-360M [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is one such model, featuring 24 layers, rotary positional embeddings, and
optimized training for inference on a single 12 GB GPU. Prompt engineering has emerged as a simple
yet effective way to steer generative models toward desired behaviors. In vision–language captioning,
prepending a short instruction (e.g., “A picture of ...”) can influence both the style and content of the
generated description. Understanding how prompt phrasing affects downstream tasks, such as
extracting multiple-choice answers from generated captions, is critical for reliable deployment in real-world
settings.
      </p>
      <p>
        The ImageCLEF 2025 Multimodal Reasoning task challenges systems to select the correct answer
from 3–5 provided options, given an image of a science exam question, covering topics from chemistry
to physics, across multiple languages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The publicly released MBZUAI EXAMS-V dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provides
16,724 training and 4,208 validation examples, each consisting of a question image, an answer key
drawn from three to five options, and associated metadata. In our study, we leverage this dataset to evaluate how BLIP-based
captioning and prompt variations impact the ability of an LLM-powered pipeline to recover the correct
answer via simple substring matching.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          We use the MBZUAI EXAMS-V dataset [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which consists of 16,724 training and 4,208 validation
science exam questions in image format. Each image comes with a 3–5-option multiple-choice question
and associated metadata. Importantly, the dataset spans multiple languages, and in our experiments,
we utilize all available languages to evaluate model robustness across multilingual contexts.
        </p>
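        <p>For concreteness, a hypothetical loading sketch is shown below; the local file names and record
layout are our assumptions for illustration, not the official distribution format:</p>
        <preformat>
# Hypothetical loading sketch for EXAMS-V (paths and field names are
# placeholders, not the official distribution format).
from datasets import load_dataset  # Hugging Face "datasets" library

ds = load_dataset("parquet", data_files={"train": "exams_v_train.parquet",
                                         "validation": "exams_v_val.parquet"})
print(len(ds["train"]), len(ds["validation"]))  # expected: 16724 and 4208
sample = ds["validation"][0]  # one record: question image + 3-5 answer options + metadata
</preformat>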
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Captioning Pipeline</title>
        <p>To generate image captions, we use the following encoder-decoder models:
• BLIP-Base (Salesforce/blip-image-captioning-base)
• BLIP-Large (Salesforce/blip-image-captioning-large, CLIP-ViT-L/14 backbone)
These encoders extract visual features from the exam images and decode them into textual descriptions.
The captions are later used as inputs for question-answering via a language model.</p>
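        <p>A minimal sketch of this captioning step with the Hugging Face transformers API follows; the
generation settings and the input file name are illustrative assumptions, not our exact configuration:</p>
        <preformat>
# Sketch: unconditional BLIP captioning (settings are illustrative).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"  # or the -large variant
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("question.png").convert("RGB")  # placeholder exam image
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
</preformat>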
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompt Ablation</title>
        <p>We assess how prompt phrasing affects caption content and downstream accuracy. Each image is
paired with one of three prompt templates (a sketch of how they are applied follows the list):
• None: the image is passed without additional text.
• "A picture of": encourages concise, object-focused captions.
• "Describe what you see:": encourages detailed, descriptive captions.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Model</title>
        <p>
          To perform reasoning over our generated image captions, we employ a lightweight yet powerful
language model:
• SmolLM-360M: a compact Transformer-based language model optimized for low-resource
inference. With only 360 million parameters and efficient deployment on hardware with as little
as 12 GB of GPU memory, it enables practical experimentation without sacrificing performance.
Despite its small size, SmolLM-360M is currently the best-performing model in its category
(sub-500M parameters). According to the Hugging Face benchmark [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], it outperforms
other similarly sized models, including MobileLLM-350M and Qwen2-500M, across a range
of benchmarks that test general knowledge, commonsense reasoning, and reading comprehension.
We use a zero-shot setup: SmolLM-360M is not fine-tuned. Given a caption produced
by BLIP, we prompt the model as follows:
        </p>
        <preformat>[CAPTIONED QUESTION]
{caption}
Choose the correct answer from the following options: A, B, C, D, E.
Answer:</preformat>
        <p>This zero-shot approach allows us to simulate realistic, low-resource deployment conditions
while assessing how well the model can reason over image-derived text alone.</p>
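        <p>A minimal sketch of this zero-shot reasoning step follows, assuming the standard transformers
causal-LM API; the decoding settings (greedy decoding, max_new_tokens=5) are illustrative choices,
not our exact configuration:</p>
        <preformat>
# Sketch: zero-shot answering with SmolLM-360M over a BLIP caption.
from transformers import AutoModelForCausalLM, AutoTokenizer

lm_id = "HuggingFaceTB/SmolLM-360M"
tokenizer = AutoTokenizer.from_pretrained(lm_id)
lm = AutoModelForCausalLM.from_pretrained(lm_id)

def answer_from_caption(caption):
    prompt = ("[CAPTIONED QUESTION]\n"
              f"{caption}\n"
              "Choose the correct answer from the following options: A, B, C, D, E.\n"
              "Answer:")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = lm.generate(input_ids, max_new_tokens=5)
    # Keep only the newly generated continuation after the prompt.
    new_tokens = output_ids[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
</preformat>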
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Evaluation</title>
        <p>Answer-key accuracy is the percentage of validation samples whose generated caption contains the
correct option letter (A–E) as a standalone token. Formally:</p>
        <p>accuracy = 1 ∑︁ ⊮[token() ∈ tokens()] × 100%.</p>
        <p>=1
Here  = 4208 for the validation split. We also report the distribution of caption token lengths.</p>
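        <p>A minimal sketch of this metric, assuming a simple regular-expression word tokenizer (an
assumption on our part; any tokenizer that isolates single letters would serve equally well):</p>
        <preformat>
import re

def answer_key_accuracy(generated_texts, gold_letters):
    # Case-insensitive match of the gold option letter (A-E) as a
    # standalone token, per the definition above.
    hits = 0
    for text, gold in zip(generated_texts, gold_letters):
        tokens = re.findall(r"[A-Za-z]+", text)
        if gold.upper() in (t.upper() for t in tokens):
            hits += 1
    return 100.0 * hits / len(generated_texts)

# e.g. answer_key_accuracy(captions, keys) over the 4,208 validation samples
</preformat>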
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>All experiments are run on Google Colab Pro’s T4 GPU.</title>
        <sec id="sec-4-1-1">
          <title>4.1. Prompt-Ablation Results</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.2. Oficial Submission Results</title>
          <p>
            Our system was officially evaluated as part of the ImageCLEF 2025 Multimodal Reasoning task [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ],
where we participated under the team name elenat. We submitted a single, zero-shot pipeline that used
BLIP-based captioning and the compact SmolLM-360M model for reasoning.
          </p>
          <p>Unlike many participating systems that focused on individual languages, we ran our model on the
entire multilingual test set, which includes science exam questions in multiple languages such as
English, Bulgarian, Arabic, and others. This multilingual setup allowed us to evaluate the generalization
capabilities of our lightweight models across a diverse range of inputs. Importantly, our official
submission used the bare image input without any additional prompt text for captioning—i.e.,
we did not prepend instructions like “A picture of” or “Describe what you see:”. This minimal setup
demonstrates the capability of our pipeline to extract useful semantic information from images alone.</p>
          <p>We achieved the following official results:</p>
          <p>• English: placed 11th with 25.20% accuracy
• Bulgarian: placed 6th with 23.50% accuracy
• Multilingual: placed 10th with 21.88% accuracy</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>Our experiments reveal two overarching trends:</title>
        <p>In other words, our strongest relative showing was in the Bulgarian track (rank 6), even though the
absolute highest accuracy was in the English track.</p>
      <p>• Caption Conciseness Correlates with Accuracy. Across both BLIP-Base and BLIP-Large, the
shortest outputs consistently yield the best match against the answer key. For BLIP-Base, omitting
any leading prompt (“None”) produces the briefest captions (11.4 tokens) and delivers the highest
accuracy (22.0%). Likewise, for BLIP-Large, the “Describe what you see:” template—despite being
wordier than no prompt—actually results in the most concise captions (12.7 tokens) of the three
setups and achieves the top performance (22.0%).
• Prompt Wording Matters—But Only Modestly. Swapping among “A picture of,” “Describe
what you see:,” or no explicit prefix shifts accuracy by at most 1.6 points. In contrast, average
caption lengths vary by as much as 3 tokens. This gap suggests that while prompt phrasing
reliably inflates or trims verbosity, it only marginally influences the model’s ability to generate
an answer-key match. In other words, template choice can nudge performance but is not the
dominant factor.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        We present a prompt-ablation study for BLIP-Base and BLIP-Large on ImageCLEF 2025 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], demonstrating that simple
question-prompt variations can affect multiple-choice accuracy. Encouraging the model to keep captions
brief (either via no prompt or a very lean template) appears to help it mention the correct
multiple-choice letter more reliably. Future work may include dynamic prompt optimization and multilingual
adaptation.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we used OpenAI GPT-4o to assist with grammar and spelling
improvements. All suggestions were reviewed and edited by the authors, who take full responsibility
for the final content of the publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] ImageCLEF 2025 Multimodal Reasoning task, https://www.imageclef.org/2025/multimodalreasoning, 2025.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. Dimitrov, M. S. Hee, Z. Xie, R. Jyoti Das, M. Ahsan, S. Ahmad, N. Paev, I. Koychev, P. Nakov, Overview of ImageCLEF 2025 - Multimodal Reasoning, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. Ionescu, H. Müller, D.-C. Stanciu, A.-G. Andrei, A. Radzhabov, Y. Prokopchuk, L.-D. Ştefan, M.-G. Constantin, M. Dogariu, V. Kovalev, H. Damm, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, C. M. Friedrich, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. G. Pakull, B. Bracke, O. Pelka, B. Eryilmaz, H. Becker, W.-W. Yim, N. Codella, R. A. Novoa, J. Malvehy, D. Dimitrov, R. J. Das, Z. Xie, H. M. Shan, P. Nakov, I. Koychev, S. A. Hicks, S. Gautam, M. A. Riegler, V. Thambawita, P. Halvorsen, D. Fabre, C. Macaire, B. Lecouteux, D. Schwab, M. Potthast, M. Heinrich, J. Kiesel, M. Wolter, B. Stein, Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025), Springer Lecture Notes in Computer Science (LNCS), Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Salesforce, BLIP image captioning base model, https://huggingface.co/Salesforce/blip-image-captioning-base, 2023.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Hugging Face, SmolLM-360M model, https://huggingface.co/HuggingFaceTB/SmolLM-360M, 2024.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Hugging Face Blog, SmolLM: Blazingly fast and remarkably powerful, https://huggingface.co/blog/smollm, 2024.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] R. J. Das, S. Hristov, H. Li, D. Dimitrov, I. Koychev, P. Nakov, EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7768-7791. URL: https://aclanthology.org/2024.acl-long.420/. doi:10.18653/v1/2024.acl-long.420.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>