<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Multimodal Feature Alignment Prompt-Enhanced Method for Multimodal Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qida Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianzhong Yan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junyi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan, Guangdong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Multimodal reasoning is a task focused on multilingual visual question answering, aiming to evaluate the reasoning capabilities of modern LLMs on complex inputs presented in various languages and involving diverse subjects. This paper elaborates on the strategy of using prompts that combine image and text features to enhance the image understanding capabilities of multimodal models. By aligning images with descriptive text features and constructing multimodal prompts, the approach aims to improve the model's comprehension of images. The proposed method achieves an accuracy of 74.56% on the multilingual validation set and 56.19% on the multilingual test set, representing a 29.18% improvement in performance on the test set compared to the baseline.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Reasoning</kwd>
        <kwd>Prompt-Enhancement</kwd>
        <kwd>Feature Alignment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Multimodal reasoning tasks, such as ImageCLEF 2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], require models to comprehensively process
and understand information from multiple modalities (e.g., vision, language, and audio) to accomplish
complex reasoning and decision-making. These tasks have not only propelled advancements in the field
of artificial intelligence but have also fostered progress in technologies such as multimodal learning,
cross-modal alignment, and deep learning. Furthermore, through interdisciplinary research, these
tasks have contributed to the development of more robust models, thereby enhancing user experience,
addressing challenges posed by complex tasks, and yielding significant benefits in social and economic
domains [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>
        In recent years, with the ongoing development of multimodal pretraining technology, a new array
of multimodal models has emerged. Among these, the Qwen model, as an advanced Vision-Language
model, provides innovative solutions for multimodal reasoning tasks with its robust multimodal fusion
capabilities and efficient inference performance [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. However, Vision-Language Models (VLMs) still
face challenges in deep logical reasoning and inference. They may struggle to answer questions that
necessitate reasoning through complex dependencies or hypothetical scenarios [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        To overcome this limitation, numerous studies have attempted to enhance the models’ deep reasoning
capabilities through fine-tuning, such as by introducing additional training data or designing
task-specific loss functions to bolster the models’ reasoning abilities [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, due to the complex model
architecture of VLMs, the large scale of parameters, and the scarcity of training data (for instance, in
few-shot settings), directly fine-tuning the entire model for downstream tasks is impractical. Such
fine-tuning may also lead to the forgetting of useful knowledge acquired during the large-scale pretraining
phase and may cause overfitting to the downstream task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Therefore, while fine-tuning is an effective method to enhance model performance, its high cost and
low efficiency limit its practical application for VLMs. This has prompted researchers to explore more
efficient ways to improve the deep reasoning capabilities of VLMs, such as by designing lightweight
adaptation modules or employing prompt engineering strategies. These approaches aim to enhance the
model’s reasoning ability on specific tasks while retaining the knowledge acquired during pretraining
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Inspired by prompt engineering, we propose a prompt engineering strategy for the Qwen-VL-Max
model. By introducing prompts that combine image and text features, our method guides the model to
better understand task requirements, thereby effectively handling complex multimodal reasoning tasks.
Our approach not only retains the strengths of the Qwen-VL-Max model in multimodal understanding
but also significantly enhances its performance in deep reasoning tasks through prompt engineering,
providing an efficient and effective solution for multimodal reasoning tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>The methodology of this study designs multimodal prompts to enhance the image understanding
capabilities of the vision-language model, thereby generalizing better to downstream tasks. Figure 1
illustrates the overall architecture of this method.</p>
      <p>Specifically, for an image reasoning question answering dataset D = {I_1, I_2, . . . , I_N}, where
I_i denotes an image in the dataset D and N represents the size of the dataset D, we first
use a VLM to generate formatted descriptive text T_i for each image I_i. These texts are then
combined with a standardized prompt P to form a set of multimodal prompt pairs MP =
{(P, I_1, T_1), (P, I_2, T_2), . . . , (P, I_N, T_N)}.</p>
      <p>For each MP_i, the image I_i is preprocessed to H × W resolution and divided into N_v patches,
each embedded as a vector v_j. The text T_i is tokenized into L tokens and embedded as vectors e_k.
Subsequently, v and e are fed into separate encoders for images and text to obtain the features H_v and
H_t, respectively. These features are then fused into H_fusion through modality-specific feature alignment
methods. To distinguish between image and text inputs, special tokens &lt;img&gt; and &lt;/img&gt; are used
to wrap the image feature sequence, &lt;box&gt; and &lt;/box&gt; are used for bounding box information,
and &lt;ref&gt; and &lt;/ref&gt; are used for referenced content. Finally, H_fusion is input into the large
language model to obtain the results.</p>
      <p>The following sections will first introduce the construction of multimodal prompts, followed by an
explanation of the multimodal feature alignment methods.</p>
      <sec id="sec-2-1">
        <title>2.1. Construction of Multimodal Prompts</title>
        <p>In the context of this study, for each image I_i belonging to the dataset D, a corresponding descriptive text
T_i is generated using a VLM. This descriptive text is formatted within the model’s prompt to standardize
the style and structure, adhering to the following requirements:
• Content Description: The text must encompass a comprehensive description of the content
present in the image.
• Emphasis on Visual Elements: Particular attention should be given to describing charts, tables,
diagrams, and other illustrative elements that may be present in the image.
• Problem Statement Clarification: The text should clearly articulate the problem or question
posed within the image.
• Option Description: Each option available within the image must be described explicitly.
• Option Specification: The range of options (A, B, C, D, E) should be clearly delineated.</p>
        <p>By standardizing the descriptive text in this manner, the model is better equipped to understand the
questions embedded within the images. Given that the dataset D encompasses question-answer pairs in
various languages, with slightly differing option symbols, it is imperative to further standardize the
format of the output options within the prompt. This standardized prompt is henceforth referred to as
P. A specific example is shown in Figure 2.</p>
        <p>Subsequently, the standardized prompt P, along with the image I_i and its descriptive text T_i, are
combined to form a data pair, creating a multimodal prompt pair MP_i = (P, I_i, T_i).</p>
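        <p>A minimal sketch of this pairing step is given below; the helper generate_description() and the
prompt wording are illustrative assumptions rather than the exact prompts used in our experiments.</p>
        <preformat>
# Minimal sketch: build multimodal prompt pairs MP_i = (P, I_i, T_i).
# generate_description() stands in for a call to a captioning VLM
# (e.g., Qwen-VL-Plus); its prompt wording here is only illustrative.

STANDARD_PROMPT = (
    "Answer the multiple-choice question shown in the image. "
    "Think step by step, then output exactly one option letter (A, B, C, D, or E)."
)

def generate_description(image_path):
    """Placeholder for a VLM call that returns formatted descriptive text T_i."""
    raise NotImplementedError("Replace with an actual VLM inference call.")

def build_prompt_pairs(image_paths):
    """Return the list MP = [(P, I_i, T_i), ...] for a list of image paths."""
    pairs = []
    for image_path in image_paths:
        description = generate_description(image_path)   # T_i
        pairs.append((STANDARD_PROMPT, image_path, description))
    return pairs
</preformat>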
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multimodal Feature Alignment</title>
        <p>
          The multimodal feature alignment method is designed to deeply integrate image and text information
for efficient multimodal understanding and generation [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This method consists of the following key
components:
• Large Language Model (LLM): The foundational component responsible for processing text
inputs and generating linguistic responses.
• Visual Encoder: Utilizes the Vision Transformer (ViT) [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ] architecture to transform image
data into feature representations that can be fused with text data.
• Position-aware Vision-Language Adapter: Aligns and integrates visual features with textual
features, ensuring effective interaction between image and text information.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Text Encoding</title>
        <p>The text encoding process is based on a pre-trained LLM and follows these steps:
• Input Text Representation: The input text T is tokenized into a sequence of tokens, denoted
as T = {t_1, t_2, . . . , t_L}, where L is the length of the text.
• Embedding Layer: Each token t_k is converted into a fixed-dimensional vector e_k through an
embedding layer, i.e., e_k = Embed(t_k).
• Positional Encoding: To preserve the sequential information of the text, positional encoding p_k
is added to each embedded vector, resulting in e′_k = e_k + p_k.
• Large Language Model Encoding: The embedded text vector sequence {e′_1, e′_2, . . . , e′_L} is fed
into the LLM to generate the text feature representation H_t = {h_1, h_2, . . . , h_L}.</p>
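        <p>A minimal PyTorch-style sketch of these steps follows; the vocabulary size, dimensions, and the small
Transformer stack standing in for the pretrained LLM are assumptions for illustration only.</p>
        <preformat>
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Toy text encoder: token embedding + positional encoding, then an encoder stack."""
    def __init__(self, vocab_size=32000, dim=512, max_len=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)   # e_k = Embed(t_k)
        self.pos_embed = nn.Embedding(max_len, dim)        # p_k
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )  # stand-in for the pretrained LLM encoder

    def forward(self, token_ids):
        # token_ids: (batch, L) integer tensor of token indices
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        e = self.token_embed(token_ids) + self.pos_embed(positions)  # e'_k = e_k + p_k
        return self.encoder(e)  # H_t = {h_1, ..., h_L}
</preformat>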
      </sec>
      <sec id="sec-2-4">
        <title>Visual Encoding</title>
        <p>The visual encoding process utilizes the Vision Transformer (ViT) architecture and follows these steps:
• Input Image Preprocessing: The input image I is resized to a specific square resolution H × H, such
as 448 × 448.
• Patch Embedding: The image is divided into patches of size p × p. Each patch is flattened and
embedded into a fixed-dimensional vector. Suppose the image size is H × H and the patch size is
p × p; the image is divided into N_v = (H/p)^2 patches. Each patch x_j is embedded into a vector v_j,
i.e., v_j = PatchEmbed(x_j).
• Positional Encoding: To preserve the spatial information of the image, positional encoding p_j
is added to each patch embedding vector, resulting in v′_j = v_j + p_j.
• Transformer Encoding: The patch embedding vector sequence {v′_1, v′_2, . . . , v′_{N_v}} is fed into
the Vision Transformer to generate the image feature representation H_v = {h′_1, h′_2, . . . , h′_{N_v}}.</p>
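        <p>A simplified sketch of the patch embedding step is shown below; the strided convolution is a common
ViT implementation trick equivalent to flattening and linearly projecting each non-overlapping patch, and the
dimensions are illustrative rather than those of the actual Qwen-VL visual encoder.</p>
        <preformat>
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into p x p patches and embed each as a vector v_j."""
    def __init__(self, image_size=448, patch_size=14, dim=512):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2   # N_v = (H / p)^2
        # A strided convolution flattens and linearly projects each patch in one step.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))  # p_j

    def forward(self, images):
        # images: (batch, 3, 448, 448)
        v = self.proj(images).flatten(2).transpose(1, 2)  # (batch, N_v, dim)
        return v + self.pos_embed                         # v'_j = v_j + p_j
</preformat>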
      </sec>
      <sec id="sec-2-5">
        <title>Vision-Language Fusion</title>
        <p>The fusion of visual and language features is achieved through the position-aware vision-language
adapter, following these steps:
• Feature Compression: Since the length N_v of the image feature sequence H_v is usually much
larger than the length L of the text feature sequence H_t, the image features need to be compressed.
The adapter uses a single-layer cross-attention module to achieve this. Let the learnable query
vectors be Q = {q_1, q_2, . . . , q_M}, where M is the number of query vectors. The cross-attention
operation is defined as:</p>
        <p>A = Softmax(Q H_v^T / √d)</p>
        <p>H′_v = A H_v</p>
        <p>where d is the feature dimension, A is the attention weight matrix, and H′_v is the compressed
image feature representation with length M.
• Position Information Injection: To preserve the spatial information of the image, 2D absolute
positional encoding P_2D is injected into the cross-attention operation, i.e.,</p>
        <p>A = Softmax((Q + P_2D) H_v^T / √d)</p>
        <p>where P_2D is the positional encoding matrix with the same dimension as the query vectors Q.
• Fusion Representation: The compressed image features H′_v and the text features H_t are fed
into the LLM for further integration, generating the final multimodal representation H_fusion.</p>
        <p>The above methods can efficiently integrate image and text data, achieving superior performance in
multimodal tasks.</p>
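        <p>The feature compression step can be sketched as a single cross-attention layer with M learnable queries,
as below; in this simplified sketch the positional term is simply added to the queries, which approximates
rather than reproduces the exact Qwen-VL adapter implementation.</p>
        <preformat>
import math
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Compress N_v image features into M query-aligned features via cross-attention."""
    def __init__(self, dim=512, num_queries=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # Q = {q_1, ..., q_M}
        self.pos = nn.Parameter(torch.zeros(num_queries, dim))      # flattened 2D positional term

    def forward(self, image_features):
        # image_features: (batch, N_v, dim) = H_v from the visual encoder
        d = image_features.size(-1)
        q = (self.queries + self.pos).unsqueeze(0)                  # (1, M, dim)
        scores = torch.matmul(q, image_features.transpose(1, 2)) / math.sqrt(d)
        attn = torch.softmax(scores, dim=-1)                        # A = Softmax((Q + P_2D) H_v^T / sqrt(d))
        return torch.matmul(attn, image_features)                   # H'_v = A H_v, length M
</preformat>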
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Data Pre-processing</title>
        <p>
          The EXAMS-V dataset provided by ImageCLEF 2025 for the multimodal reasoning task consists of
24,856 multiple-choice questions (MCQ) (training set: 16,494; validation set: 4,797; test set: 3,565)
collected from real school exams and other educational sources, presented in the form of images [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
The dataset features:
• Diverse: The content covers pure text questions as well as visual elements such as tables, figures,
graphs, or scientific symbols.
• Multilingual: A multilingual corpus covering 13 different languages, such as English, Arabic,
and Chinese.
• Interdisciplinary: A wide coverage of academic subjects, including biology, chemistry, physics,
and more.
The data pre-processing consists of the following steps (a minimal code sketch is given at the end of
this subsection):
• Binary Encoding Conversion: The binary image encoding is converted into Base64 format, an
encoding method that transforms binary data into ASCII strings for convenient transmission and
processing in text-based systems.
• Image Description Generation: The Qwen-VL-Plus model is utilized to analyze the image and
generate a descriptive text for it. The purpose is to extract key information from the image to
facilitate better understanding of its content by subsequent models.
• Data Pair Construction: The generated image description text is combined with the
Base64-encoded image to form a data pair, which is then passed as input to the Qwen-VL-Max model.
After the model processes the data, the output results are organized in the following format:
• id: A unique identifier (matching a sample from the test set).
• language: The language used in the sample.
        </p>
        <p>• answer_key: The identifier for the correct answer option (one of A, B, C, D, or E).</p>
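        <p>A minimal sketch of the Base64 conversion and data pair construction is shown below; the request
field names are assumptions for illustration and do not reflect the exact API schema used in our experiments.</p>
        <preformat>
import base64

def image_to_base64(image_path):
    """Convert binary image data into a Base64 ASCII string for text-based transport."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def build_request(image_path, description, standard_prompt):
    """Pair the Base64-encoded image with its generated description and the standardized prompt."""
    return {
        "image_base64": image_to_base64(image_path),   # hypothetical field name
        "text": description + "\n\n" + standard_prompt,
    }
</preformat>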
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Results</title>
        <p>The official evaluation metric for this task is accuracy. In this experiment, we use Prompt 2 provided by
ImageCLEF 2025 (a step-by-step reasoning prompt encouraging deeper analysis of textual and visual
cues) as the standardized prompt. Table 2 shows the accuracy of Qwen-VL-Max on the validation set
using the following three methods:
• Qwen-VL-Max (Direct): This method directly applies the Qwen-VL-Max model without any
prompt engineering or additional data pairing.
• Qwen-VL-Max (Prompt-Engineering): This method adjusts the prompt to guide the model
towards more accurate reasoning.
• Qwen-VL-Max (Prompt-Engineering + Pair): This method combines the adjusted prompts
with multimodal data pairs to form multimodal prompts.</p>
        <p>Table 3 presents the comparison of accuracy between Qwen-VL-Max and the baseline methods on
the test set.</p>
        <p>The experimental results show that by introducing multimodal prompts, the Qwen-VL-Max model
has achieved enhanced performance in multimodal reasoning tasks. On the validation set, the model’s
accuracy across all languages has surpassed both the direct use of the model and the use with adjusted
prompts, reaching 74.56% in multilingual settings. On the test set, compared to the baseline methods,
Qwen-VL-Max with prompt engineering and data pairing has seen a comprehensive improvement
in accuracy across all languages, with a 29.18% increase in multilingual accuracy, reaching 56.19%.
This indicates that the proposed method in this paper can effectively enhance the model’s ability to
understand and reason with complex multimodal inputs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper presents a multimodal prompting strategy for the Qwen-VL-Max model, focusing on
enhancing the performance of Vision-Language Models (VLMs) in multimodal reasoning tasks. The
core objective of this study is to enhance the model’s comprehension and reasoning abilities for both
image and text information through meticulously designed multimodal prompts and feature alignment
methods, thereby effectively addressing complex multimodal reasoning tasks. The research findings
and experimental results on the EXAMS-V dataset provided by ImageCLEF 2025 are detailed in this
paper. The experiments demonstrate that the introduction of multimodal prompts can significantly
enhance the image understanding capabilities of VLMs.</p>
      <p>However, this method relies solely on prompting to guide the model; while highly efficient and easy
to implement, it has limitations in how far it can improve the image understanding and reasoning
capabilities of VLMs. Future research may further explore the design and optimization of prompts and
integrate prompt learning with model fine-tuning to improve the models’ reasoning abilities in complex
multimodal tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the Quality Engineering Projects for Teaching Quality and Teaching Reform
in Undergraduate Colleges and Universities of Guangdong Province (No. 20251067).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Kimi in order to: grammar and spelling check.
After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          , L.-D. Ştefan, M.-G. Constantin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          , W.-W. Yim,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <string-name>
            <surname>Hee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Nakov</surname>
            , I. Koychev,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fabre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macaire</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Lecouteux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Heinrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kiesel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
          </string-name>
          , Overview of ImageCLEF 2025:
          <article-title>Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and Interaction,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          , M. Mitchell,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          , Vqa: Visual question answering,
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>
          (
          <year>2015</year>
          )
          <fpage>4</fpage>
          -
          <lpage>31</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:3180429.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Taleb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lippert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nabi</surname>
          </string-name>
          ,
          <article-title>Multimodal self-supervised learning for medical image analysis</article-title>
          ,
          <source>in: Information Processing in Medical Imaging</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:209202500.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Engelcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Posner</surname>
          </string-name>
          ,
          <article-title>Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks</article-title>
          ,
          <source>2017 IEEE International Conference on Robotics and Automation (ICRA)</source>
          (
          <year>2016</year>
          )
          <fpage>1355</fpage>
          -
          <lpage>1361</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:2017183.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Qwen-vl: A frontier large vision-language model with versatile abilities</article-title>
          ,
          <source>ArXiv abs/2308.12966</source>
          (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:263875678.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Qwen-vl: A versatile vision-language model for understanding, localization</article-title>
          , text reading, and beyond,
          <year>2023</year>
          . URL: https://api.semanticscholar.org/CorpusID:261101015.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Joyti Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahsan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paev</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of imageclef 2025 - multimodal reasoning</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: North American Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:52967399.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. U.</given-names>
            <surname>Khattak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rasheed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>Maple: Multi-modal prompt learning</article-title>
          ,
          <source>2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2022</year>
          )
          <fpage>19113</fpage>
          -
          <lpage>19122</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:252735181.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <article-title>The power of scale for parameter-efficient prompt tuning</article-title>
          , in: M.-
          <string-name>
            <surname>F. Moens</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and
          <string-name>
            <given-names>Punta</given-names>
            <surname>Cana</surname>
          </string-name>
          , Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>3045</fpage>
          -
          <lpage>3059</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.243/. doi:10.18653/v1/2021.emnlp-main.243.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Qwen</surname>
            <given-names>Team</given-names>
          </string-name>
          , Introducing Qwen-7B:
          <article-title>Open foundation and human-aligned models (of the state-of-the-arts)</article-title>
          , https://github.com/QwenLM/Qwen-7B,
          <year>2023</year>
          . Accessed: 2025-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , ArXiv abs/2010.11929 (
          <year>2020</year>
          ). URL: https://api.semanticscholar.org/CorpusID:225039882.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Namkoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , L. Schmidt, Openclip, https://doi.org/10.5281/zenodo.5143773,
          <year>2021</year>
          . doi:10.5281/zenodo.5143773, version 0.1.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>I. Koychev</given-names>
          </string-name>
          , P. Nakov,
          <string-name>
            <surname>EXAMS-V:</surname>
          </string-name>
          <article-title>A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7768</fpage>
          -
          <lpage>7791</lpage>
          . URL: https://aclanthology.org/2024.acl-long.420. doi:10.18653/v1/2024.acl-long.420.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>