<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Multimodal Feature Alignment Prompt-Enhanced Method for Multimodal Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qida Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianzhong Yan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junyi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan, Guangdong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Multimodal reasoning is a task focused on multilingual visual question answering, aiming to evaluate the reasoning capabilities of modern LLMs on complex inputs presented in various languages and involving diverse subjects. This paper elaborates on the strategy of using prompts that combine image and text features to enhance the image understanding capabilities of multimodal models. By aligning images with descriptive text features and constructing multimodal prompts, the approach aims to improve the model's comprehension of images. The proposed method achieves an accuracy of 74.56% on the multilingual validation set and 56.19% on the multilingual test set, representing a 29.18% improvement in performance on the test set compared to the baseline.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Reasoning</kwd>
        <kwd>Prompt-Enhancement</kwd>
        <kwd>Feature Alignment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Multimodal reasoning tasks, such as ImageCLEF 2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], require models to comprehensively process
and understand information from multiple modalities (e.g., vision, language, and audio) to accomplish
complex reasoning and decision-making. These tasks have not only propelled advancements in the field
of artificial intelligence but have also fostered progress in technologies such as multimodal learning,
cross-modal alignment, and deep learning. Furthermore, through interdisciplinary research, these
tasks have contributed to the development of more robust models, thereby enhancing user experience,
addressing challenges posed by complex tasks, and yielding significant benefits in social and economic
domains [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>
        In recent years, with the ongoing development of multimodal pretraining technology, a new array
of multimodal models has emerged. Among these, the Qwen model, as an advanced Vision-Language
model, provides innovative solutions for multimodal reasoning tasks with its robust multimodal fusion
capabilities and efficient inference performance [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. However, Vision-Language Models (VLMs) still
face challenges in deep logical reasoning and inference. They may struggle to answer questions that
necessitate reasoning through complex dependencies or hypothetical scenarios [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        To overcome this limitation, numerous studies have attempted to enhance the models’ deep reasoning
capabilities through fine-tuning, such as by introducing additional training data or designing
task-specific loss functions to bolster the models’ reasoning abilities [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, due to the complex model
architecture of VLMs, the large scale of parameters, and the scarcity of training data (for instance, in
few-shot settings), directly fine-tuning the entire model for downstream tasks is impractical. Such
fine-tuning may also lead to the forgetting of useful knowledge acquired during the large-scale pretraining
phase and may cause overfitting to the downstream task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Therefore, while fine-tuning is an effective method to enhance model performance, its high cost and
low efficiency limit its practical application for VLMs. This has prompted researchers to explore more
efficient ways to improve the deep reasoning capabilities of VLMs, such as by designing lightweight
adaptation modules or employing prompt engineering strategies. These approaches aim to enhance the
model’s reasoning ability on specific tasks while retaining the knowledge acquired during pretraining
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Inspired by prompt engineering, we propose a prompt engineering strategy for the Qwen-VL-Max
model. By introducing prompts that combine image and text features, our method guides the model to
better understand task requirements, thereby effectively handling complex multimodal reasoning tasks.
Our approach not only retains the strengths of the Qwen-VL-Max model in multimodal understanding
but also significantly enhances its performance in deep reasoning tasks through prompt engineering,
providing an efficient and effective solution for multimodal reasoning tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>The methodology of this study designs multimodal prompts to enhance the image understanding
capabilities of the vision-language model, thereby generalizing better to downstream tasks. Figure 1
illustrates the overall architecture of this method.</p>
      <p>Specifically, for an image reasoning question answering dataset D = {I_1, I_2, . . . , I_N}, where
I_i denotes an image in the dataset D and N represents the size of the dataset D, we first
use a VLM to generate formatted descriptive text T_i for each image I_i. These texts are then
combined with a standardized prompt P to form a set of multimodal prompt pairs MP =
{(P, I_1, T_1), (P, I_2, T_2), . . . , (P, I_N, T_N)}.</p>
      <p>For each MP_i, the image I_i is preprocessed to H × W resolution and divided into N_v patches,
each embedded as a vector v_j. The text T_i is tokenized into L tokens and embedded as vectors e_k.
Subsequently, v and e are fed into separate encoders for images and text to obtain the features H_v and
H_t, respectively. These features are then fused into H_fusion through modality-specific feature alignment
methods. To distinguish between image and text inputs, special tokens &lt;img&gt; and &lt;/img&gt; are used
to wrap the image feature sequence, &lt;box&gt; and &lt;/box&gt; are used for bounding box information,
and &lt;ref&gt; and &lt;/ref&gt; are used for referenced content. Finally, H_fusion is input into the large
language model to obtain the results.</p>
      <p>The following sections will first introduce the construction of multimodal prompts, followed by an
explanation of the multimodal feature alignment methods.</p>
      <sec id="sec-2-1">
        <title>2.1. Construction of Multimodal Prompts</title>
        <p>In the context of this study, for each image I_i belonging to the dataset D, a corresponding descriptive text
T_i is generated using a VLM. This descriptive text is formatted within the model’s prompt to standardize
the style and structure, adhering to the following requirements:
• Content Description: The text must encompass a comprehensive description of the content
present in the image.
• Emphasis on Visual Elements: Particular attention should be given to describing charts, tables,
diagrams, and other illustrative elements that may be present in the image.
• Problem Statement Clarification: The text should clearly articulate the problem or question
posed within the image.
• Option Description: Each option available within the image must be described explicitly.
• Option Specification: The range of options (A, B, C, D, E) should be clearly delineated.</p>
        <p>By standardizing the descriptive text in this manner, the model is better equipped to understand the
questions embedded within the images. Given that the dataset D encompasses question-answer pairs in
various languages, with slightly differing option symbols, it is imperative to further standardize the
format of the output options within the prompt. This standardized prompt is henceforth referred to as
P. A specific example is shown in Figure 2.</p>
        <p>Subsequently, the standardized prompt P, along with the image I_i and its descriptive text T_i, are
combined to form a data pair, creating a multimodal prompt pair MP_i = (P, I_i, T_i).</p>
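        <p>A minimal sketch of this pairing step is given below; the helper generate_description() and the
prompt wording are illustrative assumptions rather than the exact prompts used in our experiments.</p>
        <preformat>
# Minimal sketch: build multimodal prompt pairs MP_i = (P, I_i, T_i).
# generate_description() stands in for a call to a captioning VLM
# (e.g., Qwen-VL-Plus); its prompt wording here is only illustrative.

STANDARD_PROMPT = (
    "Answer the multiple-choice question shown in the image. "
    "Think step by step, then output exactly one option letter (A, B, C, D, or E)."
)

def generate_description(image_path):
    """Placeholder for a VLM call that returns formatted descriptive text T_i."""
    raise NotImplementedError("Replace with an actual VLM inference call.")

def build_prompt_pairs(image_paths):
    """Return the list MP = [(P, I_i, T_i), ...] for a list of image paths."""
    pairs = []
    for image_path in image_paths:
        description = generate_description(image_path)   # T_i
        pairs.append((STANDARD_PROMPT, image_path, description))
    return pairs
</preformat>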
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multimodal Feature Alignment</title>
        <p>
          The multimodal feature alignment method is designed to deeply integrate image and text information
for efficient multimodal understanding and generation [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This method consists of the following key
components:
• Large Language Model (LLM): The foundational component responsible for processing text
inputs and generating linguistic responses.
• Visual Encoder: Utilizes the Vision Transformer (ViT) [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ] architecture to transform image
data into feature representations that can be fused with text data.
• Position-aware Vision-Language Adapter: Aligns and integrates visual features with textual
features, ensuring effective interaction between image and text information.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Text Encoding</title>
        <p>The text encoding process is based on a pre-trained LLM and follows these steps:
• Input Text Representation: The input text T is tokenized into a sequence of tokens, denoted
as T = {t_1, t_2, . . . , t_L}, where L is the length of the text.
• Embedding Layer: Each token t_k is converted into a fixed-dimensional vector e_k through an
embedding layer, i.e., e_k = Embed(t_k).
• Positional Encoding: To preserve the sequential information of the text, positional encoding p_k
is added to each embedded vector, resulting in e′_k = e_k + p_k.
• Large Language Model Encoding: The embedded text vector sequence {e′_1, e′_2, . . . , e′_L} is fed
into the LLM to generate the text feature representation H_t = {h_1, h_2, . . . , h_L}.</p>
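        <p>A minimal PyTorch-style sketch of these steps follows; the vocabulary size, dimensions, and the small
Transformer stack standing in for the pretrained LLM are assumptions for illustration only.</p>
        <preformat>
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Toy text encoder: token embedding + positional encoding, then an encoder stack."""
    def __init__(self, vocab_size=32000, dim=512, max_len=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)   # e_k = Embed(t_k)
        self.pos_embed = nn.Embedding(max_len, dim)        # p_k
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )  # stand-in for the pretrained LLM encoder

    def forward(self, token_ids):
        # token_ids: (batch, L) integer tensor of token indices
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        e = self.token_embed(token_ids) + self.pos_embed(positions)  # e'_k = e_k + p_k
        return self.encoder(e)  # H_t = {h_1, ..., h_L}
</preformat>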
      </sec>
      <sec id="sec-2-4">
        <title>Visual Encoding</title>
        <p>The visual encoding process utilizes the Vision Transformer (ViT) architecture and follows these steps:
• Input Image Preprocessing: The input image I is resized to a specific square resolution H × H, such
as 448 × 448.
• Patch Embedding: The image is divided into patches of size p × p. Each patch is flattened and
embedded into a fixed-dimensional vector. Suppose the image size is H × H and the patch size is
p × p; the image is divided into N_v = (H/p)^2 patches. Each patch x_j is embedded into a vector v_j,
i.e., v_j = PatchEmbed(x_j).
• Positional Encoding: To preserve the spatial information of the image, positional encoding p_j
is added to each patch embedding vector, resulting in v′_j = v_j + p_j.
• Transformer Encoding: The patch embedding vector sequence {v′_1, v′_2, . . . , v′_{N_v}} is fed into
the Vision Transformer to generate the image feature representation H_v = {h′_1, h′_2, . . . , h′_{N_v}}.</p>
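        <p>A simplified sketch of the patch embedding step is shown below; the strided convolution is a common
ViT implementation trick equivalent to flattening and linearly projecting each non-overlapping patch, and the
dimensions are illustrative rather than those of the actual Qwen-VL visual encoder.</p>
        <preformat>
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into p x p patches and embed each as a vector v_j."""
    def __init__(self, image_size=448, patch_size=14, dim=512):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2   # N_v = (H / p)^2
        # A strided convolution flattens and linearly projects each patch in one step.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))  # p_j

    def forward(self, images):
        # images: (batch, 3, 448, 448)
        v = self.proj(images).flatten(2).transpose(1, 2)  # (batch, N_v, dim)
        return v + self.pos_embed                         # v'_j = v_j + p_j
</preformat>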
      </sec>
      <sec id="sec-2-5">
        <title>Vision-Language Fusion</title>
        <p>The fusion of visual and language features is achieved through the position-aware vision-language
adapter, following these steps:
• Feature Compression: Since the length N_v of the image feature sequence H_v is usually much
larger than the length L of the text feature sequence H_t, the image features need to be compressed.
The adapter uses a single-layer cross-attention module to achieve this. Let the learnable query
vectors be Q = {q_1, q_2, . . . , q_M}, where M is the number of query vectors. The cross-attention
operation is defined as:</p>
        <p>A = Softmax(Q H_v^T / √d)</p>
        <p>H′_v = A H_v</p>
        <p>where d is the feature dimension, A is the attention weight matrix, and H′_v is the compressed
image feature representation with length M.
• Position Information Injection: To preserve the spatial information of the image, 2D absolute
positional encoding P_2D is injected into the cross-attention operation, i.e.,</p>
        <p>A = Softmax((Q + P_2D) H_v^T / √d)</p>
        <p>where P_2D is the positional encoding matrix with the same dimension as the query vectors Q.
• Fusion Representation: The compressed image features H′_v and the text features H_t are fed
into the LLM for further integration, generating the final multimodal representation H_fusion.</p>
        <p>The above methods can efficiently integrate image and text data, achieving superior performance in
multimodal tasks.</p>
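        <p>The feature compression step can be sketched as a single cross-attention layer with M learnable queries,
as below; in this simplified sketch the positional term is simply added to the queries, which approximates
rather than reproduces the exact Qwen-VL adapter implementation.</p>
        <preformat>
import math
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Compress N_v image features into M query-aligned features via cross-attention."""
    def __init__(self, dim=512, num_queries=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # Q = {q_1, ..., q_M}
        self.pos = nn.Parameter(torch.zeros(num_queries, dim))      # flattened 2D positional term

    def forward(self, image_features):
        # image_features: (batch, N_v, dim) = H_v from the visual encoder
        d = image_features.size(-1)
        q = (self.queries + self.pos).unsqueeze(0)                  # (1, M, dim)
        scores = torch.matmul(q, image_features.transpose(1, 2)) / math.sqrt(d)
        attn = torch.softmax(scores, dim=-1)                        # A = Softmax((Q + P_2D) H_v^T / sqrt(d))
        return torch.matmul(attn, image_features)                   # H'_v = A H_v, length M
</preformat>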
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Data Pre-processing</title>
        <p>
          The EXAMS-V dataset provided by ImageCLEF 2025 for the multimodal reasoning task consists of
24,856 multiple-choice questions (MCQ) (training set: 16,494; validation set: 4,797; test set: 3,565)
collected from real school exams and other educational sources, presented in the form of images [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
The dataset features:
• Diverse: The content covers pure text questions as well as visual elements such as tables, figures,
graphs, or scientific symbols.
• Multilingual: A multilingual corpus covering 13 different languages, such as English, Arabic,
and Chinese.
• Interdisciplinary: A wide coverage of academic subjects, including biology, chemistry, physics,
and more.
The data pre-processing consists of the following steps (a minimal code sketch is given at the end of
this subsection):
• Binary Encoding Conversion: The binary image encoding is converted into Base64 format, an
encoding method that transforms binary data into ASCII strings for convenient transmission and
processing in text-based systems.
• Image Description Generation: The Qwen-VL-Plus model is utilized to analyze the image and
generate a descriptive text for it. The purpose is to extract key information from the image to
facilitate better understanding of its content by subsequent models.
• Data Pair Construction: The generated image description text is combined with the
Base64-encoded image to form a data pair, which is then passed as input to the Qwen-VL-Max model.
After the model processes the data, the output results are organized in the following format:
• id: A unique identifier (matching a sample from the test set).
• language: The language used in the sample.
        </p>
        <p>• answer_key: The identifier for the correct answer option (one of A, B, C, D, or E).</p>
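        <p>A minimal sketch of the Base64 conversion and data pair construction is shown below; the request
field names are assumptions for illustration and do not reflect the exact API schema used in our experiments.</p>
        <preformat>
import base64

def image_to_base64(image_path):
    """Convert binary image data into a Base64 ASCII string for text-based transport."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def build_request(image_path, description, standard_prompt):
    """Pair the Base64-encoded image with its generated description and the standardized prompt."""
    return {
        "image_base64": image_to_base64(image_path),   # hypothetical field name
        "text": description + "\n\n" + standard_prompt,
    }
</preformat>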
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Results</title>
        <p>The official evaluation metric for this task is accuracy. In this experiment, we use Prompt 2 provided by
ImageCLEF 2025 (a step-by-step reasoning prompt encouraging deeper analysis of textual and visual
cues) as the standardized prompt. Table 2 shows the accuracy of Qwen-VL-Max on the validation set
using the following three methods:
• Qwen-VL-Max (Direct): This method directly applies the Qwen-VL-Max model without any
prompt engineering or additional data pairing.
• Qwen-VL-Max (Prompt-Engineering): This method adjusts the prompt to guide the model
towards more accurate reasoning.
• Qwen-VL-Max (Prompt-Engineering + Pair): This method combines the adjusted prompts
with multimodal data pairs to form multimodal prompts.</p>
        <p>Table 3 presents the comparison of accuracy between Qwen-VL-Max and the baseline methods on
the test set.</p>
        <p>The experimental results show that by introducing multimodal prompts, the Qwen-VL-Max model
has achieved enhanced performance in multimodal reasoning tasks. On the validation set, the model’s
accuracy across all languages has surpassed both the direct use of the model and the use with adjusted
prompts, reaching 74.56% in multilingual settings. On the test set, compared to the baseline methods,
Qwen-VL-Max with prompt engineering and data pairing has seen a comprehensive improvement
in accuracy across all languages, with a 29.18% increase in multilingual accuracy, reaching 56.19%.
This indicates that the proposed method in this paper can effectively enhance the model’s ability to
understand and reason with complex multimodal inputs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper presents a multimodal prompting strategy for the Qwen-VL-Max model, focusing on
enhancing the performance of Vision-Language Models (VLMs) in multimodal reasoning tasks. The
core objective of this study is to enhance the model’s comprehension and reasoning abilities for both
image and text information through meticulously designed multimodal prompts and feature alignment
methods, thereby effectively addressing complex multimodal reasoning tasks. The research findings
and experimental results on the EXAMS-V dataset provided by ImageCLEF 2025 are detailed in this
paper. The experiments demonstrate that the introduction of multimodal prompts can significantly
enhance the image understanding capabilities of VLMs.</p>
      <p>However, this method relies solely on prompting to guide the model; while highly efficient and easy
to implement, it has limitations in how far it can improve the image understanding and reasoning
capabilities of VLMs. Future research may further explore the design and optimization of prompts and
integrate prompt learning with model fine-tuning to improve the models’ reasoning abilities in complex
multimodal tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the Quality Engineering Projects for Teaching Quality and Teaching Reform
in Undergraduate Colleges and Universities of Guangdong Province (No. 20251067).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Kimi in order to: grammar and spelling check.
After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          , L.-D. Ştefan, M.-G. Constantin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          , W.-W. Yim,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          <string-name>
            <surname>Hee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Nakov</surname>
            , I. Koychev,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fabre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macaire</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Lecouteux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Heinrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kiesel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
          </string-name>
          , Overview of ImageCLEF 2025:
          <article-title>Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and Interaction,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          , M. Mitchell,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          , Vqa: Visual question answering,
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>
          (
          <year>2015</year>
          )
          <fpage>4</fpage>
          -
          <lpage>31</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:3180429.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Taleb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lippert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nabi</surname>
          </string-name>
          ,
          <article-title>Multimodal self-supervised learning for medical image analysis</article-title>
          ,
          <source>in: Information Processing in Medical Imaging</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:209202500.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Engelcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Posner</surname>
          </string-name>
          ,
          <article-title>Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks</article-title>
          ,
          <source>2017 IEEE International Conference on Robotics and Automation (ICRA)</source>
          (
          <year>2016</year>
          )
          <fpage>1355</fpage>
          -
          <lpage>1361</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:2017183.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Qwen-vl: A frontier large vision-language model with versatile abilities</article-title>
          ,
          <source>ArXiv abs/2308.12966</source>
          (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:263875678.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Qwen-vl: A versatile vision-language model for understanding, localization</article-title>
          , text reading, and beyond,
          <year>2023</year>
          . URL: https://api.semanticscholar.org/CorpusID:261101015.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Joyti Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahsan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paev</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of imageclef 2025 - multimodal reasoning</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: North American Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:52967399.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. U.</given-names>
            <surname>Khattak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rasheed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>Maple: Multi-modal prompt learning</article-title>
          ,
          <source>2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2022</year>
          )
          <fpage>19113</fpage>
          -
          <lpage>19122</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:252735181.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <article-title>The power of scale for parameter-efficient prompt tuning</article-title>
          , in: M.-
          <string-name>
            <surname>F. Moens</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and
          <string-name>
            <given-names>Punta</given-names>
            <surname>Cana</surname>
          </string-name>
          , Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>3045</fpage>
          -
          <lpage>3059</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.243/. doi:10.18653/v1/2021.emnlp-main.243.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Qwen</surname>
            <given-names>Team</given-names>
          </string-name>
          , Introducing Qwen-7B:
          <article-title>Open foundation and human-aligned models (of the state-of-the-arts)</article-title>
          , https://github.com/QwenLM/Qwen-7B,
          <year>2023</year>
          . Accessed: 2025-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , ArXiv abs/2010.11929 (
          <year>2020</year>
          ). URL: https://api.semanticscholar.org/CorpusID:225039882.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Namkoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , L. Schmidt, Openclip, https://doi.org/10.5281/zenodo.5143773,
          <year>2021</year>
          . doi:10.5281/zenodo.5143773, version 0.1.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>I. Koychev</given-names>
          </string-name>
          , P. Nakov,
          <string-name>
            <surname>EXAMS-V:</surname>
          </string-name>
          <article-title>A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7768</fpage>
          -
          <lpage>7791</lpage>
          . URL: https://aclanthology.org/2024.acl-long.420. doi:10.18653/v1/2024.acl-long.420.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>