<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bridging the Modality Gap Through CoT-Enhanced Multimodal Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shengjun Deng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guo Niu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiongfei Yao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huanlin Mo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tao Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuaiwei Jiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper proposes a "Question Reconstruction before Answering" (QRA) prompting strategy for the ImageCLEF2025 multimodal reasoning task. The method first completes missing question stems using image information, then guides the language model through step-by-step reasoning and answering, thereby enhancing the model's comprehension and reasoning capabilities. On the EXAMS-V dataset, our investigation of different prompts and their impact on accuracy shows that QRA prompting demonstrates strong cross-lingual adaptability compared to conventional Chain-of-Thought (CoT) prompting. Experimental results show that this method effectively improves visual question answering performance without requiring OCR or additional fine-tuning, offering a new perspective for multimodal reasoning tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal reasoning</kwd>
        <kwd>Vision-language models</kwd>
        <kwd>Prompt engineering</kwd>
        <kwd>Chain-of-thought</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Visual-Language Models</title>
        <p>
          Visual-Language Models (VLMs) have made significant progress in multimodal understanding tasks in
recent years. CLIP[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] established the foundation for multimodal pretraining by constructing a general
image-text embedding space through contrastive learning of images and text. BLIP-2[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] introduced a
lightweight intermediate module to connect a frozen visual encoder with a language model, enhancing
image-text question answering and generation capabilities. LLaVA[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] combined CLIP and LLM, adding a
projection layer to improve the model’s understanding of images through visual instruction fine-tuning,
supporting various question answering and dialogue scenarios. VisionLLM[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] optimized the visual
attention mechanism based on BLIP-2, achieving more refined image-text alignment. Qwen2.5-VL[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
further expanded the model’s perceptual capabilities by optimizing the projection layer and other
methods, demonstrating strong reasoning abilities with excellent performance on multiple benchmarks.
        </p>
        <p>Although these methods have made progress in image-text alignment and language generation,
their reasoning remains limited when the prompt is incomplete. Our approach attempts
to address this shortcoming by reconstructing the question text and combining it with Chain-of-Thought
(CoT) reasoning.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Chain-of-Thought Prompt</title>
        <p>
          Chain-of-Thought (CoT) prompting significantly enhances the reasoning capabilities of large language
models in complex tasks by guiding the model to generate intermediate reasoning steps [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In scenarios
such as mathematics and commonsense question answering, CoT helps the model decompose problems
step-by-step and generate coherent reasoning chains, thereby improving accuracy. However, applying
CoT to multimodal tasks still faces challenges. On one hand, CoT typically relies on explicit textual
prompts, but key information in multimodal tasks may exist in visual form, making it difficult for the
model to correctly understand the problem. On the other hand, visual features lack clear semantic
boundaries, and directly inputting them into the language model often leads to prompt interpretation
deviations due to the "modality gap," which in turn affects the completeness and logical coherence of the
reasoning chain.
        </p>
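        <p>As a purely illustrative example (the wording below is our own assumption, not a prompt taken from the task), the difference between direct prompting and zero-shot CoT prompting can be sketched in Python as follows:</p>
        <preformat>
# Illustrative contrast between a direct prompt and a zero-shot CoT prompt.
# The trigger phrase is a common convention, not the exact prompt we used.
question = "If a train travels 60 km in 1.5 hours, what is its average speed?"

direct_prompt = question + "\nAnswer:"
cot_prompt = question + "\nLet's think step by step, then state the final answer."
        </preformat>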
        <fig id="fig1">
          <caption><p>Figure 1: Architecture of the proposed method. Images from the datasets pass through a visual encoder; the resulting image features are combined with the template "Image &lt;|image feature|&gt; The &lt;|language|&gt; problem is about &lt;|subject|&gt;" and fed to the LLM, which produces the answer.</p></caption>
        </fig>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        In image-only multimodal question answering tasks, such as the EXAMS-V dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], visual encoders
often lead to the loss of certain semantic information when abstractly representing images, particularly
the textual cues and detailed content in the images that are relevant to the question stem. This
information gap makes it difficult for language models to construct clear reasoning chains. In contrast,
when faced with obscured or incomplete questions, humans are usually able to reasonably complete
the missing information based on their existing background knowledge and contextual understanding,
thereby successfully completing the reasoning task.
      </p>
      <p>Inspired by this, we propose a "complete first, then reason" strategy, whose architecture is shown in
Figure 1. This strategy first uses image features to guide the language model in completing the missing
question information, thereby reconstructing the complete question stem; subsequently, based on the
reconstruction result, a Chain-of-Thought (CoT) reasoning mechanism is introduced to enhance the
model's cross-modal reasoning ability. This method not only enhances the model's understanding of
the task context but also effectively alleviates the semantic disconnect caused by modality differences.
Specifically, our method includes two key steps: 1) question background information prompt embedding,
and 2) Question-Reasoning-Answer prompting.</p>
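      <p>For concreteness, the following Python sketch outlines the two steps end to end. It is a minimal sketch under stated assumptions: query_vlm stands in for any chat-style vision-language model API, and the template wording is illustrative rather than the exact prompt used in our experiments.</p>
      <preformat>
# Minimal sketch of the "complete first, then reason" (QRA) pipeline.
# query_vlm is a hypothetical stand-in for a chat-style VLM API; the
# template wording is illustrative, not the exact prompt we used.

def build_qra_prompt(language: str, subject: str) -> str:
    # Step 1: embed question background information (language, subject).
    # Step 2: ask for reconstruction, reasoning, and answer in tagged form.
    return (
        f"The {language} problem in the image is about {subject}. "
        "First reconstruct the complete question from the image, then reason "
        "step by step, then answer, using the tags "
        "&lt;problem&gt;...&lt;/problem&gt; &lt;think&gt;...&lt;/think&gt; &lt;answer&gt;...&lt;/answer&gt;."
    )

def answer_from_image(image, language, subject, query_vlm):
    # Returns the tagged model output; parsing is sketched in Section 3.2.
    return query_vlm(image=image, prompt=build_qra_prompt(language, subject))
      </preformat>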
      <sec id="sec-3-1">
        <title>3.1. Question Background Information Prompt Embedding</title>
        <p>In practical multimodal question answering tasks, questions often involve specific languages and
subject backgrounds, with language expressions that are highly specialized and context-dependent.
Especially in scenarios containing only images, language models, lacking explicit context, are prone to
misunderstandings of the question stem.</p>
        <p>To address this issue, we introduce question background information embedding. Specifically, we
extract the language category (e.g., English, French, etc.) and subject labels (e.g., physics, chemistry, etc.)
of the question from the image’s metadata and use them as prior knowledge prompt words to guide the
language model in context modeling. This approach effectively mitigates semantic ambiguity caused
by language specificity, making the model more targeted and accurate when generating completion
content.</p>
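        <p>A minimal sketch of this embedding step, assuming the language and subject labels are available as metadata fields (the field names here are our own illustration):</p>
        <preformat>
# Build the background-information prefix from question metadata.
# The metadata field names ("language", "subject") are assumptions.

def background_prefix(meta: dict) -> str:
    language = meta.get("language", "unknown language")
    subject = meta.get("subject", "unknown subject")
    return f"The {language} problem in the image is about {subject}."

# Example:
# background_prefix({"language": "French", "subject": "physics"})
# -> "The French problem in the image is about physics."
        </preformat>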
        <fig id="fig2">
          <caption><p>Figure 2: Example outputs of the three prompting strategies on the same metre-bridge question. Standard Prompting ("There is a question in the image, please provide the correct answer.") yields an unreliable response: "Upon rechecking, it appears there might be a typo in the problem or the options. However, based on the calculations, the closest match is: D (135±0.23)". Chain-of-Thought Prompting asks the model to enclose its reasoning and answer in &lt;think&gt;...&lt;/think&gt; and &lt;answer&gt;...&lt;/answer&gt; tags and answers "A (60±0.15)". Question-Reasoning-Answer Prompting first reconstructs the question inside &lt;problem&gt;...&lt;/problem&gt; tags ("During an experiment with a metre bridge, the galvanometer shows a null point when the jockey is pressed at 40.0 cm using a standard resistance of 90 …"), then derives the Wheatstone bridge balance condition inside &lt;think&gt;...&lt;/think&gt;, and answers correctly: "C (60±0.25)".</p></caption>
        </fig>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Question Reconstruction before Answering Prompting</title>
        <p>After completing the question stem, the model still requires strong reasoning capabilities to correctly
perform the question-answering task. Traditional Chain-of-Thought (CoT) prompting, which guides
language models to generate intermediate reasoning steps, has achieved significant success in textual
reasoning tasks. However, directly applying the CoT mechanism to multimodal question-answering
tasks involving only images can lead to information confusion or insufficient semantic alignment,
resulting in the model’s inability to construct coherent and clear reasoning chains.</p>
        <p>To address this, we propose a structured "Question Reconstruction before Answering" guided
prompting strategy, aiming to explicitly separate the question comprehension process from the reasoning
process to enhance the model’s ability to build reasoning chains. Specifically, we design a unified prompt
template that introduces the &lt;Question&gt;...&lt;/Question&gt; tag to guide the model in first understanding
the question before engaging in step-by-step thinking and answering. We show the effects of the three
types of prompts in Figure 2.</p>
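        <p>Assuming the model follows the template, the tagged output can be parsed as in the following sketch (the regular expressions are our own illustration, not part of the method specification):</p>
        <preformat>
import re

# Extract the reconstructed question, the reasoning, and the final answer
# from the tagged model output; a field is None if the tag was not emitted.

def parse_qra_output(text: str) -> dict:
    def grab(tag: str):
        m = re.search(rf"&lt;{tag}&gt;(.*?)&lt;/{tag}&gt;", text, re.DOTALL)
        return m.group(1).strip() if m else None
    return {
        "problem": grab("problem"),
        "reasoning": grab("think"),
        "answer": grab("answer"),
    }
        </preformat>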
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Comparative Experiments</title>
        <p>To validate the effectiveness of our proposed QRA Prompting strategy, we participated in the
ImageCLEF2025 multimodal reasoning task and submitted test results for both the Multilingual Track and the
English Track. Table 1 lists our performance on the multilingual test set.</p>
        <p>In the Multilingual Track, our method ranked 6th among all participating teams, achieving an
accuracy of 0.5195. Compared to the official baseline method (accuracy of 0.2701), our approach
improved accuracy by 24.9 percentage points, demonstrating the strong competitiveness of our method in practical
tasks. This significant improvement indicates that our proposed structured strategy of "first completing
the question, then reasoning" has clear advantages in alleviating inter-modal information misalignment
and enhancing cross-modal understanding.</p>
        <p>Notably, we achieved near-top-tier performance without relying on any additional OCR modules or
fine-tuning the model for multilingual tasks. This demonstrates that QRA Prompting possesses strong
robustness and excellent transfer generalization capabilities, performing stably and reliably in complex
real-world multimodal reasoning scenarios.</p>
        <p>In the English Track, we also submitted model predictions based on QRA Prompting, achieving an
accuracy of 0.5371 and ranking 6th, as shown in Table 2. Our method consistently delivered strong
performance across both tasks, further validating its cross-language consistency.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ablation Study</title>
        <p>To systematically evaluate the contributions of each key component in QRA Prompting, we conducted
ablation experiments on the English validation set of the EXAMS-V dataset. The experiments used
QwenVL-2.5-32B as the base model, employed a zero-shot setting, and compared against standard Prompting
and Chain-of-Thought (CoT) Prompting methods. As shown in Table 3, the standard Prompting method
achieved an accuracy of 0.458, demonstrating relatively weak performance. The CoT Prompting method,
which guides the model through chain-of-thought reasoning, improved accuracy to 0.548. The QRA
Prompting strategy further enhanced this performance, achieving an accuracy of 0.582, an improvement
of 12.4 percentage points over the standard method and 3.4 percentage points over the CoT method.</p>
        <p>These results indicate that QRA Prompting not only inherits the advantages of chain-of-thought
reasoning from CoT but also effectively enhances the language model's understanding of image semantics
through explicit question stem completion, significantly boosting the model’s performance in complex
reasoning tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we addressed the Multimodal Reasoning task of the ImageCLEF2025 Multimodal Lab.
By employing the QRA strategy, we enhance the inference accuracy of models in visual question
answering tasks. Our approach involves constructing QRA prompt templates and integrating contextual
information. These strategies effectively address two key challenges faced by traditional
Chain-of-Thought (CoT) prompting in multimodal scenarios: they alleviate the "modality gap" problem caused by relying
solely on visual features and enhance the ability to reconstruct missing question text from visual data.</p>
      <p>Evaluation results demonstrate the feasibility and effectiveness of our method, achieving an accuracy
of 0.5195 on the multilingual version of the EXAMS-V test set. These findings indicate that our approach
provides a viable solution for visual question answering tasks that use only visual features, contributing
to the field of multimodal reasoning.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the Research Projects of Ordinary Universities in Guangdong Province
under Grant 2023KTSCX133, and the Guangdong Basic and Applied Basic Research Foundation under Grant
2022A1515140103.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used deepseek-v3 for grammar and spelling
checking. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          , in: International conference on machine learning,
          <source>PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19730</fpage>
          -
          <lpage>19742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Visual instruction tuning,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2304.08485. arXiv:2304.08485.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          , L.-D. Ştefan, M.-G. Constantin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          , W.-W. Yim,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025)</source>
          , Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jyoti Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ahsan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Paev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025 - multimodal reasoning</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Hristov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.10378. arXiv:2403.10378.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , G. Zeng,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          , Visionllm:
          <article-title>Large language model is also an open-ended decoder for vision-centric tasks</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2305.11175. arXiv:2305.11175.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xie</surname>
          </string-name>
          , Z. Cheng, H. Zhang,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Qwen2.5-VL technical report</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.13923. arXiv:2502.13923.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>