<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Seeing Cause and Time: a Visually Grounded Evaluation of Multimodal Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore Ergoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Reasoning about causal and temporal relationships is fundamental to human intelligence but poses a persistent challenge for AI. Vision-Language Models (VLMs) offer a promising path towards more robust conceptual understanding by grounding language in perception. However, it is unclear if this grounding enables genuine, human-like reasoning. We investigate this question by focusing on the causal and temporal abilities of two leading VLMs using a novel multimodal dataset derived from the ExpliCa dataset. Through a series of carefully designed tasks, we isolate their performance on visual-only input versus combined visual-textual inputs. Our results show that while models exhibit some reasoning capability, they are hindered by a marked “iconicity bias”: their performance degrades on relations where the perceptual sequence of images mismatches the logical event order (i.e., anti-iconic). This reliance on simple visual heuristics suggests that their high-level reasoning failures may be symptomatic of a more fundamental, fragile visual understanding.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodality</kwd>
        <kwd>Causal Reasoning</kwd>
        <kwd>Temporal Reasoning</kwd>
        <kwd>Vision Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The ability to comprehend and reason about causal and temporal relationships is a cornerstone of human cognition, underpinning our capacity to understand narratives, predict outcomes and navigate the complexities of the world. We effortlessly discern why an event occurred and the sequence in which events unfolded, integrating information from various modalities. While Large Language Models (LLMs) have demonstrated remarkable fluency in generating text that describes such relationships, a critical question remains: do they possess a genuine, human-like understanding of these fundamental concepts, or do they primarily rely on sophisticated pattern matching learned from vast textual corpora [1, 2]? This distinction is crucial, as linguistic proficiency can sometimes obscure deeper cognitive limitations, a phenomenon known as the “fallacy of language as thought” [3].</p>
      <p>Recent advancements have led to the development of Vision Language Models (VLMs), which are trained on both textual and visual data [4, 5]. This multimodal grounding offers a potential pathway to richer, more robust representations of concepts, potentially bridging the gap between linguistic competence and conceptual understanding, as human meaning representation itself relies on multiple modalities [6, 7]. However, the extent to which this enriched input translates into superior causal and temporal reasoning capabilities remains an area in need of investigation.</p>
      <p>This paper contributes to this line of inquiry by conducting a focused analysis of the causal and temporal reasoning abilities of two distinct, current-generation multimodal models: Llama-11b-vision and Gemini-flash-2.0. We explore their performance with a series of carefully designed tasks on a multimodal version of the ExpliCa dataset, which explicitly combines causal and temporal relations [8]. Our objective is twofold: first, we aim to assess the models’ capacity to infer these relations from visual input alone; second, we want to assess how their performance changes when the visual stimuli accompany the textual captions. We do so by comparing models with differing architectures and parameter counts and by varying the input modalities. Our experimental methodology involves i) constructing a novel image dataset, which we name Visual-ExpliCa, aligned with the ExpliCa dataset, and ii) evaluating the models on five distinct tasks of increasing difficulty. The tasks range from directly identifying the type of relationship (causal vs. temporal) and specifying the antecedent and consequent from image-only input, to selecting the correct linguistic connective and judging the overall acceptability of an event when both images and textual descriptions are provided. Through this graduated approach, we seek to disentangle the models’ visual inferencing capabilities from their ability to integrate multimodal information.</p>
      <p>Our findings reveal that while both models demonstrate capabilities beyond chance in interpreting visual sequences, they exhibit distinct strengths, weaknesses and biases, particularly struggling with anti-iconic relations (i.e., when the sequence of events is inverted compared to their chronological and/or logical-causal order) when relying solely on visual input. This suggests that current VLMs, despite their multimodal training, may still heavily favour direct, sequential interpretations of visual information for complex reasoning tasks.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Related works</title>
      <p>A growing body of work focuses on assessing the reasoning abilities of pre-trained models, particularly in the domain of causality. LLMs have been evaluated on various causal tasks, revealing that their grasp of formal causality is often superficial and prone to heuristic-based errors. A key development in rigorously probing these limits is the CLADDER dataset [9], which moves beyond commonsense questions by grounding them in symbolic queries derived from an oracle causal inference engine. By evaluating models against the formal rungs of Pearl’s Ladder of Causation [10], the authors found that even with bespoke prompting strategies like CAUSALCOT, LLMs struggle significantly with formal, rule-based inference. This concern over the fallibility of LLMs’ causal understanding is echoed by other research, which shows that models are susceptible to inferring causality from simple positional cues or temporal precedence (post hoc fallacy) and struggle to infer causal links from counterfactual evidence, suggesting a reliance on memorized heuristics rather than deep reasoning [11]. Another work proposed a novel architecture (CARE-CA) [12] that integrates explicit causal knowledge from resources like ConceptNet with implicit reasoning patterns from LLMs, enhanced by counterfactual analysis.</p>
      <p>This susceptibility to temporal fallacies underscores a critical prerequisite for robust causal reasoning: a coherent understanding of time itself. However, research demonstrates that LLMs’ internal model of time is fragile. The authors of [13] identify several key failure modes, including temporal shifts, invariance and inertia, where models either disregard the specific time in a query or fail to update long-held facts. Recognizing that direct reasoning over unstructured text may be the source of this fragility, some approaches focus on actively mitigating these flaws. The TG-LLM framework, for instance, proposes a two-step process: first translating unstructured text into a formal temporal graph and then fine-tuning the LLM to perform Chain-of-Thought reasoning over this explicit structure [14]. This methodological shift from implicit to explicit representation significantly improves performance, highlighting that the reasoning deficit may lie more in parsing complexity than in logical inability.</p>
      <p>The challenge of causal reasoning becomes even more pronounced when extending from the linguistic to the multimodal domain, where models must integrate visual evidence with abstract knowledge. Recent benchmarks reveal that the performance of state-of-the-art VLMs is often no better than random chance. The MuCR benchmark [15], designed to test the inference of cause-and-effect from visual cues alone, found that models either suffer from inadequate visual perception or are biased by their language priors to the point of ignoring contradictory visual evidence. This deficiency is not merely about identifying simple causal chains. The NL-EYE benchmark, which frames abductive reasoning as a visual entailment task, found that VLMs perform at or below random baselines on a task humans find trivial [16]. Crucially, the failure was not one of logic: when given textual descriptions of the scenes, the models succeeded. The breakdown occurs in visual interpretation, where models are distracted by superficial cues and fail to grasp the underlying commonsense relationships. This points to a fundamental gap between a model’s linguistic reasoning capabilities and its ability to ground that reasoning in the perceptual world. Similarly, the TemporalVQA benchmark tests models on temporal order understanding and time-lapse estimation between images [17]. Their conclusions reveal that even top-tier models perform at or below random chance, are highly sensitive to image layout, and rely on superficial spatial cues rather than genuine temporal comprehension.</p>
    </sec>
    <sec id="sec-1b">
      <title>3. The Visual-ExpliCa Dataset</title>
      <p>The empirical investigation presented in this paper relies on a carefully constructed dataset, specifically created to align visual stimuli with textual ones from the ExpliCa dataset [8]. ExpliCa features 600 unique events, each represented by a pair of sentences. These pairs are linked by an explicit connective that establishes one of three relationship types: causal (so, because), temporal (then, after) or unrelated. The connectives define the nature and directionality of the relationship between the two sentences. Specifically, this directionality distinguishes between iconic relations, where the order of sentences reflects the chronological or causal sequence of events (i.e., with connectives so and then), and anti-iconic relations, where the presentation order is inverted relative to the logical flow (i.e., with connectives because and after). Explicit connectives for sentence pairs were selected via crowdsourcing experiments [8]. Additionally, ExpliCa is controlled for potential confounding biases, such as Lexical Association Bias (ensuring that word co-occurrences within sentence pairs do not disproportionately favor certain relationship types) and Frequency Bias (ensuring that the linguistic structures representing different relations are comparably frequent in natural language). This makes it a robust resource for evaluating genuine reasoning rather than statistical shortcuts.</p>
      <sec id="sec-1-1">
        <title>This makes it a robust resource for evaluating genuine</title>
        <p>reasoning rather than statistical shortcuts.</p>
        <p>In building Visual-ExpliCa, we focused exclusively on
the causal and temporal relations, excluding the unrelated
category of the original dataset. In order to collect
visuals matching sentences in the dataset, we first conducted
some pre-processing steps. These involved i)
lemmatisation, to mitigate data sparsity issues and to alleviate
issues with VLMs struggling with temporal dimensions
encoded in verb conjugations [18], and ii) NER,
specifically to replace people NEs with generic placeholders
(e.g., "Matteo" is replaced by "[PERSON]"), and prevent
image retrieval to focus on specific individuals rather
than the core actions and concepts of the sentence. For
pre-processing, we used SpaCy.2</p>
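      <p>The following is a minimal sketch of the pre-processing step just described (lemmatisation plus masking of person entities with the [PERSON] placeholder). It is an illustrative reconstruction that assumes an English spaCy pipeline; it is not the script actually used to build Visual-ExpliCa.</p>
      <preformat>
# Illustrative sketch of the pre-processing described above.
# Assumption: an English spaCy model (requires: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(sentence: str) -> str:
    """Lemmatise a sentence and mask person named entities with [PERSON]."""
    doc = nlp(sentence)
    tokens = []
    for token in doc:
        if token.ent_type_ == "PERSON":
            # Avoid repeating the placeholder for multi-token names.
            if not (tokens and tokens[-1] == "[PERSON]"):
                tokens.append("[PERSON]")
        elif not token.is_punct:
            tokens.append(token.lemma_.lower())
    return " ".join(tokens)

print(preprocess("Matteo opened the window"))
# expected output (roughly): "[PERSON] open the window"
      </preformat>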
        <sec id="sec-1-1-1">
          <title>3.1. Images Collection</title>
          <p>Relation, Direction
Count
so
then
because
after
Total
Caus., Ic.</p>
          <p>Temp., Ic.</p>
          <p>Caus., A-Ic.</p>
          <p>Temp., A-Ic.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Images to match sentences of ExpliCa were mostly col</title>
        <p>lected from the Fondant-CC-25M dataset. 3 It is a
largescale image corpus derived from CommonCrawl,
composed exclusively of images with Creative Commons
licenses. This choice ensures ethical usage and avoids
copyright issues prevalent in many traditional image datasets.</p>
        <p>[Figure 1: An example of a sentence pair with images; the relation in this case is Causal, Iconic.]</p>
        <p>To retrieve images, we used the clip-retrieval library (https://github.com/rom1504/clip-retrieval). This tool leverages CLIP (Contrastive Language-Image Pre-Training) [19] to find the images whose embeddings are semantically closest to the embedding of the text query. For each sentence, we selected the 10 images with the highest CLIP score. Then, to ensure a reasonable degree of semantic alignment between the visual and textual components, we conducted a further manual review to select the final image for each sentence.</p>
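        <p>The sketch below illustrates the retrieval criterion (not the internals of the clip-retrieval library): candidate images are scored against a text query with an off-the-shelf CLIP checkpoint and the ten best-scoring ones are kept. The checkpoint name and the local list of candidate paths are assumptions made for the example.</p>
        <preformat>
# Illustrative sketch of CLIP-based retrieval: rank candidate images by their
# similarity to a text query and keep the top 10 (assumption: Hugging Face
# transformers and the openai/clip-vit-base-patch32 checkpoint).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_k_images(query: str, image_paths: list[str], k: int = 10) -> list[str]:
    """Rank candidate images by CLIP text-image similarity and keep the top k."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: similarity of each image to the single text query.
    scores = outputs.logits_per_image.squeeze(-1)
    ranked = sorted(zip(image_paths, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [path for path, _ in ranked[:k]]
        </preformat>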
        <p>For a small number of sentences we were not able to retrieve high-quality images. To address these cases, we resorted to text-to-image generation. Specifically, we used the Segmind Stable Diffusion model (https://huggingface.co/segmind/SSD-1B) to create visual representations for captions that were too abstract or too specific for the retrieval process. The generative approach was required for 39 individual captions (out of the 778 total captions in the final dataset).</p>
        <p>Nevertheless, a smaller subset of captions proved intractable. Specifically, for 12 sentence pairs of the original dataset it was not possible to obtain a suitable image for at least one of the two descriptions, either through retrieval or generation. We chose to exclude the entire sentence pair from the final analysis to ensure the quality and coherence of the dataset. Consequently, the final curated multimodal dataset used for our experiments consists of 388 event pairs. Table 1 shows the distribution of categories in the dataset.</p>
      </sec>
    </sec>
    <sec id="sec-1c">
      <title>4. Experimental Setup</title>
      <sec id="sec-1c-1">
        <title>4.1. Models</title>
        <p>To evaluate the capabilities of current VLMs in causal and temporal reasoning, we selected two prominent models representing distinct architectural families and development origins: Llama-11b-vision from Meta AI [20] and Gemini 2.0 Flash from Google DeepMind [21].</p>
        <p>Llama-11b-vision is part of the Llama 3.2-Vision family of models, released by Meta in September 2024. These models are designed to be natively multimodal, capable of processing paired image and text inputs to generate textual outputs. Its architecture builds upon the Llama 3.1 LLM family. The instruction-tuned versions of Llama-3.2-Vision, including the variant used here, are optimized through a combination of Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) [20]. The authors argue that this alignment process aims to enhance the model’s utility, safety and ability to follow instructions. The vision component was pre-trained on a dataset of 6 billion image-text pairs.</p>
        <p>Gemini 2.0 Flash is a multimodal large language model (text, image, audio, video) with a 1M-token context window, positioned as an upgrade over Gemini 1.5 Flash. It is reported to achieve improved efficiency and benchmark performance through a refined Mixture-of-Experts Transformer architecture and supports real-time multimodal interactions [22]. It inherits the general Gemini philosophy of deep interweaving of modalities.</p>
        <p>We chose these models to reflect two contrasting trends in multimodal AI: Llama, an open-source and relatively small model accessible for research at modest computational cost, and Gemini Flash, a closed but comparatively compact commercial system optimized for efficiency and lower inference costs. This contrast highlights differences in openness, scale, and resource demands, providing a balanced testbed for evaluating causal and temporal reasoning.</p>
      </sec>
        <sec id="sec-1-2-1">
          <title>4.2. Tasks design</title>
          <p>To systematically probe the models’ reasoning
capabilities, we designed five distinct experimental tasks
grounded in the Visual-ExpliCa dataset. These tasks are
structured to progressively increase in complexity and
are organized into two primary conditions that directly
address our research objectives: assessing reasoning from
visual-only input (Tasks 1 to 3) and evaluating
multimodal integration (Tasks 4 and 5).</p>
          <p>We employ a Multimodal-Chain-of-Thought (Multimodal-CoT) strategy for prompting in visual-only tasks. This strategy is inspired by [23], and is aimed at addressing one of the most critical failure modes in prompting VLMs, i.e. their tendency to rely on superficial visual processing and get distracted by irrelevant cues. In contrast, using Multimodal-CoT we structure the prompt to first elicit a description and interpretation of the visual information before attempting further reasoning, to establish a grounded rationale. This visual analysis then serves as the foundation for the reasoning steps needed to derive the final conclusion, effectively creating a reasoning chain [24]. We report examples of prompts in the Appendix.</p>
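          <p>The snippet below sketches the describe-then-reason structure of such a prompt. It is only illustrative: the exact wording used in the experiments is the one reported in the Appendix, and the template variables introduced here ({relation_type}, {final_question}) are assumptions made for the example.</p>
          <preformat>
# Minimal sketch of a Multimodal-CoT prompt: the model is first asked to
# describe both images, and only then to reason about their relation.
MULTIMODAL_COT_TEMPLATE = (
    "The image above contains two separated images: Image a (on the left) "
    "and Image b (on the right).\n"
    "Step 1 - Describe the elements in both images.\n"
    "Step 2 - Based on your description, reason about the {relation_type} "
    "relationship between the two images.\n"
    "Step 3 - {final_question}\n"
    "Do not provide explanations, additional text or commentary in the final answer."
)

prompt = MULTIMODAL_COT_TEMPLATE.format(
    relation_type="causal",
    final_question="Respond with 'causal' or 'temporal'.",
)
print(prompt)
          </preformat>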
          <p>The first three tasks are designed to isolate the models’ ability to infer causal and temporal relations relying solely on visual evidence. The model is first prompted to describe the visual content of the two images before being asked to perform the specific reasoning step. The final two tasks assess how performance varies given the support of textual data, thus evaluating the models’ capacity to integrate information from both modalities. In the following, we detail each task.</p>
          <p>Task 1. Relation Identification In the first task, the model’s goal is to classify the fundamental relationship between the two visual depictions of events as either causal or temporal, regardless of the order in which they are presented.</p>
          <p>Task 2. Directionality Specification In the second task, the model’s goal is to determine the logical order of the events, identifying which image represents the antecedent and which the consequent, regardless of their causal or temporal relation.</p>
        <p>Task 3. Connective Selection In the third task, the
model’s goal is to provide the most appropriate
linguistic connective (among so, because, then,
and after) given the pair of images representing
the events, in a specific order. Recall that each
connective is directly associated with a Relation
(causal or temporal) and a Direction of such
relation (iconic or anti-iconic).</p>
        <p>Task 4. Connective Selection With Captions The
fourth task is analogous to the third task.
However, in this case the model is provided with
both the images and their corresponding textual
description of the events from ExpliCa. This
allows for a direct comparison of performance
with and without linguistic context.</p>
          <p>Task 5. Acceptability rating In the fifth and final task, we replicate one of the experiments conducted on ExpliCa in [8]. Here, the model must perform a holistic evaluation of a complete multimodal input (two images, two captions and a human-provided connective). It is tasked with providing a numerical plausibility rating from 1 to 10, simulating a human-like judgment of coherence. We chose to exclude Llama-11b-vision from this specific task, as preliminary tests revealed it was unreliable in consistently generating ratings in the required numerical format. This is a known issue also reported in [8]. We can speculate that it is probably due to the limited model size. Conversely, to robustly assess Gemini-2.0-Flash and account for output variability, we prompted it to generate five distinct ratings for each event. This was achieved by querying the model five times, each with a different temperature setting to modulate the randomness of the output. We used the average of these ratings as the final score.</p>
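          <p>A minimal sketch of this repeated-query protocol is shown below. The temperature values and the ask_model callable are assumptions introduced for illustration (the paper does not report the exact settings); ask_model stands for any function that sends the multimodal Task 5 prompt at a given temperature and returns the raw completion text.</p>
          <preformat>
import statistics
from typing import Callable

# Sketch of the repeated-query protocol for Task 5: query the model five times
# with different temperatures and average the numerical ratings.
TEMPERATURES = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumption: exact values not reported

def acceptability_score(ask_model: Callable[[float], str]) -> float:
    ratings = []
    for temperature in TEMPERATURES:
        raw = ask_model(temperature)
        try:
            ratings.append(float(raw.strip()))
        except ValueError:
            # Responses not in the required numerical format are skipped.
            continue
    return statistics.mean(ratings)
          </preformat>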
        <sec id="sec-1-3-1">
          <title>4.3. Evaluation</title>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>Our evaluation strategy was designed to measure the multifaceted nature of the models’ causal and temporal reasoning across the five experimental tasks. The metrics</title>
      </sec>
      <sec id="sec-1-5">
        <title>Temp. respectively, and abbreviate Iconic and Anti-Iconic</title>
        <p>with Ic. and A-Ic. respectively.
were chosen to reflect the nature of each task, ranging
from categorical decisions to graded plausibility
judgments.</p>
          <p>For tasks requiring a categorical decision (Tasks 1-4), we employed a “cloze test” paradigm, mirroring the evaluation approach often used for the ExpliCa dataset [8]. In this setup, the models were presented with the input (either images only, or images and partly-hidden captions) and asked to “fill in the blank” by choosing the most suitable option from a predefined list of candidates. A response was considered correct only if it exactly matched the designated ground truth; both incorrect choices and responses that did not conform to one of the choices were marked as errors. The primary evaluation metric for these tasks was Accuracy. However, for Tasks 3 (Connective Selection) and 4 (Connective Selection with Captions), which involve a multi-class classification among four connectives, we also computed the F1-score. This metric provides a more balanced assessment than accuracy alone, as it considers both precision and recall for each connective class. This is particularly useful for identifying whether a model’s performance is uniform across the different logical relationships or whether it excels at some at the expense of others.</p>
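          <p>The scoring rule for the connective-selection tasks can be sketched as follows. This is a minimal reconstruction assuming scikit-learn, not the authors’ evaluation code; the "invalid" label is introduced here only to make non-conforming responses count as errors.</p>
          <preformat>
from sklearn.metrics import accuracy_score, f1_score

CONNECTIVES = ["so", "because", "then", "after"]

def score_cloze(predictions, gold):
    """Score Tasks 3-4: a prediction is correct only if it exactly matches the
    gold connective; any response outside the candidate list counts as an error."""
    cleaned = [p.strip().lower() if p.strip().lower() in CONNECTIVES else "invalid"
               for p in predictions]
    accuracy = accuracy_score(gold, cleaned)
    per_class_f1 = f1_score(gold, cleaned, labels=CONNECTIVES, average=None)
    return accuracy, dict(zip(CONNECTIVES, per_class_f1))
          </preformat>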
          <p>For Task 2 (Directionality Specification), correctness was determined by the alignment between the event order identified by the model and the iconicity status (iconic/anti-iconic) of the original pair. For example, if the model identified Image A (presented first) as the antecedent and Image B as the consequent, the answer was deemed correct only if the ground-truth connective for the original pair was iconic (i.e., “so” or “then”).</p>
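          <p>A compact sketch of this Task 2 scoring rule, written out under the mapping between connectives and iconicity described in Section 3, is the following (an illustration, not the original code):</p>
          <preformat>
ICONIC = {"so", "then"}           # antecedent presented first
ANTI_ICONIC = {"because", "after"}  # antecedent presented second

def task2_correct(antecedent_choice: str, gold_connective: str) -> bool:
    """Choosing the first image (Image a) as antecedent is correct only for
    iconic pairs; choosing the second image is correct only for anti-iconic pairs."""
    if antecedent_choice == "Image a":
        return gold_connective in ICONIC
    return gold_connective in ANTI_ICONIC
          </preformat>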
          <p>Finally, for Task 5 (Acceptability Rating), evaluation was based on the Pearson correlation between the scores generated by the model and the human-provided acceptability judgments for the highest-rated connective. To ensure that the values were comparable on a common scale, both the model ratings and the human judgments were first normalized using the min-max technique. This allowed us to quantify the degree of alignment between the plausibility assessments of the model and of humans.</p>
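          <p>A minimal sketch of this normalisation and correlation step, assuming NumPy and SciPy, is the following:</p>
          <preformat>
import numpy as np
from scipy.stats import pearsonr

def min_max(x):
    """Rescale a set of ratings to the [0, 1] interval."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def alignment(model_ratings, human_ratings):
    """Normalise both rating sets and compute their Pearson correlation."""
    r, p_value = pearsonr(min_max(model_ratings), min_max(human_ratings))
    return r, p_value
          </preformat>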
        </sec>
      </sec>
      <sec id="sec-1d">
        <title>5. Results and Discussion</title>
        <p>In this Section, we outline and discuss the results obtained by the models on all tasks. In the presentation of the results, we abbreviate Causal and Temporal as Caus. and Temp. respectively, and abbreviate Iconic and Anti-Iconic as Ic. and A-Ic. respectively.</p>
        <p>First, we evaluate the performance of the VLMs on the causal and temporal reasoning tasks using only visual inputs. Results for Task 1 (Relation Identification) are reported in Table 2, while results for Task 2 (Directionality Specification) are shown in Figure 2. We observe a two-tiered competency. The models can broadly classify the type of relationship (causal vs temporal) with above-chance accuracy. However, they largely fail to determine its underlying structure and directionality. In Task 1, both models perform significantly better than the random baseline, indicating that they can extract relevant signals from the image pairs. A closer look at the results in Table 2 reveals that Gemini-flash-2.0 shows a clear proficiency on temporal relations (87% accuracy), suggesting a default tendency to interpret visual sequences as a chronological progression. In contrast, Llama-11b-vision demonstrates the inverse pattern, excelling at identifying causal relations (86% accuracy), implying a strong prior to infer cause-and-effect. This superficial competence however breaks down when the models are required to identify the directionality of the relationship in Task 2 (Figure 2). Performance plummets for both models, and this failure is almost entirely attributable to an inability to process anti-iconic relations, thus revealing a noticeable “iconicity bias”. This bias manifests as a dependency on the perceptual order of visual events to infer their logical structure. Llama-11b-vision excels at identifying the direction for the Temporal Iconic connective then, but its performance on the anti-iconic connectives is non-existent. Gemini-flash-2.0 appears more robust, but displays a similar pattern, with a moderate accuracy on iconic relations but a sharp drop in performance for anti-iconic relations (connectives because and after).</p>
        <p>[Figure 2: Results for Task 2 on each connective.]</p>
        <p>[Table 3: Accuracy and per-connective F1 scores (Causal: so (Ic.), because (A-Ic.); Temporal: then (Ic.), after (A-Ic.)) for Gemini and LLaMA on Task 3 and Task 4; the values are not recoverable from the extracted text.]</p>
        <p>Table 3 reports results on Tasks 3 (Connective Selection) and 4 (Connective Selection With Captions). Task 4, which provides both images and their corresponding textual captions, offers an ideal setting to assess the practical utility of visual grounding in multimodal models. Here, the models receive both images and their corresponding textual captions, and their performance can be directly compared to that of the text-only LLMs evaluated on the same cloze task in the original ExpliCa study [8]. The multimodal models, particularly Gemini-flash-2.0, achieve overall comparable or slightly better results (0.64 vs 0.62 accuracy) than strong text-only proprietary models. This suggests that the visual input may actually provide effective grounding, reinforcing or clarifying the relationship expressed via text without being a hindrance. Similarly, Llama-11b-vision’s multimodal performance aligns with that of text-only open-source LLMs (0.33 vs 0.34 accuracy). Nevertheless, if we look at the confusion matrices in Figure 3, we observe that they reinforce the findings from the previous tasks: the models’ performance is in general dictated by the iconicity of the underlying relation, even more so than in the original study. This may suggest that, while visual inputs can prove beneficial on a surface level, their order of presentation may strongly affect and bias the models’ ability, especially in anti-iconic cases. This may also be taken as an indication that the models’ training data contained a significantly larger number of “iconic examples”.</p>
        <p>Finally, results for Task 5 are shown in Figures 4 and 5 and Table 4. Recall that the objective of Task 5 is to provide a numerical plausibility rating from 1 (completely incoherent) to 10 (perfectly coherent) for the complete multimodal event: both images, their corresponding textual captions, and the human-provided connective linking them. Also recall that Task 5 was evaluated only on Gemini-flash-2.0. To enable a direct comparison between the model’s output and the human judgments, both sets of scores were first normalized to a common scale using a min-max scaler. The density plots in Figure 4 reveal both a promising alignment and critical divergences. For the iconic connectives, the model’s scores show a distribution that closely resembles the human distribution of the connective with the highest rating. Both distributions are heavily skewed towards higher values (0.8-1.0), indicating that the model, like humans, finds these iconic constructions highly plausible. Conversely, a significant discrepancy emerges for the anti-iconic connectives. For because and especially after, the human ratings show a much broader distribution with a notable peak in the mid-to-low range, indicating greater uncertainty and lower acceptability in general. To quantify this alignment, we computed the Pearson correlation between the model’s ratings and the human judgments (see Table 4). The results confirm the visual trend: we observe a moderate and statistically significant correlation for the iconic connectives so and then. The correlation is weaker for the anti-iconic connective because, and becomes statistically insignificant for after.</p>
        <p>To better understand the sources of divergence
between the model’s and human judgments, particularly for
the cases that the model rated as highly implausible, we
performed an outlier analysis. We specifically focused on
low-scoring outliers, which we formally identified using
the interquartile range (IQR) rule: any data point falling
below the first quartile (Q1) minus 1.5 times the IQR was
flagged. As noted in the original ExpliCa dataset, a subset
of sentences were intentionally designed to be socially
challenging, touching on sensitive topics like religion,
immigration, drug abuse or sex. Our analysis (Figure
5) reveals that a significant portion of the outliers are
directly attributable to this subset. Specifically, 13 out of
the 31 most prominent low-scoring outliers correspond
to these socially challenging sentences. This finding
suggests that the model’s performance may be influenced by
its internal bias-mitigation and safety alignment
protocols. When confronted with sensitive content, the model
appears to override its linguistic and logical assessment,
assigning a very low acceptability score regardless of the
sentence’s grammatical or causal coherence. This
highlights a potential conflict where safety-driven heuristics
can interfere with and ultimately degrade the model’s
core reasoning capabilities on specific types of content.</p>
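        <p>The IQR rule used to identify the low-scoring outliers can be sketched as follows (a minimal NumPy illustration of the criterion described above, not the original analysis script):</p>
        <preformat>
import numpy as np

def low_scoring_outliers(scores):
    """Flag items whose score falls below Q1 - 1.5 * IQR."""
    scores = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    # A score is an outlier when the lower bound exceeds it.
    return np.flatnonzero(lower_bound > scores)
        </preformat>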
      </sec>
    <sec id="sec-2">
      <title>6. Conclusion and future works</title>
      <sec id="sec-2-1">
        <title>This paper investigated the capacity of modern Vision</title>
        <p>Language Models to reason about the structure of events.</p>
        <p>We augmented a curated dataset on causal reasoning
with visual stimuli, and designed five tasks of
increasing dificulty to asses how well the evaluated systems
handle causal and temporal relationships, particularly
when the logical flow of events diverges from their visual
presentation. The central finding of our experiments is a
profound vulnerability of the tested VLMs to an
"iconicity bias." This manifests as a sharp decline in accuracy for
anti-iconic relations, revealing a dependency on
perceptual order over abstract logic. This weakness in abstract
reasoning is likely rooted in an equally fragile
foundational visual understanding. Recent studies using
controlled evaluation frameworks [25], have in fact shown
that VLMs struggle to robustly identify even fundamen- stimuli. While the ExpliCa dataset contains mostly
contal object properties (like color or shape) and their basic crete, everyday scenarios, searching for relations
bespatial relations. Indeed, their performance is heavily tween their abstractness and the performances of the
dependent on positional biases, with objects at the center model may yield more robust findings.
of an image being recognized more reliably than those
at the periphery. If models fail to build a stable and
reliable representation of a single scene, their ability to References
infer complex causal and temporal relationships across
multiple scenes becomes inherently compromised. The [1] A. Lenci, Understanding natural language
macroscopic failures we observed (e.g., the iconicity bias) understanding systems. a critical analysis,
can therefore be seen as a direct consequence of these 2023. URL: https://arxiv.org/abs/2303.04229.
microscopic weaknesses. Furthermore, our analysis indi- arXiv:2303.04229.
cates that this reasoning is not purely logical; it may also [2] C. D. Manning, Human language understanding &amp;
be modulated by the models’ safety training, which can reasoning, Daedalus 151 (2022) 127–138. URL: https:
produce inconsistent evaluations of causally coherent //api.semanticscholar.org/CorpusID:248377870.
but sensitive content. Taken together, these results chal- [3] K. Mahowald, A. A. Ivanova, I. A. Blank, N.
Kanlenge the notion that scaling and multimodal pre-training wisher, J. B. Tenenbaum, E. Fedorenko,
Dissoare suficient for achieving robust, human-like reason- ciating language and thought in large language
ing. The models’ reliance on perceptual heuristics points models, 2024. URL: https://arxiv.org/abs/2301.06627.
to a fundamental gap between their pattern-matching
prowess and their ability to model the more complex,
non-sequential nature of real-world events.</p>
        <p>A crucial next step is to investigate whether these
behavioral failures reflect a deeper deficit in the
models’ underlying competence. A more direct evaluation,
drawing on the framework of Hu and Levy [26], would
involve measuring the log-likelihood that models assign
to different event structures. However, this approach
faces a significant technical barrier: the public APIs for
state-of-the-art multimodal models, including Gemini
2.0 Flash, do not currently provide access to token-level
log-likelihoods. This constraint makes it impossible to
directly probe their internal probability distributions.
Future work should therefore seek to replicate this study
using open-source VLMs where such access is possible.</p>
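      <p>For open-weights models where token-level scores are accessible, such a log-likelihood probe could look like the sketch below. It is only an illustration of the measurement, written for a text-only model for brevity (with an open VLM, the image inputs would additionally be passed to the model); the model name is an example, not the one used in this study.</p>
      <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of a log-likelihood probe in the spirit of Hu and Levy [26];
# model name is only an example.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`.
    Assumes the prompt tokenization is a prefix of the full-sequence tokenization."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_scores = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    # Keep only the tokens belonging to the continuation.
    return token_scores[0, n_prompt - 1:].sum().item()

# Comparing an iconic and an anti-iconic phrasing of the same event pair:
# continuation_logprob("He dropped the glass, ", "so it shattered.")
# continuation_logprob("The glass shattered, ", "because he dropped it.")
      </preformat>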
    </sec>
    <sec id="sec-3">
      <title>Limitations</title>
      <p>While the present work provides some interesting insights, it is fundamental to point out several of its limitations. First, the two models chosen for the analysis can be considered as good representatives of open-weights and closed-weights models in the small to medium-sized model range; we purposely avoided using larger VLMs as they typically come with a high computational (or monetary) cost. However, we must acknowledge that the paper’s results may not hold for other VLMs.</p>
      <p>Second, we leverage CoT prompting, but do not present here an analysis of the results from the CoT; these could point to additional insights. In addition to this, we must note that we did not perform any prompt-level optimization to improve the performance of each model individually.</p>
      <p>Third, we do not account for the abstractness of the stimuli. While the ExpliCa dataset contains mostly concrete, everyday scenarios, searching for relations between their abstractness and the performances of the models may yield more robust findings.</p>
    </sec>
    <sec id="sec-4">
      <title>A. Prompts</title>
      <sec id="sec-4-1">
        <title>Task 1. Relation Identification (Causal)</title>
        <p>The image above contains two separated images: Image a (on the left) and Image b (on the right). Describe the elements in both images. Now, think abstractly about the relationship between the two images. Focus on the general cause-and-effect pattern rather than specific details. The antecedent is the event that happens first and directly causes another event (the cause). The consequent is the event that happens as a result of the antecedent (the effect). If Image a is the consequent and Image b is the antecedent, respond with Image a. If Image b is the consequent and Image a is the antecedent, respond with Image b. Do not provide explanations, additional text or commentary.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Task 1. Relation Identification (Temporal)</title>
        <p>The image above contains two separated images: Image a (on the left) and Image b (on the right). Describe the elements in both images. Now, think about the temporal relationship between the two images. Focus on the sequence of events rather than specific details. If Image a follows Image b, respond with Image a. If Image b follows Image a, respond with Image b. Do not provide explanations, additional text or commentary.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Task 3. Connective Selection</title>
      </sec>
      <sec id="sec-4-4">
        <title>Task 5. Acceptability Rating</title>
        <p>Evaluate the acceptability of sentences that describe two events linked by connectives: ‘so’, ‘because’, ‘after’, and ‘then’. Rate each sentence on a scale from 1 to 10 based on how well the connective expresses the relationship between the events. Each event is also visually represented by an image: the left image corresponds to the first sentence and the right image corresponds to the second sentence. Sentence: sentence_a connective sentence_b. Provide only a numerical rating between 1 and 10, without explanations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI), Gemini (Google), and Grammarly in order to: generate images, paraphrase and reword, improve writing style, and check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>arXiv:2301</source>
          .
          <fpage>06627</fpage>
          . [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>A survey of vision-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>language</surname>
          </string-name>
          pre-trained models,
          <year>2022</year>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          arxiv.org/abs/2202.10936. arXiv:
          <volume>2202</volume>
          .
          <fpage>10936</fpage>
          . [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , Vision-
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>future trends</source>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2210.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          09263. arXiv:
          <volume>2210</volume>
          .
          <fpage>09263</fpage>
          . [6]
          <string-name>
            <given-names>L. W.</given-names>
            <surname>Barsalou</surname>
          </string-name>
          , Grounded cognition: Past,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>716</fpage>
          -
          <lpage>724</lpage>
          . doi:
          <volume>10</volume>
          .1111/j.1756-
          <fpage>8765</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <year>2010</year>
          .01115.x. [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomason</surname>
          </string-name>
          , J. Andreas,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>grounds language</source>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <year>2004</year>
          .10151. arXiv:
          <year>2004</year>
          .
          <volume>10151</volume>
          . [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Miliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auriemma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          , E. Cher-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>tion for Computational Linguistics: ACL</source>
          <year>2025</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>enna</surname>
          </string-name>
          , Austria,
          <year>2025</year>
          , pp.
          <fpage>17335</fpage>
          -
          <lpage>17355</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          //aclanthology.org/
          <year>2025</year>
          .findings-acl.
          <volume>891</volume>
          /. doi: 10.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>