<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Seeing Cause and Time: a Visually Grounded Evaluation of Multimodal Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore Ergoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Reasoning about causal and temporal relationships is fundamental to human intelligence but poses a persistent challenge for AI. Vision-Language Models (VLMs) offer a promising path towards more robust conceptual understanding by grounding language in perception. However, it is unclear if this grounding enables genuine, human-like reasoning. We investigate this question by focusing on the causal and temporal abilities of two leading VLMs using a novel multimodal dataset derived from the ExpliCa dataset. Through a series of carefully designed tasks, we isolate their performance on visual-only input versus combined visual-textual inputs. Our results show that while models exhibit some reasoning capability, they are hindered by a marked “iconicity bias”: their performance degrades on relations where the perceptual sequence of images mismatches the logical event order (i.e., anti-iconic). This reliance on simple visual heuristics suggests that their high-level reasoning failures may be symptomatic of a more fundamental, fragile visual understanding.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodality</kwd>
        <kwd>Causal Reasoning</kwd>
        <kwd>Temporal Reasoning</kwd>
        <kwd>Vision Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The ability to comprehend and reason about causal and temporal relationships is a cornerstone of human cognition, underpinning our capacity to understand narratives, predict outcomes and navigate the complexities of the world. We effortlessly discern why an event occurred and the sequence in which events unfolded, integrating information from various modalities. While Large Language Models (LLMs) have demonstrated remarkable fluency in generating text that describes such relationships, a critical question remains: do they possess a genuine, human-like understanding of these fundamental concepts, or do they primarily rely on sophisticated pattern matching learned from vast textual corpora [1, 2]? This distinction is crucial, as linguistic proficiency can sometimes obscure deeper cognitive limitations, a phenomenon known as the “fallacy of language as thought” [3].</p>
      <p>Recent advancements have led to the development of Vision Language Models (VLMs), which are trained on both textual and visual data [4, 5]. This multimodal grounding offers a potential pathway to richer, more robust representations of concepts, potentially bridging the gap between linguistic competence and conceptual understanding, as human meaning representation itself relies on multiple modalities [6, 7]. However, the extent to which this enriched input translates into superior causal and temporal reasoning capabilities remains an area in need of investigation.</p>
      <p>This paper contributes to this line of inquiry by conducting a focused analysis of the causal and temporal reasoning abilities of two distinct, current-generation multimodal models: Llama-11b-vision and Gemini-flash-2.0. We explore their performance with a series of carefully designed tasks on a multimodal version of the ExpliCa dataset, which explicitly combines causal and temporal relations [8]. Our objective is twofold: first, we aim to assess the models’ capacity to infer these relations from visual input alone; second, we want to assess how their performance changes when the visual stimuli accompany the textual captions. We do so by comparing models with differing architectures and parameter counts and by varying the input modalities. Our experimental methodology involves i) constructing a novel image dataset, which we name Visual-ExpliCa, aligned with the ExpliCa dataset, and ii) evaluating the models on five distinct tasks of increasing difficulty. The tasks range from directly identifying the type of relationship (causal vs. temporal) and specifying the antecedent and consequent from image-only input, to selecting the correct linguistic connective and judging the overall acceptability of an event when both images and textual descriptions are provided. Through this graduated approach, we seek to disentangle the models’ visual inferencing capabilities from their ability to integrate multimodal information.</p>
      <p>Our findings reveal that while both models demonstrate capabilities beyond chance in interpreting visual sequences, they exhibit distinct strengths, weaknesses and biases, particularly struggling with anti-iconic relations (i.e., when the sequence of events is inverted compared to their chronological and/or logical-causal order) when relying solely on visual input. This suggests that current VLMs, despite their multimodal training, may still heavily favour direct, sequential interpretations of visual information for complex reasoning tasks.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Related works</title>
      <p>A growing body of work focuses on assessing the reasoning abilities of pre-trained models, particularly in the domain of causality. LLMs have been evaluated on various causal tasks, revealing that their grasp of formal causality is often superficial and prone to heuristic-based errors. A key development in rigorously probing these limits is the CLADDER dataset [9], which moves beyond commonsense questions by grounding them in symbolic queries derived from an oracle causal inference engine. By evaluating models against the formal rungs of Pearl’s Ladder of Causation [10], the authors found that even with bespoke prompting strategies like CAUSALCOT, LLMs struggle significantly with formal, rule-based inference. This concern over the fallibility of LLMs’ causal understanding is echoed by other research, which shows that models are susceptible to inferring causality from simple positional cues or temporal precedence (post hoc fallacy) and struggle to infer causal links from counterfactual evidence, suggesting a reliance on memorized heuristics rather than deep reasoning [11]. Another work proposed a novel architecture (CARE-CA) [12] that integrates explicit causal knowledge from resources like ConceptNet with implicit reasoning patterns from LLMs, enhanced by counterfactual analysis.</p>
      <p>This susceptibility to temporal fallacies underscores a critical prerequisite for robust causal reasoning: a coherent understanding of time itself. However, research demonstrates that LLMs’ internal model of time is fragile. The authors of [13] identify several key failure modes, including temporal shifts, invariance and inertia, where models either disregard the specific time in a query or fail to update long-held facts. Recognizing that direct reasoning over unstructured text may be the source of this fragility, some approaches focus on actively mitigating these flaws. The TG-LLM framework, for instance, proposes a two-step process: first translating unstructured text into a formal temporal graph and then fine-tuning the LLM to perform Chain-of-Thought reasoning over this explicit structure [14]. This methodological shift from implicit to explicit representation significantly improves performance, highlighting that the reasoning deficit may lie more in parsing complexity than in logical inability.</p>
      <p>The challenge of causal reasoning becomes even more pronounced when extending from the linguistic to the multimodal domain, where models must integrate visual evidence with abstract knowledge. Recent benchmarks reveal that the performance of state-of-the-art VLMs is often no better than random chance. The MuCR benchmark [15], designed to test the inference of cause-and-effect from visual cues alone, found that models either suffer from inadequate visual perception or are biased by their language priors to the point of ignoring contradictory visual evidence. This deficiency is not merely about identifying simple causal chains. The NL-EYE benchmark, which frames abductive reasoning as a visual entailment task, found that VLMs perform at or below random baselines on a task humans find trivial [16]. Crucially, the failure was not one of logic: when given textual descriptions of the scenes, the models succeeded. The breakdown occurs in visual interpretation, where models are distracted by superficial cues and fail to grasp the underlying commonsense relationships. This points to a fundamental gap between a model’s linguistic reasoning capabilities and its ability to ground that reasoning in the perceptual world. Similarly, the TemporalVQA benchmark tests models on temporal order understanding and time-lapse estimation between images [17]. Their conclusions reveal that even top-tier models perform at or below random chance, are highly sensitive to image layout, and rely on superficial spatial cues rather than genuine temporal comprehension.</p>
    </sec>
    <sec id="sec-1b">
      <title>3. The Visual-ExpliCa Dataset</title>
      <p>The empirical investigation presented in this paper relies on a carefully constructed dataset, specifically created to align visual stimuli with textual ones from the ExpliCa dataset [8]. ExpliCa features 600 unique events, each represented by a pair of sentences. These pairs are linked by an explicit connective that establishes one of three relationship types: causal (so, because), temporal (then, after) or unrelated. The connectives define the nature and directionality of the relationship between the two sentences. Specifically, this directionality distinguishes between iconic relations, where the order of sentences reflects the chronological or causal sequence of events (i.e., with connectives so and then), and anti-iconic relations, where the presentation order is inverted relative to the logical flow (i.e., with connectives because and after). Explicit connectives for sentence pairs were selected via crowdsourcing experiments [8]. Additionally, ExpliCa is controlled for potential confounding biases, such as Lexical Association Bias (ensuring that word co-occurrences within sentence pairs do not disproportionately favor certain relationship types) and Frequency Bias (ensuring that the linguistic structures representing different relations are comparably frequent in natural language). This makes it a robust resource for evaluating genuine reasoning rather than statistical shortcuts.</p>
      <sec id="sec-1-1">
        <title>This makes it a robust resource for evaluating genuine</title>
        <p>reasoning rather than statistical shortcuts.</p>
        <p>In building Visual-ExpliCa, we focused exclusively on
the causal and temporal relations, excluding the unrelated
category of the original dataset. In order to collect
visuals matching sentences in the dataset, we first conducted
some pre-processing steps. These involved i)
lemmatisation, to mitigate data sparsity issues and to alleviate
issues with VLMs struggling with temporal dimensions
encoded in verb conjugations [18], and ii) NER,
specifically to replace people NEs with generic placeholders
(e.g., "Matteo" is replaced by "[PERSON]"), and prevent
image retrieval to focus on specific individuals rather
than the core actions and concepts of the sentence. For
pre-processing, we used SpaCy.2</p>
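      <p>The following is a minimal sketch of the pre-processing step just described (lemmatisation plus masking of person entities with the [PERSON] placeholder). It is an illustrative reconstruction that assumes an English spaCy pipeline; it is not the script actually used to build Visual-ExpliCa.</p>
      <preformat>
# Illustrative sketch of the pre-processing described above.
# Assumption: an English spaCy model (requires: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(sentence: str) -> str:
    """Lemmatise a sentence and mask person named entities with [PERSON]."""
    doc = nlp(sentence)
    tokens = []
    for token in doc:
        if token.ent_type_ == "PERSON":
            # Avoid repeating the placeholder for multi-token names.
            if not (tokens and tokens[-1] == "[PERSON]"):
                tokens.append("[PERSON]")
        elif not token.is_punct:
            tokens.append(token.lemma_.lower())
    return " ".join(tokens)

print(preprocess("Matteo opened the window"))
# expected output (roughly): "[PERSON] open the window"
      </preformat>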
        <sec id="sec-1-1-1">
          <title>3.1. Images Collection</title>
          <p>Relation, Direction
Count
so
then
because
after
Total
Caus., Ic.</p>
          <p>Temp., Ic.</p>
          <p>Caus., A-Ic.</p>
          <p>Temp., A-Ic.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Images to match sentences of ExpliCa were mostly col</title>
        <p>lected from the Fondant-CC-25M dataset. 3 It is a
largescale image corpus derived from CommonCrawl,
composed exclusively of images with Creative Commons
licenses. This choice ensures ethical usage and avoids
copyright issues prevalent in many traditional image datasets.</p>
        <p>[Figure 1: An example of a sentence pair with images; the relation in this case is Causal, Iconic.]</p>
        <p>To retrieve images, we used the clip-retrieval library (https://github.com/rom1504/clip-retrieval). This tool leverages CLIP (Contrastive Language-Image Pre-Training) [19] to find the images whose embeddings are semantically closest to the embedding of the text query. For each sentence, we selected the 10 images with the highest CLIP score. Then, to ensure a reasonable degree of semantic alignment between the visual and textual components, we conducted a further manual review to select the final image for each sentence.</p>
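        <p>The sketch below illustrates the retrieval criterion (not the internals of the clip-retrieval library): candidate images are scored against a text query with an off-the-shelf CLIP checkpoint and the ten best-scoring ones are kept. The checkpoint name and the local list of candidate paths are assumptions made for the example.</p>
        <preformat>
# Illustrative sketch of CLIP-based retrieval: rank candidate images by their
# similarity to a text query and keep the top 10 (assumption: Hugging Face
# transformers and the openai/clip-vit-base-patch32 checkpoint).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_k_images(query: str, image_paths: list[str], k: int = 10) -> list[str]:
    """Rank candidate images by CLIP text-image similarity and keep the top k."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: similarity of each image to the single text query.
    scores = outputs.logits_per_image.squeeze(-1)
    ranked = sorted(zip(image_paths, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [path for path, _ in ranked[:k]]
        </preformat>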
        <p>For a small number of sentences we were not able to retrieve high-quality images. To address these cases, we resorted to text-to-image generation. Specifically, we used the Segmind Stable Diffusion model (https://huggingface.co/segmind/SSD-1B) to create visual representations for captions that were too abstract or too specific for the retrieval process. The generative approach was required for 39 individual captions (out of the 778 total captions in the final dataset).</p>
        <p>Nevertheless, a smaller subset of captions proved intractable. Specifically, for 12 sentence pairs of the original dataset it was not possible to obtain a suitable image for at least one of the two descriptions, either through retrieval or generation. We chose to exclude the entire sentence pair from the final analysis to ensure the quality and coherence of the dataset. Consequently, the final curated multimodal dataset used for our experiments consists of 388 event pairs. Table 1 shows the distribution of categories in the dataset.</p>
      </sec>
    </sec>
    <sec id="sec-1c">
      <title>4. Experimental Setup</title>
      <sec id="sec-1c-1">
        <title>4.1. Models</title>
        <p>To evaluate the capabilities of current VLMs in causal and temporal reasoning, we selected two prominent models representing distinct architectural families and development origins: Llama-11b-vision from Meta AI [20] and Gemini 2.0 Flash from Google DeepMind [21].</p>
        <p>Llama-11b-vision is part of the Llama 3.2-Vision family of models, released by Meta in September 2024. These models are designed to be natively multimodal, capable of processing paired image and text inputs to generate textual outputs. Its architecture builds upon the Llama 3.1 LLM family. The instruction-tuned versions of Llama-3.2-Vision, including the variant used here, are optimized through a combination of Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) [20]. The authors argue that this alignment process aims to enhance the model’s utility, safety and ability to follow instructions. The vision component was pre-trained on a dataset of 6 billion image-text pairs.</p>
        <p>Gemini 2.0 Flash is a multimodal large language model (text, image, audio, video) with a 1M-token context window, positioned as an upgrade over Gemini 1.5 Flash. It is reported to achieve improved efficiency and benchmark performance through a refined Mixture-of-Experts Transformer architecture and supports real-time multimodal interactions [22]. It inherits the general Gemini philosophy of deep interweaving of modalities.</p>
        <p>We chose these models to reflect two contrasting trends in multimodal AI: Llama, an open-source and relatively small model accessible for research at modest computational cost, and Gemini Flash, a closed but comparatively compact commercial system optimized for efficiency and lower inference costs. This contrast highlights differences in openness, scale, and resource demands, providing a balanced testbed for evaluating causal and temporal reasoning.</p>
      </sec>
        <sec id="sec-1-2-1">
          <title>4.2. Tasks design</title>
          <p>To systematically probe the models’ reasoning
capabilities, we designed five distinct experimental tasks
grounded in the Visual-ExpliCa dataset. These tasks are
structured to progressively increase in complexity and
are organized into two primary conditions that directly
address our research objectives: assessing reasoning from
visual-only input (Tasks 1 to 3) and evaluating
multimodal integration (Tasks 4 and 5).</p>
          <p>We employ a Multimodal-Chain-of-Thought (Multimodal-CoT) strategy for prompting in visual-only tasks. This strategy is inspired by [23], and is aimed at addressing one of the most critical failure modes in prompting VLMs, i.e. their tendency to rely on superficial visual processing and get distracted by irrelevant cues. In contrast, using Multimodal-CoT we structure the prompt to first elicit a description and interpretation of the visual information before attempting further reasoning, to establish a grounded rationale. This visual analysis then serves as the foundation for the reasoning steps needed to derive the final conclusion, effectively creating a reasoning chain [24]. We report examples of prompts in the Appendix.</p>
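          <p>The snippet below sketches the describe-then-reason structure of such a prompt. It is only illustrative: the exact wording used in the experiments is the one reported in the Appendix, and the template variables introduced here ({relation_type}, {final_question}) are assumptions made for the example.</p>
          <preformat>
# Minimal sketch of a Multimodal-CoT prompt: the model is first asked to
# describe both images, and only then to reason about their relation.
MULTIMODAL_COT_TEMPLATE = (
    "The image above contains two separated images: Image a (on the left) "
    "and Image b (on the right).\n"
    "Step 1 - Describe the elements in both images.\n"
    "Step 2 - Based on your description, reason about the {relation_type} "
    "relationship between the two images.\n"
    "Step 3 - {final_question}\n"
    "Do not provide explanations, additional text or commentary in the final answer."
)

prompt = MULTIMODAL_COT_TEMPLATE.format(
    relation_type="causal",
    final_question="Respond with 'causal' or 'temporal'.",
)
print(prompt)
          </preformat>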
          <p>The first three tasks are designed to isolate the models’ ability to infer causal and temporal relations relying solely on visual evidence. The model is first prompted to describe the visual content of the two images before being asked to perform the specific reasoning step. The final two tasks assess how performance varies given the support of textual data, thus evaluating the models’ capacity to integrate information from both modalities. In the following, we detail each task.</p>
          <p>Task 1. Relation Identification In the first task, the model’s goal is to classify the fundamental relationship between the two visual depictions of events as either causal or temporal, regardless of the order in which they are presented.</p>
          <p>Task 2. Directionality Specification In the second task, the model’s goal is to determine the logical order of the events, identifying which image represents the antecedent and which the consequent, regardless of their causal or temporal relation.</p>
        <p>Task 3. Connective Selection In the third task, the
model’s goal is to provide the most appropriate
linguistic connective (among so, because, then,
and after) given the pair of images representing
the events, in a specific order. Recall that each
connective is directly associated with a Relation
(causal or temporal) and a Direction of such
relation (iconic or anti-iconic).</p>
        <p>Task 4. Connective Selection With Captions The
fourth task is analogous to the third task.
However, in this case the model is provided with
both the images and their corresponding textual
description of the events from ExpliCa. This
allows for a direct comparison of performance
with and without linguistic context.</p>
          <p>Task 5. Acceptability rating In the fifth and final task, we replicate one of the experiments conducted on ExpliCa in [8]. Here, the model must perform a holistic evaluation of a complete multimodal input (two images, two captions and a human-provided connective). It is tasked with providing a numerical plausibility rating from 1 to 10, simulating a human-like judgment of coherence. We chose to exclude Llama-11b-vision from this specific task, as preliminary tests revealed it was unreliable in consistently generating ratings in the required numerical format. This is a known issue also reported in [8]. We can speculate that it is probably due to the limited model size. Conversely, to robustly assess Gemini-2.0-Flash and account for output variability, we prompted it to generate five distinct ratings for each event. This was achieved by querying the model five times, each with a different temperature setting to modulate the randomness of the output. We used the average of these ratings as the final score.</p>
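          <p>A minimal sketch of this repeated-query protocol is shown below. The temperature values and the ask_model callable are assumptions introduced for illustration (the paper does not report the exact settings); ask_model stands for any function that sends the multimodal Task 5 prompt at a given temperature and returns the raw completion text.</p>
          <preformat>
import statistics
from typing import Callable

# Sketch of the repeated-query protocol for Task 5: query the model five times
# with different temperatures and average the numerical ratings.
TEMPERATURES = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumption: exact values not reported

def acceptability_score(ask_model: Callable[[float], str]) -> float:
    ratings = []
    for temperature in TEMPERATURES:
        raw = ask_model(temperature)
        try:
            ratings.append(float(raw.strip()))
        except ValueError:
            # Responses not in the required numerical format are skipped.
            continue
    return statistics.mean(ratings)
          </preformat>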
        <sec id="sec-1-3-1">
          <title>4.3. Evaluation</title>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>Our evaluation strategy was designed to measure the multifaceted nature of the models’ causal and temporal reasoning across the five experimental tasks. The metrics</title>
      </sec>
      <sec id="sec-1-5">
        <title>Temp. respectively, and abbreviate Iconic and Anti-Iconic</title>
        <p>with Ic. and A-Ic. respectively.
were chosen to reflect the nature of each task, ranging
from categorical decisions to graded plausibility
judgments.</p>
          <p>For tasks requiring a categorical decision (Tasks 1-4), we employed a “cloze test” paradigm, mirroring the evaluation approach often used for the ExpliCa dataset [8]. In this setup, the models were presented with the input (either images only, or images and partly-hidden captions) and asked to “fill in the blank” by choosing the most suitable option from a predefined list of candidates. A response was considered correct only if it exactly matched the designated ground truth; both incorrect choices and responses that did not conform to one of the choices were marked as errors. The primary evaluation metric for these tasks was Accuracy. However, for Tasks 3 (Connective Selection) and 4 (Connective Selection with Captions), which involve a multi-class classification among four connectives, we also computed the F1-score. This metric provides a more balanced assessment than accuracy alone, as it considers both precision and recall for each connective class. This is particularly useful for identifying whether a model’s performance is uniform across the different logical relationships or whether it excels at some at the expense of others.</p>
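          <p>The scoring rule for the connective-selection tasks can be sketched as follows. This is a minimal reconstruction assuming scikit-learn, not the authors’ evaluation code; the "invalid" label is introduced here only to make non-conforming responses count as errors.</p>
          <preformat>
from sklearn.metrics import accuracy_score, f1_score

CONNECTIVES = ["so", "because", "then", "after"]

def score_cloze(predictions, gold):
    """Score Tasks 3-4: a prediction is correct only if it exactly matches the
    gold connective; any response outside the candidate list counts as an error."""
    cleaned = [p.strip().lower() if p.strip().lower() in CONNECTIVES else "invalid"
               for p in predictions]
    accuracy = accuracy_score(gold, cleaned)
    per_class_f1 = f1_score(gold, cleaned, labels=CONNECTIVES, average=None)
    return accuracy, dict(zip(CONNECTIVES, per_class_f1))
          </preformat>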
          <p>For Task 2 (Directionality Specification), correctness was determined by the alignment between the event order identified by the model and the iconicity status (iconic/anti-iconic) of the original pair. For example, if the model identified Image A (presented first) as the antecedent and Image B as the consequent, the answer was deemed correct only if the ground-truth connective for the original pair was iconic (i.e., “so” or “then”).</p>
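          <p>A compact sketch of this Task 2 scoring rule, written out under the mapping between connectives and iconicity described in Section 3, is the following (an illustration, not the original code):</p>
          <preformat>
ICONIC = {"so", "then"}           # antecedent presented first
ANTI_ICONIC = {"because", "after"}  # antecedent presented second

def task2_correct(antecedent_choice: str, gold_connective: str) -> bool:
    """Choosing the first image (Image a) as antecedent is correct only for
    iconic pairs; choosing the second image is correct only for anti-iconic pairs."""
    if antecedent_choice == "Image a":
        return gold_connective in ICONIC
    return gold_connective in ANTI_ICONIC
          </preformat>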
          <p>Finally, for Task 5 (Acceptability Rating), evaluation was based on the Pearson correlation between the scores generated by the model and the human-provided acceptability judgments for the highest-rated connective. To ensure that the values were comparable on a common scale, both the model ratings and the human judgments were first normalized using the min-max technique. This allowed us to quantify the degree of alignment between the plausibility assessments of the model and of humans.</p>
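          <p>A minimal sketch of this normalisation and correlation step, assuming NumPy and SciPy, is the following:</p>
          <preformat>
import numpy as np
from scipy.stats import pearsonr

def min_max(x):
    """Rescale a set of ratings to the [0, 1] interval."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def alignment(model_ratings, human_ratings):
    """Normalise both rating sets and compute their Pearson correlation."""
    r, p_value = pearsonr(min_max(model_ratings), min_max(human_ratings))
    return r, p_value
          </preformat>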
        </sec>
      </sec>
      <sec id="sec-1d">
        <title>5. Results and Discussion</title>
        <p>In this Section, we outline and discuss the results obtained by the models on all tasks. In the presentation of the results, we abbreviate Causal and Temporal as Caus. and Temp. respectively, and abbreviate Iconic and Anti-Iconic as Ic. and A-Ic. respectively.</p>
        <p>First, we evaluate the performance of the VLMs on the causal and temporal reasoning tasks using only visual inputs. Results for Task 1 (Relation Identification) are reported in Table 2, while results for Task 2 (Directionality Specification) are shown in Figure 2. We observe a two-tiered competency. The models can broadly classify the type of relationship (causal vs temporal) with above-chance accuracy. However, they largely fail to determine its underlying structure and directionality. In Task 1, both models perform significantly better than the random baseline, indicating that they can extract relevant signals from the image pairs. A closer look at the results in Table 2 reveals that Gemini-flash-2.0 shows a clear proficiency on temporal relations (87% accuracy), suggesting a default tendency to interpret visual sequences as a chronological progression. In contrast, Llama-11b-vision demonstrates the inverse pattern, excelling at identifying causal relations (86% accuracy), implying a strong prior to infer cause-and-effect. This superficial competence however breaks down when the models are required to identify the directionality of the relationship in Task 2 (Figure 2). Performance plummets for both models, and this failure is almost entirely attributable to an inability to process anti-iconic relations, thus revealing a noticeable “iconicity bias”. This bias manifests as a dependency on the perceptual order of visual events to infer their logical structure. Llama-11b-vision excels at identifying the direction for the Temporal Iconic connective then, but its performance on the anti-iconic connectives is non-existent. Gemini-flash-2.0 appears more robust, but displays a similar pattern, with a moderate accuracy on iconic relations but a sharp drop in performance for anti-iconic relations (connectives because and after).</p>
        <p>[Figure 2: Results for Task 2 on each connective.]</p>
        <p>[Table 3: Accuracy and per-connective F1 scores (Causal: so (Ic.), because (A-Ic.); Temporal: then (Ic.), after (A-Ic.)) for Gemini and LLaMA on Task 3 and Task 4; the values are not recoverable from the extracted text.]</p>
        <p>Table 3 reports results on Tasks 3 (Connective Selection) and 4 (Connective Selection With Captions). Task 4, which provides both images and their corresponding textual captions, offers an ideal setting to assess the practical utility of visual grounding in multimodal models. Here, the models receive both images and their corresponding textual captions, and their performance can be directly compared to that of the text-only LLMs evaluated on the same cloze task in the original ExpliCa study [8]. The multimodal models, particularly Gemini-flash-2.0, achieve overall comparable or slightly better results (0.64 vs 0.62 accuracy) than strong text-only proprietary models. This suggests that the visual input may actually provide effective grounding, reinforcing or clarifying the relationship expressed via text without being a hindrance. Similarly, Llama-11b-vision’s multimodal performance aligns with that of text-only open-source LLMs (0.33 vs 0.34 accuracy). Nevertheless, if we look at the confusion matrices in Figure 3, we observe that they reinforce the findings from the previous tasks: the models’ performance is in general dictated by the iconicity of the underlying relation, even more so than in the original study. This may suggest that, while visual inputs can prove beneficial on a surface level, their order of presentation may strongly affect and bias the models’ ability, especially in anti-iconic cases. This may also be taken as an indication that the models’ training data contained a significantly larger number of “iconic examples”.</p>
        <p>Finally, results for Task 5 are shown in Figures 4 and 5 and Table 4. Recall that the objective of Task 5 is to provide a numerical plausibility rating from 1 (completely incoherent) to 10 (perfectly coherent) for the complete multimodal event: both images, their corresponding textual captions, and the human-provided connective linking them. Also recall that Task 5 was evaluated only on Gemini-flash-2.0. To enable a direct comparison between the model’s output and the human judgments, both sets of scores were first normalized to a common scale using a min-max scaler. The density plots in Figure 4 reveal both a promising alignment and critical divergences. For the iconic connectives, the model’s scores show a distribution that closely resembles the human distribution of the connective with the highest rating. Both distributions are heavily skewed towards higher values (0.8-1.0), indicating that the model, like humans, finds these iconic constructions highly plausible. Conversely, a significant discrepancy emerges for the anti-iconic connectives. For because and especially after, the human ratings show a much broader distribution with a notable peak in the mid-to-low range, indicating greater uncertainty and lower acceptability in general. To quantify this alignment, we computed the Pearson correlation between the model’s ratings and the human judgments (see Table 4). The results confirm the visual trend: we observe a moderate and statistically significant correlation for the iconic connectives so and then. The correlation is weaker for the anti-iconic connective because, and becomes statistically insignificant for after.</p>
        <p>To better understand the sources of divergence
between the model’s and human judgments, particularly for
the cases that the model rated as highly implausible, we
performed an outlier analysis. We specifically focused on
low-scoring outliers, which we formally identified using
the interquartile range (IQR) rule: any data point falling
below the first quartile (Q1) minus 1.5 times the IQR was
flagged. As noted in the original ExpliCa dataset, a subset
of sentences were intentionally designed to be socially
challenging, touching on sensitive topics like religion,
immigration, drug abuse or sex. Our analysis (Figure
5) reveals that a significant portion of the outliers are
directly attributable to this subset. Specifically, 13 out of
the 31 most prominent low-scoring outliers correspond
to these socially challenging sentences. This finding
suggests that the model’s performance may be influenced by
its internal bias-mitigation and safety alignment
protocols. When confronted with sensitive content, the model
appears to override its linguistic and logical assessment,
assigning a very low acceptability score regardless of the
sentence’s grammatical or causal coherence. This
highlights a potential conflict where safety-driven heuristics
can interfere with and ultimately degrade the model’s
core reasoning capabilities on specific types of content.</p>
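        <p>The IQR rule used to identify the low-scoring outliers can be sketched as follows (a minimal NumPy illustration of the criterion described above, not the original analysis script):</p>
        <preformat>
import numpy as np

def low_scoring_outliers(scores):
    """Flag items whose score falls below Q1 - 1.5 * IQR."""
    scores = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    # A score is an outlier when the lower bound exceeds it.
    return np.flatnonzero(lower_bound > scores)
        </preformat>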
      </sec>
    <sec id="sec-2">
      <title>6. Conclusion and future works</title>
      <sec id="sec-2-1">
        <title>This paper investigated the capacity of modern Vision</title>
        <p>Language Models to reason about the structure of events.</p>
        <p>We augmented a curated dataset on causal reasoning
with visual stimuli, and designed five tasks of
increasing dificulty to asses how well the evaluated systems
handle causal and temporal relationships, particularly
when the logical flow of events diverges from their visual
presentation. The central finding of our experiments is a
profound vulnerability of the tested VLMs to an
"iconicity bias." This manifests as a sharp decline in accuracy for
anti-iconic relations, revealing a dependency on
perceptual order over abstract logic. This weakness in abstract
reasoning is likely rooted in an equally fragile
foundational visual understanding. Recent studies using
controlled evaluation frameworks [25], have in fact shown
that VLMs struggle to robustly identify even fundamen- stimuli. While the ExpliCa dataset contains mostly
contal object properties (like color or shape) and their basic crete, everyday scenarios, searching for relations
bespatial relations. Indeed, their performance is heavily tween their abstractness and the performances of the
dependent on positional biases, with objects at the center model may yield more robust findings.
of an image being recognized more reliably than those
at the periphery. If models fail to build a stable and
reliable representation of a single scene, their ability to References
infer complex causal and temporal relationships across
multiple scenes becomes inherently compromised. The [1] A. Lenci, Understanding natural language
macroscopic failures we observed (e.g., the iconicity bias) understanding systems. a critical analysis,
can therefore be seen as a direct consequence of these 2023. URL: https://arxiv.org/abs/2303.04229.
microscopic weaknesses. Furthermore, our analysis indi- arXiv:2303.04229.
cates that this reasoning is not purely logical; it may also [2] C. D. Manning, Human language understanding &amp;
be modulated by the models’ safety training, which can reasoning, Daedalus 151 (2022) 127–138. URL: https:
produce inconsistent evaluations of causally coherent //api.semanticscholar.org/CorpusID:248377870.
but sensitive content. Taken together, these results chal- [3] K. Mahowald, A. A. Ivanova, I. A. Blank, N.
Kanlenge the notion that scaling and multimodal pre-training wisher, J. B. Tenenbaum, E. Fedorenko,
Dissoare suficient for achieving robust, human-like reason- ciating language and thought in large language
ing. The models’ reliance on perceptual heuristics points models, 2024. URL: https://arxiv.org/abs/2301.06627.
to a fundamental gap between their pattern-matching
prowess and their ability to model the more complex,
non-sequential nature of real-world events.</p>
        <p>A crucial next step is to investigate whether these
behavioral failures reflect a deeper deficit in the
models’ underlying competence. A more direct evaluation,
drawing on the framework of Hu and Levy [26], would
involve measuring the log-likelihood that models assign
to different event structures. However, this approach
faces a significant technical barrier: the public APIs for
state-of-the-art multimodal models, including Gemini
2.0 Flash, do not currently provide access to token-level
log-likelihoods. This constraint makes it impossible to
directly probe their internal probability distributions.
Future work should therefore seek to replicate this study
using open-source VLMs where such access is possible.</p>
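      <p>For open-weights models where token-level scores are accessible, such a log-likelihood probe could look like the sketch below. It is only an illustration of the measurement, written for a text-only model for brevity (with an open VLM, the image inputs would additionally be passed to the model); the model name is an example, not the one used in this study.</p>
      <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of a log-likelihood probe in the spirit of Hu and Levy [26];
# model name is only an example.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `prompt`.
    Assumes the prompt tokenization is a prefix of the full-sequence tokenization."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_scores = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    # Keep only the tokens belonging to the continuation.
    return token_scores[0, n_prompt - 1:].sum().item()

# Comparing an iconic and an anti-iconic phrasing of the same event pair:
# continuation_logprob("He dropped the glass, ", "so it shattered.")
# continuation_logprob("The glass shattered, ", "because he dropped it.")
      </preformat>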
    </sec>
    <sec id="sec-3">
      <title>Limitations</title>
      <p>While the present work provides some interesting insights, it is fundamental to point out several of its limitations. First, the two models chosen for the analysis can be considered as good representatives of open-weights and closed-weights models in the small to medium-sized model range; we purposely avoided using larger VLMs as they typically come with a high computational (or monetary) cost. However, we must acknowledge that the paper’s results may not hold for other VLMs.</p>
      <p>Second, we leverage CoT prompting, but do not present here an analysis of the results from the CoT; these could point to additional insights. In addition to this, we must note that we did not perform any prompt-level optimization to improve the performance of each model individually.</p>
      <p>Third, we do not account for the abstractness of the stimuli. While the ExpliCa dataset contains mostly concrete, everyday scenarios, searching for relations between their abstractness and the performances of the models may yield more robust findings.</p>
    </sec>
    <sec id="sec-4">
      <title>A. Prompts</title>
      <sec id="sec-4-1">
        <title>Task 1. Relation Identification (Causal)</title>
        <p>The image above contains two separated images: Image a (on the left) and Image b (on the right). Describe the elements in both images. Now, think abstractly about the relationship between the two images. Focus on the general cause-and-effect pattern rather than specific details. The antecedent is the event that happens first and directly causes another event (the cause). The consequent is the event that happens as a result of the antecedent (the effect). If Image a is the consequent and Image b is the antecedent, respond with Image a. If Image b is the consequent and Image a is the antecedent, respond with Image b. Do not provide explanations, additional text or commentary.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Task 1. Relation Identification (Temporal)</title>
        <p>The image above contains two separated images: Image a (on the left) and Image b (on the right). Describe the elements in both images. Now, think about the temporal relationship between the two images. Focus on the sequence of events rather than specific details. If Image a follows Image b, respond with Image a. If Image b follows Image a, respond with Image b. Do not provide explanations, additional text or commentary.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Task 3. Connective Selection</title>
      </sec>
      <sec id="sec-4-4">
        <title>Task 5. Acceptability Rating</title>
        <p>Evaluate the acceptability of sentences that describe two events linked by connectives: ‘so’, ‘because’, ‘after’, and ‘then’. Rate each sentence on a scale from 1 to 10 based on how well the connective expresses the relationship between the events. Each event is also visually represented by an image: the left image corresponds to the first sentence and the right image corresponds to the second sentence. Sentence: sentence_a connective sentence_b. Provide only a numerical rating between 1 and 10, without explanations.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI), Gemini (Google), and Grammarly in order to: generate images, paraphrase and reword, improve writing style, and check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>arXiv:2301</source>
          .
          <fpage>06627</fpage>
          . [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>A survey of vision-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>language</surname>
          </string-name>
          pre-trained models,
          <year>2022</year>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          arxiv.org/abs/2202.10936. arXiv:
          <volume>2202</volume>
          .
          <fpage>10936</fpage>
          . [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , Vision-
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>future trends</source>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2210.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          09263. arXiv:
          <volume>2210</volume>
          .
          <fpage>09263</fpage>
          . [6]
          <string-name>
            <given-names>L. W.</given-names>
            <surname>Barsalou</surname>
          </string-name>
          , Grounded cognition: Past,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>716</fpage>
          -
          <lpage>724</lpage>
          . doi:
          <volume>10</volume>
          .1111/j.1756-
          <fpage>8765</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <year>2010</year>
          .01115.x. [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomason</surname>
          </string-name>
          , J. Andreas,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>grounds language</source>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <year>2004</year>
          .10151. arXiv:
          <year>2004</year>
          .
          <volume>10151</volume>
          . [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Miliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auriemma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondielli</surname>
          </string-name>
          , E. Cher-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>tion for Computational Linguistics: ACL</source>
          <year>2025</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>enna</surname>
          </string-name>
          , Austria,
          <year>2025</year>
          , pp.
          <fpage>17335</fpage>
          -
          <lpage>17355</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          //aclanthology.org/
          <year>2025</year>
          .findings-acl.
          <volume>891</volume>
          /. doi: 10.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>