<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLMs Struggle on Explicit Causality in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martina Miliani</string-name>
          <email>martina.miliani@fileli.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Paglione</string-name>
          <email>l.paglione1@studenti.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Serena Auriemma</string-name>
          <email>serena.auriemma@phd.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <email>lucia.passaro@unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The ability to recognize and interpret causal relations is fundamental for building robust intelligent systems. Recent research has focused on developing benchmarks and tasks to evaluate the inferential and causal reasoning capabilities of LLMs, such as the Pairwise Causal Discovery (PCD) task. However, most of these resources are limited to English. In this paper, we present ExpliCITA, a translation of the English ExpliCa dataset [1], which is the first publicly available dataset for joint temporal-causal reasoning in Italian, enabling evaluation of LLMs on Italian PCD. We conduct an extensive empirical study across 20 Italian and multilingual models of varying sizes and training strategies, combining a perplexity-based evaluation of causal reasoning competence with multiple-choice prompting tasks in both zero-shot and few-shot settings. Our results show that all tested models, including the GPT family, struggle with the ExpliCITA PCD task, more so than with the original English ExpliCa, in both evaluation scenarios. Moreover, native Italian models do not outperform fine-tuned multilingual alternatives. Consistent with prior findings, we observe that the linguistic competence of models, measured using perplexity-based metrics, is higher than their respective performances, measured via accuracy on prompting results; however, this gap tends to narrow with increasing model size. Finally, a per-class performance analysis reveals that models handle causal relations relatively better than temporal ones.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>Causal Reasoning</kwd>
        <kwd>Language Resources</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Benchmarking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>A common way to frame this task is to give two sentences as input to the model (i.e., “Martina has
less chances of getting the flu” and “Martina has been
vaccinated against the flu”), and to ask the model, with a yes/no
question, whether the first sentence is a consequence of the second
(in this case, ground truth: “yes”) [1, 10].</p>
      <p>Temporality plays a crucial role in the context of
causality, as every causal relation inherently implies a temporal
one: If an event A causes an event B, A must
necessarily occur (or begin to occur) before B. Conversely, the
presence of a temporal relation between two events does
not necessarily imply a causal link. For this reason, we
extended the PCD task to include the identification of
temporal relations, to explicitly disentangle the interplay
between causality and temporal sequencing.</p>
      <p>To address this issue, in previous works we introduced
the ExpliCa (Explicit Causality) benchmark [1], offering
a more controlled experimental setup that jointly
addresses temporal and causal reasoning. ExpliCa presents
pairs of sentences, each describing a distinct event,
without any surface-level linguistic cues for temporal and
causal relations, except for a connective that explicitly
encodes both the type of relation (i.e., causal or temporal)
and the order between the two events. For example, in [1],
we asked the models to choose which of four connectives
(so, because, then, and after) best represents the relation
between the sentences “Martina has less chances of
getting the flu” and “Martina has been vaccinated against the
flu” (in this case, ground truth: “because”).</p>
      <p>Despite this progress, resources for joint temporal-causal reasoning remain scarce in languages other than English.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <title>Recognizing causal relations is a core human cognitive</title>
        <p>skill. Causal understanding is in fact fundamental to
intelligent reasoning [2]. Thus, a strong AI system should
be capable of performing causal reasoning.</p>
        <p>The past few years have in fact seen a vigorous debate
about the extent to which large language models (LLMs)
are actually capable of genuine inference, beyond mere
pattern matching [3, 4, 5]. Among the inferences a model
should be able to perform lies the causal one. Therefore,
several benchmarks targeting causality have emerged
recently [6, 7, 8].</p>
        <p>A popular evaluation paradigm for causal reasoning is
Pairwise Causal Discovery (PCD), which aims to detect
pairwise causal relations from observational data. In a
PCD task a model must determine if a causal link exists
between two events, along with the direction of causality
[9, 10]. A common way to frame this task is to give
in languages other than English. At the same time, datasets focus on presenting a contextual scenario to test
a rich ecosystem of LLMs pre-trained on, or adapted causal inference [14, 6, 15, 16, 17, 18], while others
chalto, languages other than English, including Italian, is lenge NLP systems to identify causal relations directly
rapidly emerging. on the text [19, 16, 20], also along with temporal ones</p>
        <p>To partially fill this gap, we introduce ExpliCITA [21, 22, 23, 24, 25]. ExpliCITA stems from ExpliCa [1], a
(Explicit Causality in ITAlian). ExpliCITA is an Ital- dataset developed to evaluate the ability of LLMs to detect
ian adaptation of ExpliCa and we believe it is the first explicit causal and temporal relations between events. In
benchmark dedicated to joint temporal and causal rea- ExpliCa, relations are annotated via crowdsourcing and
soning in Italian. are signaled exclusively through a connective linking a</p>
        <p>We also leverage the evaluation framework for ExpliCa pair of sentences, carefully stripped of any additional
to conduct the first large-scale evaluation of Italian lan- contextual or lexical cues. This controlled setup
miniguage models on causal reasoning. The framework allows mizes the influence of surrounding context and enables
us to test both competence (what the model “knows” a more focused assessment of the model’s reasoning on
about the probability distribution of linguistic events) explicit relational cues.
via perplexity, and performance (how it applies that Due to its design, ExpliCITA shares its structure with
“knowledge”) via prompting [11, 12]. Specifically, the other datasets that frame implicit causal relations in a
prompting task is formulated as a multiple-choice task, sentence-pair format, where each sentence expresses an
where models have to select the appropriate connective individual event. Notable among these are the COPA
in a cloze-style prompt. We explore diferent generation dataset [26], the e-CARE dataset [27], and tasks from the
settings: greedy decoding and the Outlines framework BIG-Bench benchmark [28], which also test models on
[13], under both zero- and few-shot regimes. Our evalua- explicit causal reasoning. COPA and e-CARE were both
tion includes a total 20 models across a spectrum of sev- incorporated into the original ExpliCa dataset.
eral sizes and training approaches: i.) seven native Italian While resources for English are abundant, the
availabilmodels trained from scratch, ii.) four multilingual models ity of non-English datasets for causal reasoning remains
ifne -tuned on Italian, iii.) three open-weights multilin- limited. Nevertheless, contributions exist for Spanish
gual models, iv.) an open-weight reasoning-specialized [29], German [30], Arabic [31], and Persian [32]. Among
LLM, and v.) five commercial systems from the GPT multilingual eforts, MECI [ 20] stands out as a resource
family. where causal relations are annotated across several
lan</p>
        <p>We make both the data and code available on GitHub guage editions of Wikipedia.
to replicate our experiments.1 Causal reasoning, and related tasks such as Pairwise
Our contribution is twofold: Causal Discovery (PCD), belongs to a broader class of
inference-based tasks in natural language understanding.
• we present ExpliCITA, the first dataset for joint These tasks aim to evaluate a model’s ability to derive
temporal-causal reasoning in Italian; implicit information from textual input, whether through
• we deliver an extensive empirical study across logical entailment, causal attribution, or commonsense
20 Italian and multilingual models, following a associations. Within this wider inference landscape,
Natrobust evaluation framework combining an evalu- ural Language Inference (NLI) benchmarks like XNLI
ation via perplexity with multiple-choice prompt- [33] test models on cross-lingual entailment across 15
ing in several settings. This allows us to highlight languages, while datasets such as X-CSQA [34] focus
strengths, weaknesses, and performance varia- on cross-lingual commonsense reasoning in a
questiontion across model types and sizes. answering format.</p>
        <p>In the Italian context[35], the first dataset for textual</p>
        <p>The remainder of the paper is organized as follows: entailment was introduced during the EVALITA 2009
Section 2 reviews related work; Section 3 introduces the evaluation campaign, comprising 800 sentence pairs
deExpliCITA dataset; Section 4 details the experimental rived from Wikipedia revision histories [36]. More
resetup; and Section 5 presents and discusses the results. cently, the HellaSwag-it dataset, an adaptation of the
original HellaSwag dataset [37], was developed to test
commonsense inference by asking models to choose the
2. Related Works most plausible ending to a given scenario. Additionally,
for causal reasoning, the COPA dataset was translated
The study of causality and its linguistic expressions has into Italian (and other languages) as part of the XCOPA
recently regained momentum, particularly in the con- project [38]. Both XCOPA-it and HellaSwag-it were
intext of evaluating the reasoning capabilities of large lan- tegrated into ItaEval [39], a benchmark for evaluating
guage models (LLMs). In this domain, many evaluation LLMs on Italian commonsense and factual reasoning.
1https://github.com/Unipisa/ExpliCITA ItaEval was featured in the 2024 Italian NLP evaluation
campaign, CALAMITA [40], which included a wide range are: so (Causal, Iconic), because (Causal, Anti-Iconic), then
of datasets to test commonsense and factual knowledge. (Temporal, Iconic), and after (Temporal, Anti-Iconic).
Among them, Gita [41] is particularly relevant here: it A defining feature of the dataset is that the connective
focuses on physical commonsense in Italian, present- is the sole linguistic cue indicating the semantic
relaing pairs of plausible and implausible stories composed tion between sentence pairs. To ensure a controlled and
of sentence sequences. To the best of our knowledge, challenging evaluation of causal reasoning, the dataset
ExpliCITA is the first dataset specifically dedicated to excludes any additional explicit marker, such as causal
evaluating explicit causal and temporal reasoning in a verbs, and removes anaphoric references by avoiding
controlled setting for the Italian language. personal pronouns. This design compels models to rely
exclusively on event semantics and the connective itself,
without support from broader contextual cues.
3. The ExpliCITA Dataset The dataset was then annotated via crowdsourcing by
English native speakers. Specifically, annotators were
The ExpliCITA Dataset is a direct translation of ExpliCa asked to rate the acceptability of a sentence pair linked by
[1]. The original dataset was designed as a benchmark one of the connectives. Each sentence pair, in both orders,
for evaluating explicit causal reasoning in LLMs, with a
particular focus on distinguishing causal relations from witeitmhsa)llwpaosssriabtleedcboynn1e5ctpiavretsic(i6p0a0n×ts.2F×or4ea=ch4s8e0n0tetontcael
temporal ones, using the PCD task. A thorough descrip- pair in both orders of presentation, the connective with
tion of the dataset and its properties is reported in [1]. In the highest acceptability rating was considered as the
the following, we highlight some of its key aspects. ground truth. Note that the ground truth based on human</p>
        <p>Approximately a third of the items in ExpliCa are based ratings do not overlap perfectly with the original
distincon other existing datasets [42, 28, 27]. The remaining two tion in CAUSAL, TEMPORAL, and UNRELATED groups
thirds are manually crafted. In total, 600 items are in the made by authors when building the sentence pairs.
dataset. Each item of the dataset comprises a sentence To build ExpliCITA from ExpliCa, we followed a
semipair S1 and S2, where each sentence describes an event. automatic translation procedure. First, we used ChatGPT</p>
        <p>The dataset has two key dimensions, namely the type via the web interface2 to translate each sentence from
of relation and the order of presentation. As for the type of the 600 pairs independently. Then, each sentence was
relation, the items were selected by authors to be equally manually evaluated to address errors in the automatic
divided into three main subsets: i.) CAUSAL, where the translation. Errors ranged from mistakes in gender
asrelationship is causal, and possibly of temporal prece- signment (e.g., “Luca è stata [...]”) to completely missing
dence; ii.) TEMPORAL, where the relation is only of tem- idiomatic expressions (e.g., “Marco ran the red light”,
poral precedence, without causality; iii.) UNRELATED,
that includes thematically related sentences that are nei- translated as “Marco ha corso la luce rossa” instead of
ther causally nor temporally related. Potential biases in
“lMataiorncosènpeaesdseadtomcoalnruosaslov”)e.rAificasitgionnific.aFnotrnEuxmpblieCrIoTfAtr,awnselexical elements are controlled for using Mutual Informa- used the following four connectives:
tion between lexical elements of the sentence pairs. This
is done to avoid having very diferent lexemes in the UN- Quindi - Indicates a causal relation in the iconic order.
RELATED group with respect to the other groups. The The event in S1 causes the event in S2.
diferences in the association strengths between lexemes
in the three groups are not statistically significant. Perché - Indicates a causal relation in the anti-iconic</p>
        <p>As for the order of presentation, it can be either order. The event in S1 is caused by the event in S2.
ICONIC (in short form Ic), if the sequence of events ex- E poi - Indicates a temporal relation in the iconic order.
pressed in the two sentences matches their chronological The event in S1 temporally precedes the event in S2.
and/or logical-causal order (e.g., “S1 then S2”), or
ANTIICONIC (in short form, A-Ic), if the sequence of events Dopo che - Indicates a temporal relation in the
antiexpressed in the two sentences is inverted compared iconic order. The event in S1 follows the event in S2.
to their chronological and/or logical-causal order (e.g.,
the efect is mentioned before the cause: “S2 because The choice of multi-token expression for the temporal
S1”). Note that, for each sentence pair, the dataset in- connectives is due to the fact that no suficiently frequent
cludes both the Iconic and Anti-Iconic order for a total single word in Italian conveys the proper meaning.
of 600 × 2 = 1, 200 items. ExpliCITA includes each sentence pair in both orders</p>
        <p>The type of relation and the order of presentation are of presentation. Thus, the number of data points is 600 ×
expressed via one out of four connectives, that act as lin- 2 = 1, 200. We consider as our ground truth the results
guistic cues to explicitly signal the nature of the relation- of the crowdsourcing experiment for ExpliCa [1]. In
ship. In the English version of the dataset, the connectives</p>
      </sec>
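      <p>For illustration, a single ExpliCITA data point can be represented as the following record; the field names are hypothetical (not the dataset's actual schema), and the example pair is our Italian rendering of the “Martina” sentences used throughout the paper:</p>

```python
# Hypothetical shape of one ExpliCITA data point; field names are
# illustrative, not the dataset's actual schema.
item = {
    "s1": "Martina è stata vaccinata contro l'influenza",
    "s2": "Martina ha meno probabilità di prendere l'influenza",
    "order": "iconic",            # or "anti-iconic"
    "group": "CAUSAL",            # CAUSAL, TEMPORAL, or UNRELATED
    "gold_connective": "quindi",  # quindi, perché, e poi, dopo che
}

# Mapping of the four Italian connectives to (relation type, order),
# as defined in Section 3.
CONNECTIVES = {
    "quindi": ("causal", "iconic"),
    "perché": ("causal", "anti-iconic"),
    "e poi": ("temporal", "iconic"),
    "dopo che": ("temporal", "anti-iconic"),
}

relation, order = CONNECTIVES[item["gold_connective"]]
```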
      <sec id="sec-2-2">
        <title>2Accessed on December 2024</title>
        <p>Group
Connective
Quindi (Caus., Ic)
Perché (Caus., A-Ic)
E poi (Temp., Ic)
Dopo che (Temp., A-Ic)
CAUSAL TEMPORAL UNRELATED Total</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Setting</title>
      <p>In the task, the model is presented with S1 and S2 and a list of choices,
each representing a connective. The task is to provide
the correct choice. We experiment in both zero-shot and
few-shot scenarios. For the few-shot, the models saw one
example for each connective, for a total of four examples.
To avoid biases in the choices, both the order of the options
to choose from and the position of the correct answer
are randomized. Note, however, that all models saw the
exact same prompt for any item in the dataset. We use
accuracy as our main metric. To distinguish it from APS,
we refer to values obtained via prompting as Accuracy
on Prompt Execution (APX).</p>
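      <p>For illustration, the option randomization described above can be sketched as follows; the wording, labels, and variable names are ours, not the exact template of Appendix A:</p>

```python
import random

# Illustrative sketch of the multiple-choice item construction: option
# order and gold position are randomized per item.
CONNECTIVES = ["quindi", "perché", "e poi", "dopo che"]
LETTERS = ["A", "B", "C", "D"]

def build_item(s1, s2, gold, rng):
    options = CONNECTIVES.copy()
    rng.shuffle(options)                        # randomize option order
    lines = [f"{letter}) {opt}" for letter, opt in zip(LETTERS, options)]
    prompt = f"Frase 1: {s1}\nFrase 2: {s2}\n" + "\n".join(lines)
    gold_letter = LETTERS[options.index(gold)]  # track the gold position
    return prompt, gold_letter

rng = random.Random(0)
prompt, gold = build_item(
    "Martina è stata vaccinata contro l'influenza",
    "Martina ha meno probabilità di prendere l'influenza",
    "quindi",
    rng,
)
```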
      <p>The template for the prompt is shown in Appendix A.
We used the Jinja template syntax.3 The prompt is not a
direct translation, but it is heavily inspired by the one used
in [1]. First, we provide the models with the description
and format of the task; for the few-shot scenario, we
provide the examples; then, we give clear instructions
on how to complete the task; finally, we describe the
task.</p>
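      <p>As an illustration of this template structure, the following sketch uses Python's stdlib string.Template in place of Jinja; all section wording and field names are ours, not the actual Appendix A template:</p>

```python
from string import Template

# Illustrative stand-in for the Jinja prompt template: task description,
# optional few-shot examples, instructions, then the item to complete.
TEMPLATE = Template("$description\n\n$examples$instructions\n\n$item\nRisposta: ")

def render_prompt(description, instructions, item, examples=""):
    # In the few-shot setting, `examples` holds one solved item per connective.
    if examples:
        examples = examples + "\n\n"
    return TEMPLATE.substitute(
        description=description,
        examples=examples,
        instructions=instructions,
        item=item,
    )

zero_shot = render_prompt(
    "Scegli il connettivo corretto tra le due frasi.",
    "Rispondi con una sola lettera (A, B, C o D).",
    "Frase 1: ...\nFrase 2: ...\nA) quindi\nB) perché\nC) e poi\nD) dopo che",
)
```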
      <sec id="sec-3-1">
        <title>The goal of our experiments is to test LLMs on the PCD</title>
        <p>task of the ExpliCITA dataset from two perspectives. On
the one hand, we want to assess the linguistic
competence of the model: the fact that it encodes some
linguistic knowledge about causal and temporal relations. fine-tuned models, we used a template that would enable
We do so by leveraging a perplexity-based evaluation. also pre-trained only models to answer. Note that we did
On the other hand, we want to address the actual per- not implement specific templating strategies (e.g., chat
formance of the model on the dataset. We do so via a formatting, special tokens, etc.) for any model, and we
prompt-based evaluation in which the model has to fed all the models with exactly the same prompt.
solve our PCD task, by identifying the correct connective The only exception was GPT, which was prompted
for a sentence pair. Our main goal is to evaluate Italian using the chat format, as required by the model’s API.
LLMs on Italian data. In addition to native Italian LLMs, However, the content of the prompt was the same as the
we also consider other model classes. Specifically, we ac- one used for all other models, without the addition of any
count for i.) Italian fine-tuned models, i.e. open-weights custom system messages, special tokens, or
instructionmodels fine-tuned on Italian, ii.) open-weights multilin- specific formatting.
gual models, iii.) open-weights reasoning models, and iv.) We used a markdown-like syntax to highlight the
secclosed commercial models. All tested models are listed tions of the prompt. We acknowledge that not formatting
in Section 4.1. the prompt for each model may hinder performances in
some cases. However, we argue that this ensure a more
fair evaluation. The only exception was made for the
Perplexity-Based Evaluation. This experiment is an reasoning model, for which we also include the &lt;think&gt;
exact replica of the one conducted in [1]. For each sen- token at the end of the prompt, to ensure that the
Chaintence pair in the dataset (i.e., in both orders of presen- of-thought is started.
tation), we derive one sentence for each connective, in We used a greedy decoding strategy for all
experithe form “S1 {{ connective }} S2”. We obtain ments, that is we always sample the next most likely
1, 200 × 4 = 4, 800 sentences in total. For each of them, token at each generation step. We let each model
generwe compute a model’s perplexity (PPL) over the whole ate a maximum of 20 tokens in their response. For the
sentence. We then rank the four sentences based on PPL, reasoning model, we let it generate a maximum of 10,000
and consider the one with the lowest value as the “model tokens. All models, with the exception of GPT variants,
connective choice”. Finally, we compute the accuracy of where used in their HuggingFace implementation.4
the model choices against the ground truth. We call this A notable issue with unconstrained text generation is
metric Accuracy on Perplexity Score (APS). that less performing models may yield text that do not
conform to the standard asked for in the prompt. This
remains true also for cases, like ours, where the expected
answer can be the direct continuation of the prompt,
rather than the answer to a question or the turn in a
Prompt-Based Evaluation. For the prompt-based
evaluation, we asked the models to identify the correct
connective to use between S1 and S2. We chose to focus
on a standard multiple-choice task, as it is one of the most
widely used formats for evaluating LLMs, and replicates
one of the prompting experiments in [1]. In the task, the
3https://jinja.palletsprojects.com
4huggingface.co
conversation. To alleviate this issue, we proceeded in cerbero-7b variants [47]. They are respectively fine
two ways. First, we implemented a post-processing strat- tuned versions of LLaMA-2, LLaMA-3 and Mistral.
egy based on a set of regular expressions to parse each
model response and extract one answer. The regexes Open LLMs: We also evaluated the performances of
were designed to extract one and only one option from strong contenders in the Open LLM space. To do so, we
the generated text. In cases where multiple answers or no selected Meta’s LLaMA-3.1-8B [48] and two versions of
answer were detected, it was counted as a mistake for the Google’s Gemma3 [49], namely the 4B and 12B ones.
model. In Section 5, we report the results of the model af- Reasoning LLMs: We also tested one reasoning model,
ter this post-processing. Some models consistently failed namely DeepSeek-R1-Distill-Llama-8B [50], a
disto provide appropriate answers in this setting. tilled version of DeepSeek-R1 using LLaMA-3.1-8B. This</p>
        <p>Second, we employed Outlines [13],5 a Python library allows us to explore how reasoning impact performances
built to provide structured text generation with LLMs on our PCD task.
(e.g., with type constraints, following regular expressions,
or providing json-formatted outputs). In the case of mul- Commercial models: Finally, we tested the
GPTtiple choices, it uses masking on the output probabilities 4x family as representative of commercial
closedto restrict the model outputs to a set of valid completions source models. We evaluated both gpt-4o and
[13]. In our case, the possible completions are the “A”, gpt-4o-mini [51], and all the GPT-4.1 variants
“B”, “C”, and “D” options for the tasks. This approach (gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano) [52].
has become quite popular in the literature and has been
adopted in several recent studies on generative LLMs Depending on its size, each model required a time
[12]. Note that Outlines was not used for the GPT vari- between 0.5 and 1 GPU hours to complete its run, that
ants and one of the open-weights tested models, namely includes both the zero-shot and few-shot experiments,
Gemma3. In fact, all GPT models consistently yielded each consisting of: i.) generation with greedy decoding;
properly formatted outputs, making an additional evalu- ii.) generation with Outlines; and iii.) PPL scores
compuation redundant (recall that the next-token prediction is tation. The DeepSeek-R1-Distill-Llama-8B model
performed in a greedy fashion) and economically costly. required around 10 GPU hours in total, due to its much
Moreover, a known bug in the current Outlines and Hug- higher demand for test-time compute. Experiments with
gingFace implementations prevents all Gemma3 models the GPT-4x family were conducted using the oficial
Opeto be run through Outlines at this stage. nAI Batch API.6 The code for replicating the experiments
is available on GitHub.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Tested Models</title>
        </sec>
      </sec>
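      <p>The perplexity-based ranking behind APS (building an “S1 connective S2” variant for each connective and keeping the lowest-perplexity one) can be sketched as follows; token_logprobs is a toy stand-in for a real language model's token scores, and the Italian sentences are our rendering of the paper's running example:</p>

```python
import math

# Toy stand-in for a language model's per-token log-probabilities;
# in the actual experiments these come from the evaluated LLM.
def token_logprobs(sentence):
    tokens = sentence.split()
    return [math.log(1.0 / (i + 2)) for i in range(len(tokens))]

def perplexity(sentence):
    # PPL = exp of the mean negative token log-probability.
    lps = token_logprobs(sentence)
    return math.exp(-sum(lps) / len(lps))

def aps_choice(s1, s2, connectives):
    # Build "S1 connective S2" for each connective and keep the variant
    # with the lowest perplexity as the model's connective choice.
    scored = {c: perplexity(f"{s1} {c} {s2}") for c in connectives}
    return min(scored, key=scored.get)

choice = aps_choice(
    "Martina è stata vaccinata contro l'influenza",
    "Martina ha meno probabilità di prendere l'influenza",
    ["quindi", "perché", "e poi", "dopo che"],
)
```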
      <sec id="sec-3-2">
        <title>We chose to experiment on a variety of models and model</title>
        <p>classes, to gain a broader and clearer picture of the
problem. Our main goal was to evaluate native Italian LLMs
on the PCD task. Thus, we considered the following
native Italian model families/variants:</p>
      </sec>
      <sec id="sec-3-3">
        <title>In this Section we present and discuss the results. We</title>
        <p>ifrst look at the overall results based on Accuracy of
models on the PCD task of ExpliCITA, in terms of both i.)
linguistic competence with APS, and ii.) performance
Minerva [43]. We considered all model sizes of the with APX in zero- and few- shot experiments, with and
Minerva family (from 350M to 7B), including both the without Outlines. Then, we present additional results by
Instruction fine-tuned and pre-trained only ones. considering two aspects. On the one hand, we look at the
distribution of answers for each model, to highlight
posVelvet [44]. We experimented with both available mod- sible biases and failures in providing an answer. On the
els, namely Velvet-2B and Velvet-14B. other hand, we look at per-class performances, to
understand whether the tested LLMs show biases in modelling
specific aspects of temporal and causal reasoning.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>We highlight that we were not able to ran experiments</title>
        <p>on the Italia-9B model due to issues with its loading via
the HuggingFace library.</p>
        <p>We also chose to experiment with non-native Italian
models for a clear and fair comparison. These can be
distinguished into four classes:
Italian Fine-Tuned models: This class includes
LLaMAntino-2-chat-7b-hf-UltraChat-ITA [45],
LLaMAntino-3-ANITA-8B-Inst-DPO-ITA [46] and</p>
      </sec>
      <sec id="sec-4-2">
        <title>5https://github.com/dottxt-ai/outlines</title>
        <sec id="sec-4-2-1">
          <title>5.1. Overall Results</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Our main findings for the evaluation of LLMs on Ex</title>
        <p>pliCITA are summarised in Figure 1. The Figure shows
the Accuracy of all tested models, in all scenarios. We
divide the plot by model family, and sort each family by
the model size.</p>
      </sec>
      <sec id="sec-4-4">
        <p>6https://platform.openai.com/docs/guides/batch</p>
        <p>The results are in line with the experiments reported for ExpliCa [1]. We highlight several interesting aspects in the following. The use of Outlines appears to be beneficial mostly where zero- or few-shot performances are quite low (e.g., below 0.1). In other cases, the use of Outlines seems less influential. Nevertheless, the same accuracy may be obtained from a significantly different distribution of answers, as will be discussed in the following Sections.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Overall performance. As for the raw performances,</title>
        <p>all models except the GPT-4x family show rather poor
or at least somewhat brittle performances. The only Model sizes. As shown in [1], we observe that the size
models capable of approaching GPT-level performances of the model is relevant for its downstream performances.
are DeepSeek-R1 and Gemma3-12B. However, this is In the open-weights model classes, the two best
performachieved either with the inclusion of reasoning for ing models are Gemma3 and Velvet, respectively in the
DeepSeek, or only in a specific setting for Gemma. 12B and 14B variants. Both also display above average
APS scores. However, it is also interesting to note that
while Gemma3-4B was not able to solve the task at all,
the 2B variant of Velvet was consistent in its performance,
which closely match those of some larger models.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Zero- vs Few-Shot. As for the diference in zero-shot</title>
        <p>and few-shot settings, the GPT-4x family is again the
only one where there is a clear and consistent trend,
in this case in favour of the few-shot setting. In other
cases, the few-shot examples are not always beneficial: Competence vs. Performance. It is important to
nofor some models (e.g., Gemma3-12B, LLaMAntino-2 and tice that APS is always better than APX, with the sole
Minerva-3B) it appears to be detrimental, while for other exception of the Gemma-3-12B model. This further
corit is inefective. However, for Minerva-7B we observe roborates some of the findings in [ 1]: while models’
interthat while for the pre-trained variant the examples are nal representations and probability distribution encode,
detrimental, this is not true for the instruction-tuned one. at least to some extent, knowledge about causal and
temThis is possibly due to the instruction-tuning dataset of poral relations, this knowledge is not fully accessed via
the model. prompting. This is also in line with other research [11].
Moreover, it was shown in [1] that the gap between APS
Impact of Outlines. It appears to be beneficial mostly and APX shrinks with the size of the model. Given the
for cases where zero- or few-shot performances are quite wide array of tested open-weights model, we can further
corroborate this hypothesis by looking at Figure 2. We
can clearly see that the rate of improvement in APX as
models grow in size (red trendline) is higher than their
respective rate of improvements in APS (blue trendline)
on the task.</p>
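        <p>The trendline comparison of Figure 2 can be made concrete with a least-squares slope of accuracy against log model size; the model sizes and accuracy values below are invented toy numbers for illustration, not our results:</p>

```python
import math

# Least-squares slope of accuracy against log model size.
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

sizes_b = [0.35, 1, 3, 7, 14]          # model sizes in billions (toy)
aps = [0.30, 0.33, 0.36, 0.40, 0.43]   # invented APS values
apx = [0.10, 0.16, 0.24, 0.33, 0.41]   # invented APX values

log_sizes = [math.log(s) for s in sizes_b]
aps_slope = slope(log_sizes, aps)
apx_slope = slope(log_sizes, apx)

# A steeper APX trendline means the APS-APX gap narrows as size grows.
gap_narrows = apx_slope > aps_slope
```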
      </sec>
      <sec id="sec-4-7">
        <title>We also highlight the following relevant findings associated with specific model classes:</title>
        <p>Italian Models are weak; Native Italian pre-training
is not beneficial. Native Italian models do not show
relevant improvements with respect to fine-tuned
alternatives, neither at the same size, nor at larger sizes. The
Velvet family appears to provide relatively solid results at
all scales; in contrast, smaller models in the Minerva
family appear to be less robust on ExpliCITA. The fine-tuned
Italian models display similar, if not better, performances
than native ones. This could lead us to question whether
it’s truly necessary to train LLMs from scratch on Italian
data. Results suggest that, albeit limited to this case study,
it is not.</p>
        <p>GPTs struggle. On ExpliCa, the GPT model family
displayed performances that could not reach 0.8 accuracy
[1]. Changing the language of the dataset and the prompt
highlights a stark contrast: the drop in performance for
the same models is around 0.20 points, and even newer
models cannot reach 70% accuracy. Considering the
fact that the task has remained exactly the same, and that
GPT “speaks” fluent Italian, this may be an indication that
current LLMs are still limited in terms of actual causal
reasoning, and still reliant on their internal probabilistic
representations of texts.</p>
        <p>Test-time compute is beneficial. We observed that
the performance of the distilled DeepSeek-R1 drastically
improves when it is allowed to use its “reasoning” abilities.
This is particularly interesting, as it somewhat contrasts
with the expectation that the task does not require particular
forms of reasoning, which may instead be required when
modelling phenomena such as implicit causal relations.
This issue will be further addressed in future work. We
also note that while answers were provided in Italian,
the chain-of-thought enclosed in the &lt;think&gt; tokens is
almost exclusively in English.</p>
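        <p>For such reasoning models, scoring requires stripping the chain-of-thought before parsing the answer letter; a minimal sketch of this step (our own illustration, not the paper’s actual post-processing code) might look as follows.</p>

```python
import re

def parse_answer(raw):
    # Drop any <think>...</think> chain-of-thought, then look for the
    # required "Risposta: X" pattern with X in {A, B, C, D}.
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"Risposta\s*:\s*([ABCD])\b", visible)
    return match.group(1) if match else None

raw = "<think>First I compare the two clauses... B fits.</think>Risposta: B"
print(parse_answer(raw))  # → B
```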
        <sec id="sec-4-7-1">
          <title>5.2. Additional Analyses</title>
        </sec>
      </sec>
      <sec id="sec-4-8">
        <title />
        <p>Besides evaluating the accuracy of models on ExpliCITA, we also consider two other aspects that allow us to further understand the behaviour of the tested models in our setting.</p>
      </sec>
      <sec id="sec-4-9">
        <title>Distribution of Answers</title>
        <p>First, we explore how models actually answered the multiple-choice task. The
distribution of answers with greedy decoding and with
outlines in the zero-shot setting is shown in Figure 3. We
leave out the visualization of the few-shot setting due to
space limitations, but they are very similar in nature.</p>
      </sec>
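        <p>The distribution in Figure 3 reduces to a frequency count over the parsed model outputs; the sketch below illustrates the idea on a made-up answer list (not data from the paper), with None marking responses from which no valid option letter could be extracted.</p>

```python
from collections import Counter

# Hypothetical parsed answers for one model; None = no valid letter found.
answers = ["A", "C", "C", None, "A", "C", "B", None, "C", "A"]

counts = Counter("invalid" if a is None else a for a in answers)
total = len(answers)
distribution = {opt: counts.get(opt, 0) / total
                for opt in ["A", "B", "C", "D", "invalid"]}

# A skewed distribution (e.g., mostly "A" or "C") signals positional bias;
# a high "invalid" share signals failure to follow the output format.
print(distribution)
```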
      <sec id="sec-4-10">
        <title />
        <p>We observe that some models consistently fail to provide an adequate answer, thus drastically lowering their
performances. For example, it is possible that when
ANITA actually answered, it did so correctly, but it was
able to answer only a very small fraction of the questions.</p>
        <p>Moreover, although we applied post-processing to the
model responses (see Sec. 4), we still observed persistent
failure modes, primarily due to the model’s inability to
follow the expected output format. Such behaviors can be
broadly described as faithful hallucinations caused by
instructional inconsistency [53], in which the model’s
output is not properly aligned with the user’s request. These
failures often consisted in limitations in the number of
requested output tokens, which the models were unable
to respect, unintended rewriting of the input question,
or, more generally, a lack of adherence to the structure
and intent of the prompt.</p>
        <p>We also observe that several models have a strong
preference for a specific answer, which is often either “A”
or “C”. This is in line with research on biases in
multiple-choice tasks [54]. This is corroborated by the fact that,
even with Outlines, these models still tend to prefer a
specific answer over the others.</p>
      </sec>
      <sec id="sec-4-11">
        <title>Per-class Performances</title>
        <p>Finally, Figure 4 shows the
Precision and Recall performances of each model, divided
by class. Again, we look at the zero-shot scenario and
leave out the few-shot one due to space limitations. By
looking at the plot, three main observations can be made.
First, the GPT-4x models are the most consistent across
classes, with only a few notable exceptions for the
smallest models. Second, we observe that some of the models
display a relatively strong bias towards a single answer or a
pair of answers. Finally, if we zoom out and look at the bigger
picture, we see that models have a slight preference
towards causal relationships. The least biased models are the
two biggest ones, namely gpt-4.1 and gpt-4o. This may
further suggest that at smaller scales models rely more
on distributional properties of words (e.g., causal
connectives often imply a temporal relationship as well, but not
vice versa) and are more sensitive to frequency effects
linked to word combinations frequently encountered
during training. In Italian, in fact, causal connectives such as
“perché” (“because”) or “quindi” (“therefore”) are often used in syntactic
constructions where the premise is explicitly connected to the
consequence via one of these two connectives. The
construction “S1 connective S2” is therefore typical
for causal relationships.</p>
        <p>In contrast, there is greater variability in how
temporal sequential relationships can be expressed in
Italian. These can be conveyed through temporal
conjunctions such as “e poi” (“and then”) or “dopo che” (“after”),
as well as through verbs and adverbial expressions
such as “precedentemente” (“previously”), “successivamente”
(“subsequently”), or “poco fa” (“a short while ago”).
Equally frequent are cases in which temporal relations
are conveyed solely through verb tense agreement between
the two clauses, for instance, through a past–present
combination to express anteriority between S1 and S2.
Compared to causal relationships, the temporal dimension
is thus more susceptible to variability, both in terms of
the range of constructions available to express the same
temporal relation in Italian, and in terms of the diversity
of contexts in which the same temporal adverb might occur.</p>
        <p>Indeed, while causality pertains to a subset of verbs
and situational contexts, temporal information, whether
implicit or explicit, is present in all events expressed by
a verb. This variability affects the generalization
capabilities of the models, especially the smaller ones. In fact,
larger models seem better able to properly evaluate the
context and identify the correct relationship between
events.</p>
      </sec>
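        <p>The per-class Precision and Recall plotted in Figure 4 reduce to counts of true and false positives per label; the gold/predicted pairs below are invented solely to illustrate the computation.</p>

```python
from collections import defaultdict

def per_class_pr(gold, pred):
    # Precision and Recall per label from paired gold/predicted lists.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    result = {}
    for label in set(gold) | set(pred):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        result[label] = (tp[label] / p_den if p_den else 0.0,
                         tp[label] / r_den if r_den else 0.0)
    return result

# Hypothetical relation labels: a model over-predicting "causal" gets
# high causal recall but lower causal precision, mirroring the slight
# causal preference discussed above.
gold = ["causal", "temporal", "causal", "temporal", "causal"]
pred = ["causal", "causal", "causal", "temporal", "causal"]
print(per_class_pr(gold, pred))
```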
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions and Future Works</title>
      <sec id="sec-5-1">
        <title />
        <p>In this paper, we presented the ExpliCITA dataset, the
Italian translation of ExpliCa [1]. The dataset is designed
to evaluate explicit temporal and causal reasoning in
LLMs. We also replicated part of the experiments made
on ExpliCa with several LLMs, including i.) natively
trained Italian models, ii.) multilingual models fine-tuned
on Italian, iii.) multilingual open-weights models, iv.) a
multilingual reasoning open-weights model, and v.)
closed-weights commercial models from OpenAI.</p>
        <p>Our findings can be summarized as follows. First,
consistently with [1], we observe two key facts. On the one
hand, all tested models, including GPT, struggle to solve
the task, in Italian more so than in English, both in the
zero- and few-shot settings. We also see that this
struggle is partly due to their inability to reliably provide the
answers required by the task, which is only partially
alleviated by using the decoding method of Outlines. On
the other hand, we observe that the linguistic competence
of models, measured with the APS, is consistently better
than the respective performance when prompted.
However, we see that this gap between APS and prompted
accuracy tends to shrink with model size.</p>
        <p>Second, we observe that native Italian models are no
better than the fine-tuned alternatives when it comes to
the ExpliCITA PCD task.</p>
        <p>Third, we see that leveraging test-time compute
appears to be beneficial for the task, possibly suggesting
that reasoning training is important to boost the
ability to recognize semantic relations between events,
even when these are linguistically expressed. We plan
to conduct a more systematic investigation of the effects
of both chain-of-thought reasoning and Outlines across
different models and languages. This will include an
in-depth error analysis aimed at understanding when and
why such prompting strategies are effective, and whether
their benefits depend on the structure of the prompt, the
language used for reasoning (e.g., English vs. Italian), or
the intrinsic capabilities of the models themselves.</p>
        <p>Finally, we observe a slight improvement in
managing the causal aspect of the relationship rather than the
temporal one, as highlighted by the per-class performances.</p>
        <p>In the future, we plan to systematically compare the
results obtained without chat-specific templating to those
obtained by prompting each model using its native chat
format. This will help better isolate the impact of
instruction tuning and formatting on model performance.</p>
        <p>Furthermore, although a direct comparison with
traditional NLP systems was beyond the scope of this work,
future research could explore whether LLMs provide a
competitive advantage in explicit causal reasoning (i.e.,
without task-specific training) compared to lightweight,
specialized models. Finally, as part of future work, we
plan to experiment with implicit causality as well. We
also aim to further explore the impact of reasoning and
test-time compute on the performance of models on both
explicit and implicit causal relations.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title />
        <p>This work has been supported by the PNRR MUR project
PE0000013-FAIR (Spoke 1), funded by the European
Commission under the NextGeneration EU programme, and
the EU EIC project EMERGE (Grant No. 101070918).
</p>
        <p>B. Han, Unveiling causal reasoning in large language models: Reality or mirage?, Advances in Neural Information Processing Systems 37 (2024) 96640–96670.</p>
        <p>[17] D. Mariko, H. Abi Akl, K. Trottier, M. El-Haj, The financial causality extraction shared task (fincausal 2022), in: Proceedings of the 4th Financial Narrative Processing Workshop @ LREC2022, 2022, pp. 105–107.</p>
        <p>[18] A. Romanou, S. Montariol, D. Paul, L. Laugier, K. Aberer, A. Bosselut, Crab: Assessing the strength of causal relationships between real-world events, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 15198–15216.</p>
        <p>[19] P. Hosseini, D. A. Broniatowski, M. Diab, Predicting directionality in causal relations in text, arXiv preprint arXiv:2103.13606 (2021).</p>
        <p>[20] V. D. Lai, A. P. B. Veyseh, M. Van Nguyen, F. Dernoncourt, T. H. Nguyen, Meci: A multilingual dataset for event causality identification, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 2346–2356.</p>
        <p>[21] J. Dunietz, L. Levin, J. G. Carbonell, The because corpus 2.0: Annotating causality and overlapping relations, in: Proceedings of the 11th Linguistic Annotation Workshop, 2017, pp. 95–104.</p>
        <p>[22] Q. Ning, Z. Feng, H. Wu, D. Roth, Joint reasoning for temporal and causal relations, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2278–2288.</p>
        <p>[23] P. Mirza, R. Sprugnoli, S. Tonelli, M. Speranza, Annotating causality in the tempeval-3 corpus, in: Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL), 2014, pp. 10–19.</p>
        <p>[24] T. Caselli, P. Vossen, The event storyline corpus: A new benchmark for causal and temporal relation extraction, in: Proceedings of the Events and Stories in the News Workshop, 2017, pp. 77–86.</p>
        <p>[25] N. Mostafazadeh, A. Grealish, N. Chambers, J. Allen, L. Vanderwende, Caters: Causal and temporal relation scheme for semantic annotation of event structures, in: Proceedings of the Fourth Workshop on Events, 2016, pp. 51–61.</p>
        <p>[26] M. Roemmele, C. A. Bejan, A. S. Gordon, Choice of plausible alternatives: An evaluation of commonsense causal reasoning, in: 2011 AAAI Spring Symposium Series, 2011.</p>
        <p>[27] L. Du, X. Ding, K. Xiong, T. Liu, B. Qin, e-care: a new dataset for exploring explainable causal reasoning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 432–446.</p>
        <p>[28] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv preprint arXiv:2206.04615 (2022).</p>
        <p>[29] J. R. Portela, N. Perez, R. Manrique, Esnlir: A spanish multi-genre dataset with causal relationships, arXiv preprint arXiv:2503.08803 (2025).</p>
        <p>[30] I. Rehbein, J. Ruppenhofer, A new resource for german causal language, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 5968–5977.</p>
        <p>[31] J. Sadek, F. Meziane, Learning causality for arabic proclitics, Procedia Computer Science 142 (2018) 141–149.</p>
        <p>[32] Z. Rahimi, M. ShamsFard, Persian causality corpus (percause) and the causality detection benchmark, arXiv preprint arXiv:2106.14165 (2021).</p>
        <p>[33] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, V. Stoyanov, Xnli: Evaluating cross-lingual sentence representations, arXiv preprint arXiv:1809.05053 (2018).</p>
        <p>[34] B. Y. Lin, S. Lee, X. Qiao, X. Ren, Common sense beyond english: Evaluating and improving multilingual language models for commonsense reasoning, arXiv preprint arXiv:2106.06937 (2021).</p>
        <p>[35] L. C. Passaro, M. Di Maro, V. Basile, D. Croce, Lessons learned from evalita 2020 and thirteen years of evaluation of italian language technology, IJCoL. Italian Journal of Computational Linguistics 6 (2020) 79–102.</p>
        <p>[36] J. Bos, F. M. Zanzotto, M. Pennacchiotti, Textual entailment at evalita 2009, Proceedings of EVALITA 2009 (2009) 2.</p>
        <p>[37] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4791–4800. URL: https://aclanthology.org/P19-1472/. doi:10.18653/v1/P19-1472.</p>
        <p>[38] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, A. Korhonen, Xcopa: A multilingual dataset for causal commonsense reasoning, arXiv preprint arXiv:2005.00333 (2020).</p>
        <p>[39] G. Attanasio, M. La Quatra, A. Santilli, B. Savoldi, et al., Itaeval: A calamita challenge, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), 2024.</p>
        <p>[40] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, et al., Calamita: Challenge the abilities of language models in italian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, 2024.</p>
        <p>[41] G. Pensa, B. Altuna, I. Gonzalez-Dios, A multi-layered approach to physical commonsense understanding: Creation and evaluation of an italian dataset, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 819–831.</p>
        <p>[42] L. D. Wanzare, A. Zarcone, S. Thater, M. Pinkal, A crowdsourced database of event sequence descriptions for the acquisition of high-quality script knowledge, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 3494–3501.</p>
        <p>[43] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, 2024, pp. 707–719. URL: https://aclanthology.org/2024.clicit-1.77/.</p>
        <p>[44] A. Team, Almawave presents velvet: The sustainable and high-performance italian ai, 2025. URL: https://www.almawave.com.</p>
        <p>[45] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in italian language, 2023. arXiv:2312.09993.</p>
        <p>[46] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the italian language: Llamantino-3-anita, 2024. arXiv:2405.07101.</p>
        <p>[47] F. A. Galatolo, M. G. Cimino, Cerbero-7b: A leap forward in language-specific llms through enhanced chat corpus generation and evaluation, arXiv preprint arXiv:2311.15698 (2023).</p>
        <p>[48] A. G., et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.</p>
        <p>[49] G. Team, Gemma 3 technical report, 2025. URL: https://arxiv.org/abs/2503.19786. arXiv:2503.19786.</p>
        <p>[50] DeepSeek-AI, Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL: https://arxiv.org/abs/2501.12948. arXiv:2501.12948.</p>
        <p>[51] OpenAI, Gpt-4o system card, 2024. URL: https://arxiv.org/abs/2410.21276. arXiv:2410.21276.</p>
        <p>[52] OpenAI, Introducing gpt-4.1 in the api, 2025. URL: https://openai.com/index/gpt-4-1/.</p>
        <p>[53] A. Saxena, P. Bhattacharyya, Hallucination detection in machine generated text: A survey (2024).</p>
        <p>[54] C. Zheng, H. Zhou, F. Meng, J. Zhou, M. Huang, Large language models are not robust multiple choice selectors, in: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024, OpenReview.net, 2024. URL: https://openreview.net/forum?id=shr9PXz7T0.</p>
        <p>A. Prompt template</p>
        <p>An example of the ExpliCITA PCD task, framed as a
multiple-choice prompting task, is provided in the box
below.</p>
        <p>Multiple-choice Prompt
# Compito di scelta multipla
[…]
## Formato del Compito
Frase 1: [Frase 1]
Frase 2: [Frase 2]
Opzioni:
A. [parola A]
B. [parola B]
C. [parola C]
D. [parola D]
Risposta: [Lettera dell’opzione corrispondente alla parola corretta]
{% if examples %}
## Esempi
{% for example in examples %}
### Esempio
Frase 1: {{ example.S1 }}
Frase 2: {{ example.S2 }}
Opzioni:
A. {{ example.option_A }}
B. {{ example.option_B }}
C. {{ example.option_C }}
D. {{ example.option_D }}
Risposta: {{ example.correct_answer }}
{% endfor %}
{% endif %}
## Istruzioni
1. Leggi attentamente il Compito, la Frase 1 e la Frase 2;
2. Seleziona l’opzione che meglio collega le due frasi in maniera logica;
3. ATTENZIONE: rispondi con la formattazione "Risposta: " seguita **SOLO** dalla lettera dell’opzione (A, B, C, o D) corrispondente alla parola scelta, ad esempio "Risposta: C".
## Compito:
Frase 1: {{ sentence_a }}
Frase 2: {{ sentence_b }}
Opzioni:
{{ options }}
Risposta:</p>
        <p>Declaration on Generative AI</p>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using
these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>