<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLMs Struggle on Explicit Causality in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martina Miliani</string-name>
          <email>martina.miliani@fileli.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Paglione</string-name>
          <email>l.paglione1@studenti.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Serena Auriemma</string-name>
          <email>serena.auriemma@phd.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <email>lucia.passaro@unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The ability to recognize and interpret causal relations is fundamental for building robust intelligent systems. Recent research has focused on developing benchmarks and tasks to evaluate the inferential and causal reasoning capabilities of LLMs, such as the Pairwise Causal Discovery (PCD) task. However, most of these resources are limited to English. In this paper, we present ExpliCITA, a translation of the English ExpliCa dataset [1], which is the first publicly available dataset for joint temporal-causal reasoning in Italian, enabling evaluation of LLMs on Italian PCD. We conduct an extensive empirical study across 20 Italian and multilingual models of varying sizes and training strategies, combining a perplexity-based evaluation of causal reasoning competence with multiple-choice prompting tasks in both zero-shot and few-shot settings. Our results show that all tested models, including the GPT family, struggle with the ExpliCITA PCD task, more so than with the original English ExpliCa, in both evaluation scenarios. Moreover, native Italian models do not outperform fine-tuned multilingual alternatives. Consistent with prior findings, we observe that the linguistic competence of models, measured using perplexity-based metrics, is higher than their respective performances, measured via accuracy on prompting results; however, this gap tends to narrow with increasing model size. Finally, a per-class performance analysis reveals that models handle causal relations relatively better than temporal ones.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>Causal Reasoning</kwd>
        <kwd>Language Resources</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Benchmarking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>A common way to frame this task is to give two sentences as input to the model (i.e., “Martina has
less chances of getting the flu” and “Martina has been
vaccinated against the flu”), and to ask the model, with a yes/no
question, whether the first sentence is a consequence of the second
(in this case, ground truth: “yes”) [1, 10].</p>
      <p>Temporality plays a crucial role in the context of
causality, as every causal relation inherently implies a temporal
one: If an event A causes an event B, A must
necessarily occur (or begin to occur) before B. Conversely, the
presence of a temporal relation between two events does
not necessarily imply a causal link. For this reason, we
extended the PCD task to include the identification of
temporal relations, to explicitly disentangle the interplay
between causality and temporal sequencing.</p>
      <p>To address this issue, in previous works we introduced
the ExpliCa (Explicit Causality) benchmark [1], offering
a more controlled experimental setup that jointly
addresses temporal and causal reasoning. ExpliCa presents
pairs of sentences, each describing a distinct event,
without any surface-level linguistic cues for temporal and
causal relations, except for a connective that explicitly
encodes both the type of relation (i.e., causal or temporal)
and the order between the two events. For example, in [1],
we asked the models to choose which of four connectives
(so, because, then, and after) best represents the relation
between the sentences “Martina has less chances of
getting the flu” and “Martina has been vaccinated against the
flu” (in this case, ground truth: “because”).</p>
      <p>Despite this progress, resources for joint temporal-causal reasoning remain scarce in languages other than English.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <title>Recognizing causal relations is a core human cognitive</title>
        <p>skill. Causal understanding is in fact fundamental to
intelligent reasoning [2]. Thus, a strong AI system should
be capable of performing causal reasoning.</p>
        <p>The past few years have in fact seen a vigorous debate
about the extent to which large language models (LLMs)
are actually capable of genuine inference, beyond mere
pattern matching [3, 4, 5]. Among the inferences a model
should be able to perform lies the causal one. Therefore,
several benchmarks targeting causality have emerged
recently [6, 7, 8].</p>
        <p>A popular evaluation paradigm for causal reasoning is
Pairwise Causal Discovery (PCD), which aims to detect
pairwise causal relations from observational data. In a
PCD task a model must determine if a causal link exists
between two events, along with the direction of causality
[9, 10]. A common way to frame this task is to give
in languages other than English. At the same time, datasets focus on presenting a contextual scenario to test
a rich ecosystem of LLMs pre-trained on, or adapted causal inference [14, 6, 15, 16, 17, 18], while others
chalto, languages other than English, including Italian, is lenge NLP systems to identify causal relations directly
rapidly emerging. on the text [19, 16, 20], also along with temporal ones</p>
        <p>To partially fill this gap, we introduce ExpliCITA [21, 22, 23, 24, 25]. ExpliCITA stems from ExpliCa [1], a
(Explicit Causality in ITAlian). ExpliCITA is an Ital- dataset developed to evaluate the ability of LLMs to detect
ian adaptation of ExpliCa and we believe it is the first explicit causal and temporal relations between events. In
benchmark dedicated to joint temporal and causal rea- ExpliCa, relations are annotated via crowdsourcing and
soning in Italian. are signaled exclusively through a connective linking a</p>
        <p>We also leverage the evaluation framework for ExpliCa pair of sentences, carefully stripped of any additional
to conduct the first large-scale evaluation of Italian lan- contextual or lexical cues. This controlled setup
miniguage models on causal reasoning. The framework allows mizes the influence of surrounding context and enables
us to test both competence (what the model “knows” a more focused assessment of the model’s reasoning on
about the probability distribution of linguistic events) explicit relational cues.
via perplexity, and performance (how it applies that Due to its design, ExpliCITA shares its structure with
“knowledge”) via prompting [11, 12]. Specifically, the other datasets that frame implicit causal relations in a
prompting task is formulated as a multiple-choice task, sentence-pair format, where each sentence expresses an
where models have to select the appropriate connective individual event. Notable among these are the COPA
in a cloze-style prompt. We explore diferent generation dataset [26], the e-CARE dataset [27], and tasks from the
settings: greedy decoding and the Outlines framework BIG-Bench benchmark [28], which also test models on
[13], under both zero- and few-shot regimes. Our evalua- explicit causal reasoning. COPA and e-CARE were both
tion includes a total 20 models across a spectrum of sev- incorporated into the original ExpliCa dataset.
eral sizes and training approaches: i.) seven native Italian While resources for English are abundant, the
availabilmodels trained from scratch, ii.) four multilingual models ity of non-English datasets for causal reasoning remains
ifne -tuned on Italian, iii.) three open-weights multilin- limited. Nevertheless, contributions exist for Spanish
gual models, iv.) an open-weight reasoning-specialized [29], German [30], Arabic [31], and Persian [32]. Among
LLM, and v.) five commercial systems from the GPT multilingual eforts, MECI [ 20] stands out as a resource
family. where causal relations are annotated across several
lan</p>
        <p>We make both the data and code available on GitHub guage editions of Wikipedia.
to replicate our experiments.1 Causal reasoning, and related tasks such as Pairwise
Our contribution is twofold: Causal Discovery (PCD), belongs to a broader class of
inference-based tasks in natural language understanding.
• we present ExpliCITA, the first dataset for joint These tasks aim to evaluate a model’s ability to derive
temporal-causal reasoning in Italian; implicit information from textual input, whether through
• we deliver an extensive empirical study across logical entailment, causal attribution, or commonsense
20 Italian and multilingual models, following a associations. Within this wider inference landscape,
Natrobust evaluation framework combining an evalu- ural Language Inference (NLI) benchmarks like XNLI
ation via perplexity with multiple-choice prompt- [33] test models on cross-lingual entailment across 15
ing in several settings. This allows us to highlight languages, while datasets such as X-CSQA [34] focus
strengths, weaknesses, and performance varia- on cross-lingual commonsense reasoning in a
questiontion across model types and sizes. answering format.</p>
        <p>In the Italian context[35], the first dataset for textual</p>
        <p>The remainder of the paper is organized as follows: entailment was introduced during the EVALITA 2009
Section 2 reviews related work; Section 3 introduces the evaluation campaign, comprising 800 sentence pairs
deExpliCITA dataset; Section 4 details the experimental rived from Wikipedia revision histories [36]. More
resetup; and Section 5 presents and discusses the results. cently, the HellaSwag-it dataset, an adaptation of the
original HellaSwag dataset [37], was developed to test
commonsense inference by asking models to choose the
2. Related Works most plausible ending to a given scenario. Additionally,
for causal reasoning, the COPA dataset was translated
The study of causality and its linguistic expressions has into Italian (and other languages) as part of the XCOPA
recently regained momentum, particularly in the con- project [38]. Both XCOPA-it and HellaSwag-it were
intext of evaluating the reasoning capabilities of large lan- tegrated into ItaEval [39], a benchmark for evaluating
guage models (LLMs). In this domain, many evaluation LLMs on Italian commonsense and factual reasoning.
1https://github.com/Unipisa/ExpliCITA ItaEval was featured in the 2024 Italian NLP evaluation
campaign, CALAMITA [40], which included a wide range are: so (Causal, Iconic), because (Causal, Anti-Iconic), then
of datasets to test commonsense and factual knowledge. (Temporal, Iconic), and after (Temporal, Anti-Iconic).
Among them, Gita [41] is particularly relevant here: it A defining feature of the dataset is that the connective
focuses on physical commonsense in Italian, present- is the sole linguistic cue indicating the semantic
relaing pairs of plausible and implausible stories composed tion between sentence pairs. To ensure a controlled and
of sentence sequences. To the best of our knowledge, challenging evaluation of causal reasoning, the dataset
ExpliCITA is the first dataset specifically dedicated to excludes any additional explicit marker, such as causal
evaluating explicit causal and temporal reasoning in a verbs, and removes anaphoric references by avoiding
controlled setting for the Italian language. personal pronouns. This design compels models to rely
exclusively on event semantics and the connective itself,
without support from broader contextual cues.
3. The ExpliCITA Dataset The dataset was then annotated via crowdsourcing by
English native speakers. Specifically, annotators were
The ExpliCITA Dataset is a direct translation of ExpliCa asked to rate the acceptability of a sentence pair linked by
[1]. The original dataset was designed as a benchmark one of the connectives. Each sentence pair, in both orders,
for evaluating explicit causal reasoning in LLMs, with a
particular focus on distinguishing causal relations from witeitmhsa)llwpaosssriabtleedcboynn1e5ctpiavretsic(i6p0a0n×ts.2F×or4ea=ch4s8e0n0tetontcael
temporal ones, using the PCD task. A thorough descrip- pair in both orders of presentation, the connective with
tion of the dataset and its properties is reported in [1]. In the highest acceptability rating was considered as the
the following, we highlight some of its key aspects. ground truth. Note that the ground truth based on human</p>
        <p>Approximately a third of the items in ExpliCa are based ratings do not overlap perfectly with the original
distincon other existing datasets [42, 28, 27]. The remaining two tion in CAUSAL, TEMPORAL, and UNRELATED groups
thirds are manually crafted. In total, 600 items are in the made by authors when building the sentence pairs.
dataset. Each item of the dataset comprises a sentence To build ExpliCITA from ExpliCa, we followed a
semipair S1 and S2, where each sentence describes an event. automatic translation procedure. First, we used ChatGPT</p>
        <p>The dataset has two key dimensions, namely the type via the web interface2 to translate each sentence from
of relation and the order of presentation. As for the type of the 600 pairs independently. Then, each sentence was
relation, the items were selected by authors to be equally manually evaluated to address errors in the automatic
divided into three main subsets: i.) CAUSAL, where the translation. Errors ranged from mistakes in gender
asrelationship is causal, and possibly of temporal prece- signment (e.g., “Luca è stata [...]”) to completely missing
dence; ii.) TEMPORAL, where the relation is only of tem- idiomatic expressions (e.g., “Marco ran the red light”,
poral precedence, without causality; iii.) UNRELATED,
that includes thematically related sentences that are nei- translated as “Marco ha corso la luce rossa” instead of
ther causally nor temporally related. Potential biases in
“lMataiorncosènpeaesdseadtomcoalnruosaslov”)e.rAificasitgionnific.aFnotrnEuxmpblieCrIoTfAtr,awnselexical elements are controlled for using Mutual Informa- used the following four connectives:
tion between lexical elements of the sentence pairs. This
is done to avoid having very diferent lexemes in the UN- Quindi - Indicates a causal relation in the iconic order.
RELATED group with respect to the other groups. The The event in S1 causes the event in S2.
diferences in the association strengths between lexemes
in the three groups are not statistically significant. Perché - Indicates a causal relation in the anti-iconic</p>
        <p>As for the order of presentation, it can be either order. The event in S1 is caused by the event in S2.
ICONIC (in short form Ic), if the sequence of events ex- E poi - Indicates a temporal relation in the iconic order.
pressed in the two sentences matches their chronological The event in S1 temporally precedes the event in S2.
and/or logical-causal order (e.g., “S1 then S2”), or
ANTIICONIC (in short form, A-Ic), if the sequence of events Dopo che - Indicates a temporal relation in the
antiexpressed in the two sentences is inverted compared iconic order. The event in S1 follows the event in S2.
to their chronological and/or logical-causal order (e.g.,
the efect is mentioned before the cause: “S2 because The choice of multi-token expression for the temporal
S1”). Note that, for each sentence pair, the dataset in- connectives is due to the fact that no suficiently frequent
cludes both the Iconic and Anti-Iconic order for a total single word in Italian conveys the proper meaning.
of 600 × 2 = 1, 200 items. ExpliCITA includes each sentence pair in both orders</p>
        <p>The type of relation and the order of presentation are of presentation. Thus, the number of data points is 600 ×
expressed via one out of four connectives, that act as lin- 2 = 1, 200. We consider as our ground truth the results
guistic cues to explicitly signal the nature of the relation- of the crowdsourcing experiment for ExpliCa [1]. In
ship. In the English version of the dataset, the connectives</p>
      </sec>
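      <p>For illustration, a single ExpliCITA data point can be represented as the following record; the field names are hypothetical (not the dataset's actual schema), and the example pair is our Italian rendering of the “Martina” sentences used throughout the paper:</p>

```python
# Hypothetical shape of one ExpliCITA data point; field names are
# illustrative, not the dataset's actual schema.
item = {
    "s1": "Martina è stata vaccinata contro l'influenza",
    "s2": "Martina ha meno probabilità di prendere l'influenza",
    "order": "iconic",            # or "anti-iconic"
    "group": "CAUSAL",            # CAUSAL, TEMPORAL, or UNRELATED
    "gold_connective": "quindi",  # quindi, perché, e poi, dopo che
}

# Mapping of the four Italian connectives to (relation type, order),
# as defined in Section 3.
CONNECTIVES = {
    "quindi": ("causal", "iconic"),
    "perché": ("causal", "anti-iconic"),
    "e poi": ("temporal", "iconic"),
    "dopo che": ("temporal", "anti-iconic"),
}

relation, order = CONNECTIVES[item["gold_connective"]]
```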
      <sec id="sec-2-2">
        <title>2Accessed on December 2024</title>
        <p>Group
Connective
Quindi (Caus., Ic)
Perché (Caus., A-Ic)
E poi (Temp., Ic)
Dopo che (Temp., A-Ic)
CAUSAL TEMPORAL UNRELATED Total</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Setting</title>
      <p>In the task, the model is presented with S1 and S2 and a list of choices,
each representing a connective. The task is to provide
the correct choice. We experiment in both zero-shot and
few-shot scenarios. For the few-shot, the models saw one
example for each connective, for a total of four examples.
To avoid biases in the choices, both the order of the options
to choose from and the position of the correct answer
are randomized. Note, however, that all models saw the
exact same prompt for any item in the dataset. We use
accuracy as our main metric. To distinguish it from APS,
we refer to values obtained via prompting as Accuracy
on Prompt Execution (APX).</p>
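      <p>For illustration, the option randomization described above can be sketched as follows; the wording, labels, and variable names are ours, not the exact template of Appendix A:</p>

```python
import random

# Illustrative sketch of the multiple-choice item construction: option
# order and gold position are randomized per item.
CONNECTIVES = ["quindi", "perché", "e poi", "dopo che"]
LETTERS = ["A", "B", "C", "D"]

def build_item(s1, s2, gold, rng):
    options = CONNECTIVES.copy()
    rng.shuffle(options)                        # randomize option order
    lines = [f"{letter}) {opt}" for letter, opt in zip(LETTERS, options)]
    prompt = f"Frase 1: {s1}\nFrase 2: {s2}\n" + "\n".join(lines)
    gold_letter = LETTERS[options.index(gold)]  # track the gold position
    return prompt, gold_letter

rng = random.Random(0)
prompt, gold = build_item(
    "Martina è stata vaccinata contro l'influenza",
    "Martina ha meno probabilità di prendere l'influenza",
    "quindi",
    rng,
)
```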
      <p>The template for the prompt is shown in Appendix A.
We used the Jinja template syntax.3 The prompt is not a
direct translation, but it is heavily inspired by the one used
in [1]. First, we provide the models with the description
and format of the task; for the few-shot scenario, we
provide the examples; then, we give clear instructions
on how to complete the task; finally, we describe the
task.</p>
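      <p>As an illustration of this template structure, the following sketch uses Python's stdlib string.Template in place of Jinja; all section wording and field names are ours, not the actual Appendix A template:</p>

```python
from string import Template

# Illustrative stand-in for the Jinja prompt template: task description,
# optional few-shot examples, instructions, then the item to complete.
TEMPLATE = Template("$description\n\n$examples$instructions\n\n$item\nRisposta: ")

def render_prompt(description, instructions, item, examples=""):
    # In the few-shot setting, `examples` holds one solved item per connective.
    if examples:
        examples = examples + "\n\n"
    return TEMPLATE.substitute(
        description=description,
        examples=examples,
        instructions=instructions,
        item=item,
    )

zero_shot = render_prompt(
    "Scegli il connettivo corretto tra le due frasi.",
    "Rispondi con una sola lettera (A, B, C o D).",
    "Frase 1: ...\nFrase 2: ...\nA) quindi\nB) perché\nC) e poi\nD) dopo che",
)
```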
      <sec id="sec-3-1">
        <title>The goal of our experiments is to test LLMs on the PCD</title>
        <p>task of the ExpliCITA dataset from two perspectives. On
the one hand, we want to assess the linguistic
competence of the model: the fact that it encodes some
linguistic knowledge about causal and temporal relations. fine-tuned models, we used a template that would enable
We do so by leveraging a perplexity-based evaluation. also pre-trained only models to answer. Note that we did
On the other hand, we want to address the actual per- not implement specific templating strategies (e.g., chat
formance of the model on the dataset. We do so via a formatting, special tokens, etc.) for any model, and we
prompt-based evaluation in which the model has to fed all the models with exactly the same prompt.
solve our PCD task, by identifying the correct connective The only exception was GPT, which was prompted
for a sentence pair. Our main goal is to evaluate Italian using the chat format, as required by the model’s API.
LLMs on Italian data. In addition to native Italian LLMs, However, the content of the prompt was the same as the
we also consider other model classes. Specifically, we ac- one used for all other models, without the addition of any
count for i.) Italian fine-tuned models, i.e. open-weights custom system messages, special tokens, or
instructionmodels fine-tuned on Italian, ii.) open-weights multilin- specific formatting.
gual models, iii.) open-weights reasoning models, and iv.) We used a markdown-like syntax to highlight the
secclosed commercial models. All tested models are listed tions of the prompt. We acknowledge that not formatting
in Section 4.1. the prompt for each model may hinder performances in
some cases. However, we argue that this ensure a more
fair evaluation. The only exception was made for the
Perplexity-Based Evaluation. This experiment is an reasoning model, for which we also include the &lt;think&gt;
exact replica of the one conducted in [1]. For each sen- token at the end of the prompt, to ensure that the
Chaintence pair in the dataset (i.e., in both orders of presen- of-thought is started.
tation), we derive one sentence for each connective, in We used a greedy decoding strategy for all
experithe form “S1 {{ connective }} S2”. We obtain ments, that is we always sample the next most likely
1, 200 × 4 = 4, 800 sentences in total. For each of them, token at each generation step. We let each model
generwe compute a model’s perplexity (PPL) over the whole ate a maximum of 20 tokens in their response. For the
sentence. We then rank the four sentences based on PPL, reasoning model, we let it generate a maximum of 10,000
and consider the one with the lowest value as the “model tokens. All models, with the exception of GPT variants,
connective choice”. Finally, we compute the accuracy of where used in their HuggingFace implementation.4
the model choices against the ground truth. We call this A notable issue with unconstrained text generation is
metric Accuracy on Perplexity Score (APS). that less performing models may yield text that do not
conform to the standard asked for in the prompt. This
remains true also for cases, like ours, where the expected
answer can be the direct continuation of the prompt,
rather than the answer to a question or the turn in a
Prompt-Based Evaluation. For the prompt-based
evaluation, we asked the models to identify the correct
connective to use between S1 and S2. We chose to focus
on a standard multiple-choice task, as it is one of the most
widely used formats for evaluating LLMs, and replicates
one of the prompting experiments in [1]. In the task, the
3https://jinja.palletsprojects.com
4huggingface.co
conversation. To alleviate this issue, we proceeded in cerbero-7b variants [47]. They are respectively fine
two ways. First, we implemented a post-processing strat- tuned versions of LLaMA-2, LLaMA-3 and Mistral.
egy based on a set of regular expressions to parse each
model response and extract one answer. The regexes Open LLMs: We also evaluated the performances of
were designed to extract one and only one option from strong contenders in the Open LLM space. To do so, we
the generated text. In cases where multiple answers or no selected Meta’s LLaMA-3.1-8B [48] and two versions of
answer were detected, it was counted as a mistake for the Google’s Gemma3 [49], namely the 4B and 12B ones.
model. In Section 5, we report the results of the model af- Reasoning LLMs: We also tested one reasoning model,
ter this post-processing. Some models consistently failed namely DeepSeek-R1-Distill-Llama-8B [50], a
disto provide appropriate answers in this setting. tilled version of DeepSeek-R1 using LLaMA-3.1-8B. This</p>
        <p>Second, we employed Outlines [13],5 a Python library allows us to explore how reasoning impact performances
built to provide structured text generation with LLMs on our PCD task.
(e.g., with type constraints, following regular expressions,
or providing json-formatted outputs). In the case of mul- Commercial models: Finally, we tested the
GPTtiple choices, it uses masking on the output probabilities 4x family as representative of commercial
closedto restrict the model outputs to a set of valid completions source models. We evaluated both gpt-4o and
[13]. In our case, the possible completions are the “A”, gpt-4o-mini [51], and all the GPT-4.1 variants
“B”, “C”, and “D” options for the tasks. This approach (gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano) [52].
has become quite popular in the literature and has been
adopted in several recent studies on generative LLMs Depending on its size, each model required a time
[12]. Note that Outlines was not used for the GPT vari- between 0.5 and 1 GPU hours to complete its run, that
ants and one of the open-weights tested models, namely includes both the zero-shot and few-shot experiments,
Gemma3. In fact, all GPT models consistently yielded each consisting of: i.) generation with greedy decoding;
properly formatted outputs, making an additional evalu- ii.) generation with Outlines; and iii.) PPL scores
compuation redundant (recall that the next-token prediction is tation. The DeepSeek-R1-Distill-Llama-8B model
performed in a greedy fashion) and economically costly. required around 10 GPU hours in total, due to its much
Moreover, a known bug in the current Outlines and Hug- higher demand for test-time compute. Experiments with
gingFace implementations prevents all Gemma3 models the GPT-4x family were conducted using the oficial
Opeto be run through Outlines at this stage. nAI Batch API.6 The code for replicating the experiments
is available on GitHub.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Tested Models</title>
        </sec>
      </sec>
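      <p>The perplexity-based ranking behind APS (building an “S1 connective S2” variant for each connective and keeping the lowest-perplexity one) can be sketched as follows; token_logprobs is a toy stand-in for a real language model's token scores, and the Italian sentences are our rendering of the paper's running example:</p>

```python
import math

# Toy stand-in for a language model's per-token log-probabilities;
# in the actual experiments these come from the evaluated LLM.
def token_logprobs(sentence):
    tokens = sentence.split()
    return [math.log(1.0 / (i + 2)) for i in range(len(tokens))]

def perplexity(sentence):
    # PPL = exp of the mean negative token log-probability.
    lps = token_logprobs(sentence)
    return math.exp(-sum(lps) / len(lps))

def aps_choice(s1, s2, connectives):
    # Build "S1 connective S2" for each connective and keep the variant
    # with the lowest perplexity as the model's connective choice.
    scored = {c: perplexity(f"{s1} {c} {s2}") for c in connectives}
    return min(scored, key=scored.get)

choice = aps_choice(
    "Martina è stata vaccinata contro l'influenza",
    "Martina ha meno probabilità di prendere l'influenza",
    ["quindi", "perché", "e poi", "dopo che"],
)
```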
      <sec id="sec-3-2">
        <title>We chose to experiment on a variety of models and model</title>
        <p>classes, to gain a broader and clearer picture of the
problem. Our main goal was to evaluate native Italian LLMs
on the PCD task. Thus, we considered the following
native Italian model families/variants:</p>
      </sec>
      <sec id="sec-3-3">
        <title>In this Section we present and discuss the results. We</title>
        <p>ifrst look at the overall results based on Accuracy of
models on the PCD task of ExpliCITA, in terms of both i.)
linguistic competence with APS, and ii.) performance
Minerva [43]. We considered all model sizes of the with APX in zero- and few- shot experiments, with and
Minerva family (from 350M to 7B), including both the without Outlines. Then, we present additional results by
Instruction fine-tuned and pre-trained only ones. considering two aspects. On the one hand, we look at the
distribution of answers for each model, to highlight
posVelvet [44]. We experimented with both available mod- sible biases and failures in providing an answer. On the
els, namely Velvet-2B and Velvet-14B. other hand, we look at per-class performances, to
understand whether the tested LLMs show biases in modelling
specific aspects of temporal and causal reasoning.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>We highlight that we were not able to ran experiments</title>
        <p>on the Italia-9B model due to issues with its loading via
the HuggingFace library.</p>
        <p>We also chose to experiment with non-native Italian
models for a clear and fair comparison. These can be
distinguished into four classes:
Italian Fine-Tuned models: This class includes
LLaMAntino-2-chat-7b-hf-UltraChat-ITA [45],
LLaMAntino-3-ANITA-8B-Inst-DPO-ITA [46] and</p>
      </sec>
      <sec id="sec-4-2">
        <title>5https://github.com/dottxt-ai/outlines</title>
        <sec id="sec-4-2-1">
          <title>5.1. Overall Results</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Our main findings for the evaluation of LLMs on Ex</title>
        <p>pliCITA are summarised in Figure 1. The Figure shows
the Accuracy of all tested models, in all scenarios. We
divide the plot by model family, and sort each family by
the model size.</p>
      </sec>
      <sec id="sec-4-4">
        <p>6https://platform.openai.com/docs/guides/batch</p>
        <p>The results are in line with the experiments reported for ExpliCa [1]. We highlight several interesting aspects in the following. The use of Outlines appears to be beneficial mostly where zero- or few-shot performances are quite low (e.g., below 0.1). In other cases, the use of Outlines seems less influential. Nevertheless, the same accuracy may be obtained from a significantly different distribution of answers, as will be discussed in the following Sections.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Overall performance. As for the raw performances,</title>
        <p>all models except the GPT-4x family show rather poor
or at least somewhat brittle performances. The only Model sizes. As shown in [1], we observe that the size
models capable of approaching GPT-level performances of the model is relevant for its downstream performances.
are DeepSeek-R1 and Gemma3-12B. However, this is In the open-weights model classes, the two best
performachieved either with the inclusion of reasoning for ing models are Gemma3 and Velvet, respectively in the
DeepSeek, or only in a specific setting for Gemma. 12B and 14B variants. Both also display above average
APS scores. However, it is also interesting to note that
while Gemma3-4B was not able to solve the task at all,
the 2B variant of Velvet was consistent in its performance,
which closely match those of some larger models.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Zero- vs Few-Shot. As for the diference in zero-shot</title>
        <p>and few-shot settings, the GPT-4x family is again the
only one where there is a clear and consistent trend,
in this case in favour of the few-shot setting. In other
cases, the few-shot examples are not always beneficial: Competence vs. Performance. It is important to
nofor some models (e.g., Gemma3-12B, LLaMAntino-2 and tice that APS is always better than APX, with the sole
Minerva-3B) it appears to be detrimental, while for other exception of the Gemma-3-12B model. This further
corit is inefective. However, for Minerva-7B we observe roborates some of the findings in [ 1]: while models’
interthat while for the pre-trained variant the examples are nal representations and probability distribution encode,
detrimental, this is not true for the instruction-tuned one. at least to some extent, knowledge about causal and
temThis is possibly due to the instruction-tuning dataset of poral relations, this knowledge is not fully accessed via
the model. prompting. This is also in line with other research [11].
Moreover, it was shown in [1] that the gap between APS
Impact of Outlines. It appears to be beneficial mostly and APX shrinks with the size of the model. Given the
for cases where zero- or few-shot performances are quite wide array of tested open-weights model, we can further
corroborate this hypothesis by looking at Figure 2. We
can clearly see that the rate of improvement in APX as
models grow in size (red trendline) is higher than their
respective rate of improvements in APS (blue trendline)
on the task.</p>
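        <p>The trendline comparison of Figure 2 can be made concrete with a least-squares slope of accuracy against log model size; the model sizes and accuracy values below are invented toy numbers for illustration, not our results:</p>

```python
import math

# Least-squares slope of accuracy against log model size.
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

sizes_b = [0.35, 1, 3, 7, 14]          # model sizes in billions (toy)
aps = [0.30, 0.33, 0.36, 0.40, 0.43]   # invented APS values
apx = [0.10, 0.16, 0.24, 0.33, 0.41]   # invented APX values

log_sizes = [math.log(s) for s in sizes_b]
aps_slope = slope(log_sizes, aps)
apx_slope = slope(log_sizes, apx)

# A steeper APX trendline means the APS-APX gap narrows as size grows.
gap_narrows = apx_slope > aps_slope
```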
      </sec>
      <sec id="sec-4-7">
        <title>We also highlight the following relevant findings associated with specific model classes:</title>
        <p>Italian Models are weak; Native Italian pre-training
is not beneficial. Native Italian models do not show
relevant improvements with respect to fine-tuned
alternatives, neither at the same size, nor at larger sizes. The
Velvet family appears to provide relatively solid results at
all scales; in contrast, smaller models in the Minerva
family appear to be less robust on ExpliCITA. The fine-tuned
Italian models display similar, if not better, performances
than native ones. This could lead us to question whether
it’s truly necessary to train LLMs from scratch on Italian
data. Results suggest that, albeit limited to this case study,
it is not.</p>
        <p>GPTs struggle. On ExpliCa, the GPT model family
displayed performances that could not reach 0.8 accuracy
[1]. Changing the language of the dataset and the prompt
highlights a stark contrast: the drop in performance for
the same models is around 0.20 points, and even newer
models cannot reach 70% accuracy. Considering the
fact that the task has remained exactly the same, and that
GPT “speaks” fluent Italian, this may be an indication that
current LLMs are still limited in terms of actual causal
reasoning, and still reliant on their internal probabilistic
representations of texts.</p>
        <p>Test-time compute is beneficial. We observed that
the performance of the distilled DeepSeek-R1 drastically
improves when it is allowed to use its “reasoning” abilities.
This is particularly interesting, as it somewhat contrasts
with the expectation that the task does not require particular
forms of reasoning, which may instead be required when
modelling phenomena such as implicit causal relations.
This issue will be further addressed in future work. We
also note that while answers were provided in Italian,
the chain-of-thought enclosed in the &lt;think&gt; tokens is
almost exclusively in English.</p>
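        <p>For such reasoning models, scoring requires stripping the chain-of-thought before parsing the answer letter; a minimal sketch of this step (our own illustration, not the paper’s actual post-processing code) might look as follows.</p>

```python
import re

def parse_answer(raw):
    # Drop any <think>...</think> chain-of-thought, then look for the
    # required "Risposta: X" pattern with X in {A, B, C, D}.
    visible = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"Risposta\s*:\s*([ABCD])\b", visible)
    return match.group(1) if match else None

raw = "<think>First I compare the two clauses... B fits.</think>Risposta: B"
print(parse_answer(raw))  # → B
```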
        <sec id="sec-4-7-1">
          <title>5.2. Additional Analyses</title>
        </sec>
      </sec>
      <sec id="sec-4-8">
        <title />
        <p>Besides evaluating the accuracy of models on ExpliCITA, we also consider two other aspects that allow us to further understand the behaviour of the tested models in our setting.</p>
      </sec>
      <sec id="sec-4-9">
        <title>Distribution of Answers</title>
        <p>First, we explore how models actually answered the multiple-choice task. The
distribution of answers with greedy decoding and with
outlines in the zero-shot setting is shown in Figure 3. We
leave out the visualization of the few-shot setting due to
space limitations, but they are very similar in nature.</p>
      </sec>
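        <p>The distribution in Figure 3 reduces to a frequency count over the parsed model outputs; the sketch below illustrates the idea on a made-up answer list (not data from the paper), with None marking responses from which no valid option letter could be extracted.</p>

```python
from collections import Counter

# Hypothetical parsed answers for one model; None = no valid letter found.
answers = ["A", "C", "C", None, "A", "C", "B", None, "C", "A"]

counts = Counter("invalid" if a is None else a for a in answers)
total = len(answers)
distribution = {opt: counts.get(opt, 0) / total
                for opt in ["A", "B", "C", "D", "invalid"]}

# A skewed distribution (e.g., mostly "A" or "C") signals positional bias;
# a high "invalid" share signals failure to follow the output format.
print(distribution)
```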
      <sec id="sec-4-10">
        <title />
        <p>We observe that some models consistently fail to provide an adequate answer, thus drastically lowering their
performances. For example, it is possible that when
ANITA actually answered, it did so correctly, but it was
able to answer only a very small fraction of the questions.</p>
        <p>Moreover, although we applied post-processing to the
model responses (see Sec. 4), we still observed persistent
failure modes, primarily due to the model’s inability to
follow the expected output format. Such behaviors can be
broadly described as faithful hallucinations caused by
instructional inconsistency [53], in which the model’s
output is not properly aligned with the user’s request. These
failures often consisted in limitations in the number of
requested output tokens, which the models were unable
to respect, unintended rewriting of the input question,
or, more generally, a lack of adherence to the structure
and intent of the prompt.</p>
        <p>We also observe that several models have a strong
preference for a specific answer, which is often either “A”
or “C”. This is in line with research on biases in
multiple-choice tasks [54]. This is corroborated by the fact that,
even with Outlines, these models still tend to prefer a
specific answer over the others.</p>
      </sec>
      <sec id="sec-4-11">
        <title>Per-class Performances</title>
        <p>Finally, Figure 4 shows the
Precision and Recall performances of each model, divided
by class. Again, we look at the zero-shot scenario and
leave out the few-shot one due to space limitations. By
looking at the plot, three main observations can be made.
First, the GPT-4x models are the most consistent across
classes, with only a few notable exceptions for the
smallest models. Second, we observe that some of the models
display a relatively strong bias towards a single answer or a
pair of answers. Finally, if we zoom out and look at the bigger
picture, we see that models have a slight preference
towards causal relationships. The least biased models are the
two biggest ones, namely gpt-4.1 and gpt-4o. This may
further suggest that at smaller scales models rely more
on distributional properties of words (e.g., causal
connectives often imply a temporal relationship as well, but not
vice versa) and are more sensitive to frequency effects
linked to word combinations frequently encountered
during training. In Italian, in fact, causal connectives such as
“perché” (“because”) or “quindi” (“therefore”) are often used in syntactic
constructions where the premise is explicitly connected to the
consequence via one of these two connectives. The
construction “S1 connective S2” is therefore typical
for causal relationships.</p>
        <p>In contrast, there is greater variability in how
temporal sequential relationships can be expressed in
Italian. These can be conveyed through temporal
conjunctions such as “e poi” (“and then”) or “dopo che” (“after”),
as well as through verbs and adverbial expressions
such as “precedentemente” (“previously”), “successivamente”
(“subsequently”), or “poco fa” (“a short while ago”).
Equally frequent are cases in which temporal relations
are conveyed solely through verb tense agreement between
the two clauses, for instance, through a past–present
combination to express anteriority between S1 and S2.
Compared to causal relationships, the temporal dimension
is thus more susceptible to variability, both in terms of
the range of constructions available to express the same
temporal relation in Italian, and in terms of the diversity
of contexts in which the same temporal adverb might occur.</p>
        <p>Indeed, while causality pertains to a subset of verbs
and situational contexts, temporal information, whether
implicit or explicit, is present in all events expressed by
a verb. This variability affects the generalization
capabilities of the models, especially the smaller ones. In fact,
larger models seem better able to properly evaluate the
context and identify the correct relationship between
events.</p>
      </sec>
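        <p>The per-class Precision and Recall plotted in Figure 4 reduce to counts of true and false positives per label; the gold/predicted pairs below are invented solely to illustrate the computation.</p>

```python
from collections import defaultdict

def per_class_pr(gold, pred):
    # Precision and Recall per label from paired gold/predicted lists.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    result = {}
    for label in set(gold) | set(pred):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        result[label] = (tp[label] / p_den if p_den else 0.0,
                         tp[label] / r_den if r_den else 0.0)
    return result

# Hypothetical relation labels: a model over-predicting "causal" gets
# high causal recall but lower causal precision, mirroring the slight
# causal preference discussed above.
gold = ["causal", "temporal", "causal", "temporal", "causal"]
pred = ["causal", "causal", "causal", "temporal", "causal"]
print(per_class_pr(gold, pred))
```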
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions and Future Works</title>
      <sec id="sec-5-1">
        <title />
        <p>In this paper, we presented the ExpliCITA dataset, the
Italian translation of ExpliCa [1]. The dataset is designed
to evaluate explicit temporal and causal reasoning in
LLMs. We also replicated part of the experiments made
on ExpliCa with several LLMs, including i.) natively
trained Italian models, ii.) multilingual models fine-tuned
on Italian, iii.) multilingual open-weights models, iv.) a
multilingual reasoning open-weights model, and v.)
closed-weights commercial models from OpenAI.</p>
        <p>Our findings can be summarized as follows. First,
consistently with [1], we observe two key facts. On the one
hand, all tested models, including GPT, struggle to solve
the task, in Italian more so than in English, both in the
zero- and few-shot settings. We also see that this
struggle is partly due to their inability to reliably provide the
answers required by the task, which is only partially
alleviated by using the decoding method of Outlines. On
the other hand, we observe that the linguistic competence
of models, measured with the APS, is consistently better
than the respective performance when prompted.
However, we see that this gap between APS and prompted
accuracy tends to shrink with model size.</p>
        <p>Second, we observe that native Italian models are no
better than the fine-tuned alternatives when it comes to
the ExpliCITA PCD task.</p>
        <p>Third, we see that leveraging test-time compute
appears to be beneficial for the task, possibly suggesting
that reasoning training is important to boost the
ability to recognize semantic relations between events,
even when these are linguistically expressed. We plan
to conduct a more systematic investigation of the effects
of both chain-of-thought reasoning and Outlines across
different models and languages. This will include an
in-depth error analysis aimed at understanding when and
why such prompting strategies are effective, and whether
their benefits depend on the structure of the prompt, the
language used for reasoning (e.g., English vs. Italian), or
the intrinsic capabilities of the models themselves.</p>
        <p>Finally, we observe a slight improvement in
managing the causal aspect of the relationship rather than the
temporal one, as highlighted by the per-class performances.</p>
        <p>In the future, we plan to systematically compare the
results obtained without chat-specific templating to those
obtained by prompting each model using its native chat
format. This will help better isolate the impact of
instruction tuning and formatting on model performance.</p>
        <p>Furthermore, although a direct comparison with
traditional NLP systems was beyond the scope of this work,
future research could explore whether LLMs provide a
competitive advantage in explicit causal reasoning (i.e.,
without task-specific training) compared to lightweight,
specialized models. Finally, as part of future work, we
plan to experiment with implicit causality as well. We
also aim to further explore the impact of reasoning and
test-time compute on the performance of models on both
explicit and implicit causal relations.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title />
        <p>This work has been supported by the PNRR MUR project
PE0000013-FAIR (Spoke 1), funded by the European
Commission under the NextGeneration EU programme, and
the EU EIC project EMERGE (Grant No. 101070918).
</p>
        <p>B. Han, Unveiling causal reasoning in large language models: Reality or mirage?, Advances in Neural Information Processing Systems 37 (2024) 96640–96670.</p>
        <p>[17] D. Mariko, H. Abi Akl, K. Trottier, M. El-Haj, The financial causality extraction shared task (fincausal 2022), in: Proceedings of the 4th Financial Narrative Processing Workshop @ LREC2022, 2022, pp. 105–107.</p>
        <p>[18] A. Romanou, S. Montariol, D. Paul, L. Laugier, K. Aberer, A. Bosselut, Crab: Assessing the strength of causal relationships between real-world events, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 15198–15216.</p>
        <p>[19] P. Hosseini, D. A. Broniatowski, M. Diab, Predicting directionality in causal relations in text, arXiv preprint arXiv:2103.13606 (2021).</p>
        <p>[20] V. D. Lai, A. P. B. Veyseh, M. Van Nguyen, F. Dernoncourt, T. H. Nguyen, Meci: A multilingual dataset for event causality identification, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 2346–2356.</p>
        <p>[21] J. Dunietz, L. Levin, J. G. Carbonell, The because corpus 2.0: Annotating causality and overlapping relations, in: Proceedings of the 11th Linguistic Annotation Workshop, 2017, pp. 95–104.</p>
        <p>[22] Q. Ning, Z. Feng, H. Wu, D. Roth, Joint reasoning for temporal and causal relations, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2278–2288.</p>
        <p>[23] P. Mirza, R. Sprugnoli, S. Tonelli, M. Speranza, Annotating causality in the tempeval-3 corpus, in: Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL), 2014, pp. 10–19.</p>
        <p>[24] T. Caselli, P. Vossen, The event storyline corpus: A new benchmark for causal and temporal relation extraction, in: Proceedings of the Events and Stories in the News Workshop, 2017, pp. 77–86.</p>
        <p>[25] N. Mostafazadeh, A. Grealish, N. Chambers, J. Allen, L. Vanderwende, Caters: Causal and temporal relation scheme for semantic annotation of event structures, in: Proceedings of the Fourth Workshop on Events, 2016, pp. 51–61.</p>
        <p>[26] M. Roemmele, C. A. Bejan, A. S. Gordon, Choice of plausible alternatives: An evaluation of commonsense causal reasoning, in: 2011 AAAI Spring Symposium Series, 2011.</p>
        <p>[27] L. Du, X. Ding, K. Xiong, T. Liu, B. Qin, e-care: a new dataset for exploring explainable causal reasoning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 432–446.</p>
        <p>[28] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv preprint arXiv:2206.04615 (2022).</p>
        <p>[29] J. R. Portela, N. Perez, R. Manrique, Esnlir: A spanish multi-genre dataset with causal relationships, arXiv preprint arXiv:2503.08803 (2025).</p>
        <p>[30] I. Rehbein, J. Ruppenhofer, A new resource for german causal language, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 5968–5977.</p>
        <p>[31] J. Sadek, F. Meziane, Learning causality for arabic proclitics, Procedia Computer Science 142 (2018) 141–149.</p>
        <p>[32] Z. Rahimi, M. ShamsFard, Persian causality corpus (percause) and the causality detection benchmark, arXiv preprint arXiv:2106.14165 (2021).</p>
        <p>[33] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, V. Stoyanov, Xnli: Evaluating cross-lingual sentence representations, arXiv preprint arXiv:1809.05053 (2018).</p>
        <p>[34] B. Y. Lin, S. Lee, X. Qiao, X. Ren, Common sense beyond english: Evaluating and improving multilingual language models for commonsense reasoning, arXiv preprint arXiv:2106.06937 (2021).</p>
        <p>[35] L. C. Passaro, M. Di Maro, V. Basile, D. Croce, Lessons learned from evalita 2020 and thirteen years of evaluation of italian language technology, IJCoL. Italian Journal of Computational Linguistics 6 (2020) 79–102.</p>
        <p>[36] J. Bos, F. M. Zanzotto, M. Pennacchiotti, Textual entailment at evalita 2009, Proceedings of EVALITA 2009 (2009) 2.</p>
        <p>[37] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4791–4800. URL: https://aclanthology.org/P19-1472/. doi:10.18653/v1/P19-1472.</p>
        <p>[38] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, A. Korhonen, Xcopa: A multilingual dataset for causal commonsense reasoning, arXiv preprint arXiv:2005.00333 (2020).</p>
        <p>[39] G. Attanasio, M. La Quatra, A. Santilli, B. Savoldi, et al., Itaeval: A calamita challenge, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), 2024.</p>
        <p>[40] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, et al., Calamita: Challenge the abilities of language models in italian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, 2024.</p>
        <p>[41] G. Pensa, B. Altuna, I. Gonzalez-Dios, A multi-layered approach to physical commonsense understanding: Creation and evaluation of an italian dataset, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 819–831.</p>
        <p>[42] L. D. Wanzare, A. Zarcone, S. Thater, M. Pinkal, A crowdsourced database of event sequence descriptions for the acquisition of high-quality script knowledge, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 3494–3501.</p>
        <p>[43] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, 2024, pp. 707–719. URL: https://aclanthology.org/2024.clicit-1.77/.</p>
        <p>[44] A. Team, Almawave presents velvet: The sustainable and high-performance italian ai, 2025. URL: https://www.almawave.com.</p>
        <p>[45] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in italian language, 2023. arXiv:2312.09993.</p>
        <p>[46] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the italian language: Llamantino-3-anita, 2024. arXiv:2405.07101.</p>
        <p>[47] F. A. Galatolo, M. G. Cimino, Cerbero-7b: A leap forward in language-specific llms through enhanced chat corpus generation and evaluation, arXiv preprint arXiv:2311.15698 (2023).</p>
        <p>[48] A. G., et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.</p>
        <p>[49] G. Team, Gemma 3 technical report, 2025. URL: https://arxiv.org/abs/2503.19786. arXiv:2503.19786.</p>
        <p>[50] DeepSeek-AI, Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL: https://arxiv.org/abs/2501.12948. arXiv:2501.12948.</p>
        <p>[51] OpenAI, Gpt-4o system card, 2024. URL: https://arxiv.org/abs/2410.21276. arXiv:2410.21276.</p>
        <p>[52] OpenAI, Introducing gpt-4.1 in the api, 2025. URL: https://openai.com/index/gpt-4-1/.</p>
        <p>[53] A. Saxena, P. Bhattacharyya, Hallucination detection in machine generated text: A survey (2024).</p>
        <p>[54] C. Zheng, H. Zhou, F. Meng, J. Zhou, M. Huang, Large language models are not robust multiple choice selectors, in: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024, OpenReview.net, 2024. URL: https://openreview.net/forum?id=shr9PXz7T0.</p>
        <p>A. Prompt template</p>
        <p>An example of the ExpliCITA PCD task, framed as a
multiple-choice prompting task, is provided in the box
below.</p>
        <p>Multiple-choice Prompt
# Compito di scelta multipla
[…]
## Formato del Compito
Frase 1: [Frase 1]
Frase 2: [Frase 2]
Opzioni:
A. [parola A]
B. [parola B]
C. [parola C]
D. [parola D]
Risposta: [Lettera dell’opzione corrispondente alla parola corretta]
{% if examples %}
## Esempi
{% for example in examples %}
### Esempio
Frase 1: {{ example.S1 }}
Frase 2: {{ example.S2 }}
Opzioni:
A. {{ example.option_A }}
B. {{ example.option_B }}
C. {{ example.option_C }}
D. {{ example.option_D }}
Risposta: {{ example.correct_answer }}
{% endfor %}
{% endif %}
## Istruzioni
1. Leggi attentamente il Compito, la Frase 1 e la Frase 2;
2. Seleziona l’opzione che meglio collega le due frasi in maniera logica;
3. ATTENZIONE: rispondi con la formattazione "Risposta: " seguita **SOLO** dalla lettera dell’opzione (A, B, C, o D) corrispondente alla parola scelta, ad esempio "Risposta: C".
## Compito:
Frase 1: {{ sentence_a }}
Frase 2: {{ sentence_b }}
Opzioni:
{{ options }}
Risposta:</p>
        <p>Declaration on Generative AI</p>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using
these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>