<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Giulia</forename><surname>Pensa</surname></persName>
							<email>giulia.pensa.tr@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ekhi</forename><surname>Azurmendi</surname></persName>
							<email>ekhi.azurmendi@ehu.eus</email>
							<affiliation key="aff1">
								<orgName type="department">HiTZ Center - Ixa</orgName>
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Julen</forename><surname>Etxaniz</surname></persName>
							<email>julen.etxaniz@ehu.eus</email>
							<affiliation key="aff1">
								<orgName type="department">HiTZ Center - Ixa</orgName>
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Begoña</forename><surname>Altuna</surname></persName>
							<email>begona.altuna@ehu.eus</email>
							<affiliation key="aff1">
								<orgName type="department">HiTZ Center - Ixa</orgName>
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Itziar</forename><surname>Gonzalez-Dios</surname></persName>
							<email>itziar.gonzalezd@ehu.eus</email>
							<affiliation key="aff1">
								<orgName type="department">HiTZ Center - Ixa</orgName>
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<!-- Removed spurious <author> entries aff2-aff10: Italian example-story sentences (figure/table text) misparsed as author affiliations during extraction -->
						<title level="a" type="main">GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">94F39287F948EBAF15175662BFC91C55</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Physical commonsense reasoning</term>
					<term>large language models</term>
					<term>Italian benchmark</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the context of the CALAMITA Challenge, we investigate the physical commonsense reasoning capabilities of large language models (LLMs) and introduce a methodology to assess their understanding of the physical world. To this end, we use a test set designed to evaluate physical commonsense reasoning in LLMs for the Italian language. We present a tiered dataset, named the Graded Italian Annotated dataset (GITA), which is written and annotated by a professional linguist. This dataset enables us to focus on three distinct levels of commonsense understanding. Our benchmark aims to evaluate three specific tasks: identifying plausible and implausible stories within our dataset, identifying the conflict that generates an implausible story, and identifying the physical states that make a story implausible. We perform these tasks using LLAMA3, Gemma2 and Mistral. Our findings reveal that, although the models may excel at high-level classification tasks, their reasoning is inconsistent and unverifiable, as they fail to capture intermediate evidence.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Challenge: Introduction and Motivation</head><p>Physical commonsense understanding refers to the ability to comprehend the physical world and the events that transpire within it. This capability is a crucial component of human intelligence, enabling us to reason about our environment, anticipate future occurrences, and navigate our surroundings effortlessly, and recently there has been notable advancement in the development of large language models (LLMs) that can produce human-like language and execute a variety of language-related tasks.</p><p>LLMs have exhibited promising outcomes in grasping common sense in particular situations <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. Nevertheless, it is widely recognized that the most precise evaluation of their capabilities is attained when assessing their performance in specific end tasks <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>. The evaluation often emphasizes the capacity of LLMs to replicate relatively straightforward tasks, rather than their authentic proficiency in reasoning and comprehending language <ref type="bibr" target="#b5">[5,</ref><ref type="bibr" target="#b6">6]</ref>. As a result, there remains uncertainty regarding machines' ability to truly perform reasoning and whether the existing issues in this regard have been sufficiently addressed.</p><p>In this context, our aim is to contribute to this challenge developing an original Italian benchmark that can be used to assess the ability of language models to understand physical commonsense in a more truthful way, focusing not only on end tasks, but also on intermediate layer tasks.</p><p>In this paper, we present GITA4CALAMITA, the Graded Italian Annotated dataset for the CALAMITA challenge <ref type="bibr" target="#b7">[7]</ref>. 
GITA4CALAMITA is an adapted version of the GITA dataset proposed in <ref type="bibr" target="#b8">[8]</ref>. In particular, we decided to revise the physical states annotation and adapt it to this challenge. The first version of the GITA dataset is available in our repository under the license CC BY-NC-SA 4.0. The GITA4CALAMITA dataset is manually compiled by a professional linguist, which allows for this multi-layered evaluation of the reasoning process. With the creation of an Italian dataset we gain the linguistic and cultural perspective of Italian, while commonsense research in Natural Language Processing (NLP) has largely been focused on the English language.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Challenge: Description</head><p>Our aim in this challenge is to assess the understanding of physical commonsense in LLMs for Italian. We configure our assessment proposal in the following terms:</p><p>1. given an original dataset of plausible/implausible stories related to physical commonsense, systems must identify the plausible and implausible stories; 2. systems must recognize the conflicting sentences that generate the conflict in implausible stories; 3. systems must spot the underlying physical states that cause conflict in implausible stories.</p><p>The recognition of plausible/implausible stories is the end task envisaged in this benchmark, which must be justified by the second-level and third-level steps. In Figure <ref type="figure" target="#fig_0">1</ref> we present a story pair from the GITA4CALAMITA dataset and the relation between the layers of annotation. Story A is a plausible story, Story B is the corresponding implausible story where the first and the second sentences are in conflict: Marco closes the refrigerator and cannot take the milk out of it. In the right part of the figure we can see the reasoning steps that the system must follow and resolve. This example is presented in English for clarity, but our entire dataset is in Italian.</p><p>We introduce a series of tasks that constitute a humaninterpretable reasoning process, supported by a chain of evidence, reflecting the assessment methodology outlined above. To explain this approach, we present the tasks from the deepest to the shallowest, mirroring human reasoning:</p><p>Physical state classification: Leveraging our physical state annotations, systems must recognize the involved physical states in the conflicting sentences of implausible stories. 
If we look at the example in 1, we are able to identify the problematic physical state "open" as cause of implausibility.</p><p>Conflict detection: Next, the task of conflict detection entails identifying sentence pairs of the form Si → Sj. Here, Sj represents the breakpoint, indicating the point at which the story becomes implausible based on the given context. Si serves as the evidence that explains the breakpoint, typically causing a conflicting world state.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Story classification:</head><p>The end task revolves around determining the plausibility of two stories. This determination is based on the conflicts detected within the two stories. By considering the presence of conflicts, the model can assess the viability and coherence of each story, facilitating the classification of the more plausible one.</p><p>By incorporating physical state classification, conflict detection, and story classification, we analyze the aspects of coherent reasoning, supported by evidence-driven analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head><p>The GITA4CALAMITA dataset is composed by plausible and implausible stories. To compose the dataset, we focused on concrete actions that could be visualized in the physical world, avoiding mental actions such as "to think" or "to like". We created 5-sentence stories, giving context and requiring reasoning over multiple sentences. In all the stories, we avoided nonsensical sentences, in fact, each sentence is plausible alone, but could be implausible if associated with another specific sentence in an implausible story. With these characteristics, the task requires reasoning over the entire context.</p><p>An essential part of our evaluation process is constituted by the presence of physical state annotation. Systems must identify the underlying physical states that make a story not plausible in our physical world. During the creation of this dataset, we took into account 14 physical attributes that were included in the annotation phase, and we composed stories that contained those attributes. Following the work of <ref type="bibr" target="#b9">[9]</ref> and <ref type="bibr" target="#b10">[10]</ref>, these are the 14 physical states that we wanted to have in our stories:</p><p>• location, conscious, dressed, wet, exist, clean, power, functional, in pieces, open, temperature, solid, occupied, edible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Dataset creation</head><p>In the first two rows of Table <ref type="table">1</ref> we can see an example of a plausible story from the GITA4CALAMITA dataset</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Example of a plausible story, an implausible story from the Order dataset, and an implausible story from the Cloze dataset.</p><p>together with the English translation. In this example, the human actor is Marco, and the five sentences are ordered in the required way: the action of opening something, picking something up and using it. We can see that some of the previously listed physical states appear: Marco is conscious because he is doing something, the refrigerator is open because the actor can take something out of it, the cup is not occupied by anything and can be functional.</p><p>We aimed to minimize subjectivity and limit potential confounding factors from complex language usage. By using simple language, we were able to shift our focus away from linguistic processing and semantic phenomena, allowing us to concentrate more on examining machines' reasoning abilities, particularly their physical commonsense understanding. Consequently, we created our simple sentences in a straightforward declarative structure, typically starting with the agent of the story, followed by a verb, a direct object, and optionally, an indirect object.</p><p>Implausible stories are built upon the plausible ones, preserving the same actor and objects; in doing so we ensured that implausible variations remained coherent and believable, and we avoided nonsensical information. To create implausible stories, we implemented two different methods:</p><p>1. we switched the order of two sentences; 2. we substituted a plausible sentence with an implausible one.</p><p>These two methods resulted in two different partitions of our dataset: the Order dataset of implausible stories, and the Cloze dataset of implausible stories respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Order implausible stories</head><p>The plausible stories only work in the causal sequence that we created. In the first row of Table 1, there is an example of a plausible story. In the third row, we see the corresponding implausible story for the order dataset, in which Marco, first, takes the milk out from the refrigerator and then opens the refrigerator, generating a physically impossible situation: it is not possible to take something out of a closed refrigerator. By switching the first and the second sentences, we created an implausible story. In the entire dataset, we decided to generate implausible stories changing the order of only two sentences per story.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">Cloze implausible stories</head><p>The second approach involves the substitution of a sentence from the plausible story with a new sentence. Although the new sentence itself is not inherently implausible, its placement within the sequence renders it implausible. In Table <ref type="table">1</ref>, the first sentence of the line F (Cloze), in the fifth row, was changed: Marco closes the refrigerator before taking out the milk. Again, the action is physically impossible: if the refrigerator is closed, nothing can be taken out from it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Origin of data</head><p>GITA4CALAMITA is a new version of <ref type="bibr" target="#b8">[8]</ref>, which is based on <ref type="bibr" target="#b11">[11]</ref>. Our main objective was to create an Italian dataset, manually annotated, to assess a pre-trained language model on physical commonsense tiered tasks. To create the stories, we took inspiration from the Story Cloze Test <ref type="bibr" target="#b12">[12]</ref> and ROCStories Corpora <ref type="bibr" target="#b13">[13]</ref>. The Story Cloze Test compiles four-sentence stories with a missing ending so that a system chooses the most appropriate conclusion; the ROCStories Corpora is composed of five-sentence stories about everyday life for story generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Annotation details</head><p>GITA4CALAMITA is annotated on three levels. In the first level, we annotated the plausibility/implausibility of a story with TRUE or FALSE. In the second level, in implausible stories we indicated between which sentences the conflict was, and in the third level we labelled the involved physical states in each sentence.</p><p>In the dataset, a plausible story is identified using a story number, while implausible stories are identified using the same story number as the plausible version, but with an additional C or O after the story number, where the letter C refers to the Cloze dataset, and the letter O refers to the Order dataset. Each story has been annotated using these elements: story id, worker id, actor of the story, objects of the story, physical states, sentences of the story, as well as number of sentences, and conflicting sentences, among others. The complete list and the specific meaning of each element are in Appendix A.</p><p>In each implausible story, we annotated the physical state that caused a conflict between two sentences. We annotated both Order and Cloze implausible stories according to the corresponding physical state involved. If we consider the stories in Table <ref type="table">1</ref>, both implausible stories (C and O) are annotated using the physical state "open", In fact, in both implausible stories the conflict is related to the openness of the refrigerator: in both cases the refrigerator appears closed when Marco tries to take the milk out of it. There are cases where for one plausible story there are two implausible stories that are implausible for two different reasons, hence the annotated physical state is different.</p><p>To ensure consistency and reduce human effort, we developed a custom environment and a Python script to streamline the annotation process. 
This semi-automated annotation process helped us process sentences from different story types, extract entities and actors, and organize them for manual annotation. The script provided a user-friendly terminal interface, and it is available in our repository. In terms of annotation efficiency, manually annotating one plausible story and two implausible ones typically took around 50 minutes. However, using our semi-automated annotation interface, we were able to complete the same task in approximately 20 minutes. Consequently, instead of the estimated 100 hours for annotating the entire dataset, we reduced the time to around 40 hours. Additionally, some annotations required review and occasional revisions, hence we estimated that the overall effort was of approximately 50-55 hours. An example of a complete annotation can be found in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Data format</head><p>The GITA4CALAMITA dataset was created and annotated in a JSON format. The following example is story 0-C0 of our dataset, the first implausible Cloze story.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>{</head><p>"0-C0": { "story_id": 0, "worker_id": "GAP", "type": "cloze", "idx": 0, "aug": false, "actor": "Marco", "location": "cucina", "objects": "frigo, latte, tazza, cucchiaio", "sentences": [ "Marco ha chiuso il frigo.", "Marco ha preso il latte dal frigo.", "Marco ha preso la tazza.", "Marco ha preso il cucchiaio.", "Marco ha messo il cucchiaio nella tazza." ], "length": 5, "example_id": "0-C0", "plausible": false, "breakpoint": 1, "confl_sents": [0], "confl_pairs": [0, 1] } }</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Example of prompts used for zero or/and few shots</head><p>For each of the three proposed tasks we use a different prompt:</p><p>• Task 1: Please read the following story and answer if the story is plausible taking into account the order of the events. Please answer with true or false.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Task 2:</head><p>The following story is implausible. Identify the breakpoint, and then select the sentence responsible for the implausibility. Please identify the breakpoint sentence and the conflicting sentence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Task 3:</head><p>The following story is implausible. Identify the physical state that causes the conflict in the story. These are the descriptions of each physical state: Power: Indicates whether an object is powered or not, relevant for electrical devices.</p><p>Location: Refers to the spatial position of an entity, either human or object. Exist: Denotes whether an object is present or has disappeared. Clean: Refers to the cleanliness of an entity, indicating whether it is clean or dirty. Edible: Identifies whether an object is fit for consumption. Wet: Denotes whether an object or person is in a wet or dry state. Functional: Refers to whether an object is in working condition or broken. Wearing: Applies to humans, indicating whether they are dressed or not. Open: Refers to whether an object (e.g., a door or container) is open or closed. Conscious: Denotes whether a human is conscious or unconscious. Temperature: Refers to the relative temperature of an entity, e.g., hot or cold. Solid: Describes whether an object is in a solid state. Occupied: Indicates whether an object (e.g., a container) is occupied or contains something. In pieces: Refers to whether an object is intact or has been broken into pieces. Select one of them after reading the story.</p><p>We select some examples from our GITA4CALAMITA dataset to be used as few-shot examples. For some of the tests we randomly select the examples, for others, we base our choice on their variability. We select stories where all possible combination of conflicting sentences were happening; at the same time, within the selected stories we try to include most of the physical states annotated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.">Detailed data statistics</head><p>The GITA4CALAMITA dataset is an Italian test set composed of a total of 356 stories. The statistics of the GITA4CALAMITA dataset are in Table <ref type="table" target="#tab_0">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Measures</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Metrics</head><p>The metrics involved in our tasks for the GITA4CALAMITA benchmark are the following ones:</p><p>• Accuracy assesses the traditional measure of end task accuracy, which quantifies the proportion of testing examples where plausible stories and implausible stories are accurately identified. • Consistency measures the proportion of testing examples where not only the implausible story is correctly identified, but also the conflicting sentence pair for the implausible story is accurately identified. The aim is to demonstrate the model's consistency in recognizing conflicts when reasoning about plausibility. • Verifiability evaluates the proportion of testing examples where not only the implausible story and the conflicting sentence pair for the implausible story are correctly identified, but also the underlying physical states that contribute to the conflict are accurately identified. This demonstrates that the detected conflict can be validated through a correct understanding of the underlying implausible change of physical states.</p><p>Taking into consideration the three different metrics, in Table <ref type="table" target="#tab_2">3</ref> we report the results in our test set. We perform experiments using the base and instruct Llama 3.1, Gemma 2 and Mistral models of various sizes. Each metric is obtained from a different task, where models are evaluated in the instances that are only guessed correctly in the previous tasks. All tasks are evaluated in a 3-shot setting, using random examples from the test set. For models that support system prompt (Llama3.1 models), the description of each task is included there, for models that do not support it (Gemma2 and Mistral models) the task description is included in the first user input. Each few-shot instance is formatted as a multiturn conversation between user and assistant. 
Next, we describe the main findings from these results.</p><p>Model Size and Performance: Generally, larger models (e.g., Llama-3.1 70B) outperform smaller models across the metrics. The 70B Llama-3.1 models show improvements over their 8B counterparts, particularly in consistency and verifiability. Gemma2 models also show improvements when bigger models are used. There are two exceptions in the case of the accuracy: Gemma2-Instruct 9B and Llama-3.1-Instruct 8B achieve better results than their bigger counterparts Gemma2 27B and Llama3 70B. They also outperform the base models.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>This study has some limitations that should be acknowledged. Firstly, only one prompt was tested for each task, which may not fully capture the potential variability in performance. Additionally, the models used were multilingual but not specifically tailored for the Italian language, potentially affecting the accuracy of the results for Italian-specific tasks. Furthermore, the dataset used in this study was limited to stories within the household domain, which may not generalize well to other contexts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Ethical issues</head><p>The dataset contains stories that may prototypically occur in Italian households. While most of these narratives are likely to be familiar to a broad audience, people from different cultural backgrounds may find some of the stories less frequent.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Representation of story pair from GITA</figDesc><graphic coords="2,89.29,85.19,432.56,71.02" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Statistics of GITA4CALAMITA</figDesc><table><row><cell></cell><cell>GITA4CALAMITA</cell></row><row><cell>plausible stories</cell><cell>117</cell></row><row><cell>implausible stories (ORDER)</cell><cell>122</cell></row><row><cell>implausible stories (CLOZE)</cell><cell>117</cell></row><row><cell>total stories</cell><cell>356</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Results of the base and instruct Llama 3.1, Gemma 2 and Mistral models of various sizes Instruction Tuning Effects: Instruction-tuned versions (e.g., Gemma-2-Instruct, Llama-3.1-Instruct) typically outperform their base counterparts. There are exceptions such as order accuracy for Llama 3.1 70B and Gemma 2 9B. However, Mistral-V0.3-Instruct is very similar or worse than the base model and generally is more biased, it tends to classify as plausible the stories and it performs better in Cloze than in Order. Cloze, Order and Plausible: Most models perform generally better on Cloze examples compared to Order examples. This is consistent across models and metrics. Models are generally better in Cloze and Order than in Plausible. This could be explained by the bias of the models to answer true or false when they are asked if the story is plausible. Models also see double implausible few-shot examples, which could also cause models to give that answer more frequently.</figDesc><table /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been partially funded by: • DeepR3 (TED2021-130295B-C31) funded by MCIN/AEI/10.13039/501100011033 and European Union NextGeneration EU/PRTR. • Disargue (TED2021-130810B-C21) MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR. • DeepKnowledge (PID2021-127777OB-C21) MCIN/AEI/10.13039/501100011033 and by FEDER, EU. • Ixa group A type research group (IT1570-22) Basque Government • IKER-GAITU project 11:4711:23:410:23/0808 by Basque Government</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Annotations in the dataset</head><p>These are the attributes that encode the metadata and linguistic information in the GITA dataset:</p><p>• story_id: refers to the number of the story for both plausible and implausible stories. • worker_id: refers to the name assigned to a specific worker during the creation of the story. • type: refers to cloze or order and it is a label used only in implausible stories. • idx: refers to the implausible dataset, where there is more than one implausible story for a given story number; for example, if we have more than one implausible version of a plausible story (we created more than an implausible story changing the order of our sentences more than once), the index number indicates to which implausible example we are referring. • aug: refers to possible automatic data augmentation techniques that can be taken into account for future works to resolve an overfitting problem. • actor: refers to the human agent of the story.</p><p>• location: refers to the room where the story takes place. • objects: refers to all the inanimate entities that we find into each story. • sentences: includes the 5 sentences in the story.</p><p>• length: refers to the number of sentences in each story. • example_id: corresponds to the story number and includes letters for implausible stories.</p><p>• plausible: is TRUE when the story is plausible and FALSE when it is implausible. • breakpoint: refers to the sentence where the story becomes implausible, where the conflict becomes evident; in plausible stories the breakpoint is always -1. • conlict_sents: refers to the other sentence in the story that together with the breakpoint sentence makes the story implausible; in plausible stories this field is blank. • conlict_pairs: refers to the conflict pair of sentences, gathering the two previous labels; in plausible stories this field is blank. 
• states: includes all the physical states annotations for all the stories. Marco ha preso il latte.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Annotation environment</head><p>Marco ha preso la tazza.</p><p>Marco ha preso il cucchiaio.</p><p>Marco ha messo il cucchiaio nella tazza. length: </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Towards Reasoning in Large Language Models: A Survey</title>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C.-C</forename><surname>Chang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-acl.67</idno>
		<ptr target="https://aclanthology.org/2023.findings-acl.67" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1049" to="1065" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">WinoGrande: An Adversarial Winograd Schema Challenge at Scale</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sakaguchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<idno type="DOI">10.1145/3474381</idno>
		<ptr target="https://doi.org/10.1145/3474381" />
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="99" to="106" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A Review on Fairness in Machine Learning</title>
		<author>
			<persName><forename type="first">D</forename><surname>Pessach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shmueli</surname></persName>
		</author>
		<idno type="DOI">10.1145/3494672</idno>
		<ptr target="https://doi.org/10.1145/3494672" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Benchmarks for Automated Commonsense Reasoning: A Survey</title>
		<author>
			<persName><forename type="first">E</forename><surname>Davis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title/>
		<idno type="DOI">10.1145/3615355</idno>
		<ptr target="https://doi.org/10.1145/3615355" />
		<imprint/>
	</monogr>
	<note>just Accepted</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">How Can We Accelerate Progress Towards Human-like Linguistic Generalization?</title>
		<author>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.465</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.465" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5210" to="5217" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Bender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Koller</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.463</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.463" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5185" to="5198" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-12-06">December 4 -December 6, 2024. 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A Multilayered Approach to Physical Commonsense Understanding: Creation and Evaluation of an Italian Dataset</title>
		<author>
			<persName><forename type="first">G</forename><surname>Pensa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Altuna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gonzalez-Dios</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.74" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="819" to="831" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Physical Causality of Action Verbs in Grounded Language Understanding</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Doering</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P16-1171</idno>
		<ptr target="https://aclanthology.org/P16-1171" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 54th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Berlin, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1814" to="1824" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Simulating Action Dynamics with Neural Process Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bosselut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ennis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<idno>CoRR abs/1711.05313</idno>
		<ptr target="http://arxiv.org/abs/1711.05313" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding</title>
		<author>
			<persName><forename type="first">S</forename><surname>Storks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-emnlp.422</idno>
		<ptr target="https://aclanthology.org/2021.findings-emnlp.422" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4902" to="4918" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">LSDSem 2017 Shared Task: The Story Cloze Test</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mostafazadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Roth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Louis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chambers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Allen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W17-0906</idno>
		<ptr target="https://aclanthology.org/W17-0906" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, Association for Computational Linguistics</title>
				<meeting>the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, Association for Computational Linguistics<address><addrLine>Valencia, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="46" to="51" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mostafazadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chambers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vanderwende</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kohli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Allen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N16-1098</idno>
		<ptr target="https://aclanthology.org/N16-1098" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<meeting>the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics<address><addrLine>San Diego, California</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="839" to="849" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
