<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulia Pensa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ekhi Azurmendi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julen Etxaniz</string-name>
          <email>julen.etxaniz@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Begoña Altuna</string-name>
          <email>begona.altuna@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Itziar Gonzalez-Dios</string-name>
          <email>itziar.gonzalezd@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>In the context of the CALAMITA Challenge, we investigate the physical commonsense reasoning capabilities of large language models (LLMs) and introduce a methodology to assess their understanding of the physical world. To this end, we use a test set designed to evaluate physical commonsense reasoning in LLMs for the Italian language. We present a tiered dataset, named the Graded Italian Annotated dataset (GITA), which is written and annotated by a professional linguist. This dataset enables us to focus on three distinct levels of commonsense understanding. Our benchmark aims to evaluate three specific tasks: identifying plausible and implausible stories within our dataset, identifying the conflict that generates an implausible story, and identifying the physical states that make a story implausible. We perform these tasks using Llama 3, Gemma 2 and Mistral. Our findings reveal that, although the models may excel at high-level classification tasks, their reasoning is inconsistent and unverifiable, as they fail to capture intermediate evidence.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-layered</kwd>
        <kwd>Physical commonsense reasoning</kwd>
        <kwd>large language models</kwd>
        <kwd>Italian benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Challenge: Introduction and Motivation</title>
      <p>Humans have the ability to comprehend the physical world and the events that transpire within it. This capability is a crucial component of human intelligence, enabling us to reason about our environment, anticipate future occurrences, and navigate our surroundings effortlessly. Recently, there has been notable advancement in the development of large language models (LLMs) that can produce human-like language and execute a variety of language-related tasks.</p>
      <p>
        LLMs have exhibited promising outcomes in grasping common sense in particular situations [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Nevertheless, it is widely recognized that the most precise evaluation of their capabilities is attained when assessing their performance on specific end tasks [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The evaluation often emphasizes the capacity of LLMs to replicate relatively straightforward tasks, rather than their authentic understanding.
      </p>
      <p>In this paper, we present GITA4CALAMITA, the Graded Italian Annotated dataset for the CALAMITA challenge [7]. GITA4CALAMITA is an adapted version of the GITA dataset proposed in [8]. In particular, we decided to revise the physical states annotation and adapt it to this challenge. The first version of the GITA dataset is available in our repository under the CC BY-NC-SA 4.0 license. The GITA4CALAMITA dataset is manually compiled by a professional linguist, which allows for this multi-layered evaluation of the reasoning process. With the creation of an Italian dataset we gain the linguistic and cultural perspective of Italian in commonsense research in Natural Language Processing.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Challenge: Description</title>
      <p>Our aim in this challenge is to assess the understanding of
physical commonsense in LLMs for Italian. We configure
our assessment proposal in the following terms:
1. given an original dataset of plausible/implausible
stories related to physical commonsense, systems
must identify the plausible and implausible
stories;
2. systems must recognize the conflicting sentences
that generate the conflict in implausible stories;
3. systems must spot the underlying physical states
that cause conflict in implausible stories.</p>
      <p>Story classification: The end task revolves around
determining the plausibility of two stories. This
determination is based on the conflicts detected within the
two stories. By considering the presence of conflicts,
the model can assess the viability and coherence of each
story, facilitating the classification of the more plausible
one.</p>
      <p>By incorporating physical state classification, conflict
detection, and story classification, we analyze the aspects
of coherent reasoning, supported by evidence-driven
analysis.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Data description</title>
      <p>The recognition of plausible/implausible stories is the end task envisaged in this benchmark, and it must be justified by the second-level and third-level steps. In Figure 1 we present a story pair from the GITA4CALAMITA dataset and the relation between the layers of annotation. Story A is a plausible story; Story B is the corresponding implausible story, in which the first and the second sentences are in conflict: Marco closes the refrigerator and cannot take the milk out of it. In the right part of the figure we can see the reasoning steps that the system must follow and resolve. This example is presented in English for clarity, but our entire dataset is in Italian.</p>
      <p>The GITA4CALAMITA dataset is composed of plausible and implausible stories. To compose the dataset, we focused on concrete actions that could be visualized in the physical world, avoiding mental actions such as “to think” or “to like”. We created 5-sentence stories, giving context and requiring reasoning over multiple sentences. In all the stories, we avoided nonsensical sentences; in fact, each sentence is plausible alone, but could be implausible if associated with another specific sentence in an implausible story. With these characteristics, the task requires reasoning over the entire context.</p>
      <p>An essential part of our evaluation process is constituted by the presence of physical state annotation. Systems must identify the underlying physical states that make a story not plausible in our physical world. During the creation of this dataset, we took into account 14 physical attributes that were included in the annotation phase, and we composed stories that contained those attributes. Following the work of [9] and [10], these are the 14 physical states that we wanted to have in our stories: location, conscious, dressed, wet, exist, clean, power, functional, in pieces, open, temperature, solid, occupied, edible.</p>
      <p>We introduce a series of tasks that constitute a human-interpretable reasoning process, supported by a chain of evidence, reflecting the assessment methodology outlined above. To explain this approach, we present the tasks from the deepest to the shallowest, mirroring human reasoning.</p>
      <p>Physical state classification: Leveraging our physical state annotations, systems must recognize the involved physical states in the conflicting sentences of implausible stories. If we look at the example in Figure 1, we are able to identify the problematic physical state “open” as the cause of implausibility.</p>
      <p>Conflict detection: Next, the task of conflict detection entails identifying sentence pairs of the form Si → Sj. Here, Sj represents the breakpoint, indicating the point at which the story becomes implausible based on the given context. Si serves as the evidence that explains the breakpoint, typically causing a conflicting world state.</p>
      <p>Table 1 presents the plausible story, its Order variant and its Cloze variant, with English translations:
Plausible: (1) Marco ha aperto il frigo. [Marco opened the refrigerator.] (2) Marco ha preso il latte dal frigo. [Marco took the milk from the refrigerator.] (3) Marco ha preso la tazza. [Marco took the cup.] (4) Marco ha versato il latte nella tazza. [Marco poured the milk into the cup.] (5) Marco ha bevuto il latte. [Marco drank the milk.]
Order: (1) Marco ha preso il latte dal frigo. (2) Marco ha aperto il frigo. (3)–(5) as in the plausible story.
Cloze: (1) Marco ha chiuso il frigo. [Marco closed the refrigerator.] (2)–(5) as in the plausible story.</p>
      <sec id="sec-4-15">
        <title>3.1. Dataset creation</title>
        <sec id="sec-4-15-1">
          <p>In the first two rows of Table 1 we can see an example of a plausible story from the GITA4CALAMITA dataset together with the English translation. In this example, the human actor is Marco, and the five sentences are ordered in the required way: the action of opening something, picking something up and using it. We can see that some of the previously listed physical states appear: Marco is conscious because he is doing something, the refrigerator is open because the actor can take something out of it, and the cup is not occupied by anything and can be functional.</p>
          <p>We aimed to minimize subjectivity and limit potential confounding factors from complex language usage. By using simple language, we were able to shift our focus away from linguistic processing and semantic phenomena, allowing us to concentrate on examining machines’ reasoning abilities, particularly their physical commonsense understanding. Consequently, we created our simple sentences in a straightforward declarative structure, typically starting with the agent of the story, followed by a verb, a direct object and, optionally, an indirect object.</p>
          <p>Implausible stories are built upon the plausible ones, preserving the same actor and objects; in doing so we ensured that implausible variations remained coherent and believable, and we avoided nonsensical information. To create implausible stories, we implemented two different methods:
1. we switched the order of two sentences;
2. we substituted a plausible sentence with an implausible one.</p>
          <p>3.1.1. Order implausible stories</p>
          <p>The plausible stories only work in the causal sequence that we created. In the first row of Table 1, there is an example of a plausible story. In the third row, we see the corresponding implausible story for the Order dataset, in which Marco first takes the milk out of the refrigerator and then opens the refrigerator, generating a physically impossible situation: it is not possible to take something out of a closed refrigerator. By switching the first and the second sentences, we created an implausible story. In the entire dataset, we decided to generate implausible stories by changing the order of only two sentences per story.</p>
          <p>3.1.2. Cloze implausible stories</p>
          <p>The second approach involves the substitution of a sentence from the plausible story with a new sentence. Although the new sentence itself is not inherently implausible, its placement within the sequence renders it implausible. In Table 1, the first sentence of line F (Cloze), in the fifth row, was changed: Marco closes the refrigerator before taking out the milk. Again, the action is physically impossible: if the refrigerator is closed, nothing can be taken out of it.</p>
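          <p>The two methods above can be sketched programmatically. The following is an illustrative sketch, not the authors' code; the helper names are our own:</p>

```python
# Illustrative sketch of the two methods for deriving implausible stories
# from a plausible one: swapping two sentences (Order) and substituting
# one sentence (Cloze). Helper names are hypothetical, not from the paper.

def make_order_variant(sentences, i, j):
    """Method 1: swap sentences i and j to build an Order implausible story."""
    variant = list(sentences)
    variant[i], variant[j] = variant[j], variant[i]
    return variant

def make_cloze_variant(sentences, i, new_sentence):
    """Method 2: replace sentence i to build a Cloze implausible story."""
    variant = list(sentences)
    variant[i] = new_sentence
    return variant

plausible = [
    "Marco ha aperto il frigo.",               # Marco opened the refrigerator.
    "Marco ha preso il latte dal frigo.",      # Marco took the milk from the refrigerator.
    "Marco ha preso la tazza.",                # Marco took the cup.
    "Marco ha versato il latte nella tazza.",  # Marco poured the milk into the cup.
    "Marco ha bevuto il latte.",               # Marco drank the milk.
]

# Order: taking the milk now precedes opening the refrigerator.
order_story = make_order_variant(plausible, 0, 1)
# Cloze: the refrigerator is closed before the milk is taken out.
cloze_story = make_cloze_variant(plausible, 0, "Marco ha chiuso il frigo.")
```

          <p>Both variants keep the actor and objects of the plausible story; only the position or identity of one sentence changes.</p>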
          <p>These two methods resulted in two different partitions of our dataset: the Order dataset of implausible stories and the Cloze dataset of implausible stories, respectively.</p>
          <p>3.2. Origin of data</p>
          <p>GITA4CALAMITA is a new version of [8], which is based on [11]. Our main objective was to create a manually annotated Italian dataset to assess a pre-trained language model on tiered physical commonsense tasks. To create the stories, we took inspiration from the Story Cloze Test [12] and the ROCStories Corpora [13]. The Story Cloze Test compiles four-sentence stories with a missing ending so that a system chooses the most appropriate conclusion; the ROCStories Corpora is composed of five-sentence stories about everyday life for story generation.</p>
          <p>3.3. Annotation details</p>
          <p>GITA4CALAMITA is annotated on three levels. On the first level, we annotated the plausibility/implausibility of a story with TRUE or FALSE. On the second level, in implausible stories, we indicated between which sentences the conflict was, and on the third level we labelled the involved physical states in each sentence.</p>
          <p>In the dataset, a plausible story is identified using a
story number, while implausible stories are identified
using the same story number as the plausible version, but
with an additional C or O after the story number, where
the letter C refers to the Cloze dataset, and the letter O
refers to the Order dataset. Each story has been
annotated using these elements: story id, worker id, actor of
the story, objects of the story, physical states, sentences
of the story, as well as number of sentences, and
conflicting sentences, among others. The complete list and the
specific meaning of each element are in Appendix A.</p>
          <p>In each implausible story, we annotated the physical state that caused a conflict between two sentences. We annotated both Order and Cloze implausible stories according to the corresponding physical state involved. If we consider the stories in Table 1, both implausible stories (C and O) are annotated using the physical state “open”. In fact, in both implausible stories the conflict is related to the openness of the refrigerator: in both cases the refrigerator appears closed when Marco tries to take the milk out of it. There are cases where one plausible story has two implausible stories that are implausible for two different reasons; hence, the annotated physical state is different.</p>
          <p>To ensure consistency and reduce human effort, we developed a custom environment and a Python script to streamline the annotation process. This semi-automated annotation process helped us process sentences from different story types, extract entities and actors, and organize them for manual annotation. The script provided a user-friendly terminal interface, and it is available in our repository. In terms of annotation efficiency, manually annotating one plausible story and two implausible ones typically took around 50 minutes. However, using our semi-automated annotation interface, we were able to complete the same task in approximately 20 minutes.</p>
          <p>Consequently, instead of the estimated 100 hours for annotating the entire dataset, we reduced the time to around 40 hours. Additionally, some annotations required review and occasional revisions; hence, we estimated that the overall effort was approximately 50-55 hours. An example of a complete annotation can be found in Appendix B.</p>
          <p>3.4. Data format</p>
          <p>The GITA4CALAMITA dataset was created and annotated in JSON format. The following example is story 0-C0 of our dataset, the first implausible Cloze story.</p>
          <preformat>{
  "0-C0": {
    "story_id": 0,
    "worker_id": "GAP",
    "type": "cloze",
    "idx": 0,
    "aug": false,
    "actor": "Marco",
    "location": "cucina",
    "objects": "frigo, latte, tazza, cucchiaio",
    "sentences": [
      "Marco ha chiuso il frigo.",
      "Marco ha preso il latte dal frigo.",
      "Marco ha preso la tazza.",
      "Marco ha preso il cucchiaio.",
      "Marco ha messo il cucchiaio nella tazza."
    ],
    "length": 5,
    "example_id": "0-C0",
    "plausible": false,
    "breakpoint": 1,
    "confl_sents": [0],
    "confl_pairs": [0, 1]
  }
}</preformat>
          <p>3.5. Example of prompts used for zero and/or few shots</p>
          <p>For each of the three proposed tasks we use a different prompt:</p>
          <p>• Task 1: Please read the following story and answer if the story is plausible taking into account the order of the events. Please answer with true or false.</p>
          <p>• Task 2: The following story is implausible. Identify the breakpoint, and then select the sentence responsible for the implausibility. Please identify the breakpoint sentence and the conflicting sentence.</p>
          <p>• Task 3: The following story is implausible. Identify the physical state that causes the conflict in the story. These are the descriptions of each physical state: Power: Indicates whether an object is powered or not, relevant for electrical devices. Location: Refers to the spatial position of an entity, either human or object. Exist: Denotes whether an object is present or has disappeared. Clean: Refers to the cleanliness of an entity, indicating whether it is clean or dirty. Edible: Identifies whether an object is fit for consumption. Wet: Denotes whether an object or person is in a wet or dry state. Functional: Refers to whether an object is in working condition or broken. Wearing: Applies to humans, indicating whether they are dressed or not. Open: Refers to whether an object (e.g., a door or container) is open or closed. Conscious: Denotes whether a human is conscious or unconscious. Temperature: Refers to the relative temperature of an entity, e.g., hot or cold. Solid: Describes whether an object is in a solid state. Occupied: Indicates whether an object (e.g., a container) is occupied or contains something. In pieces: Refers to whether an object is intact or has been broken into pieces. Select one of them after reading the story.</p>
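          <p>A story entry in the JSON format of Section 3.4 can be read with the standard library. The sketch below is illustrative (the file path is hypothetical); it collects, for each implausible story, the breakpoint sentence and the evidence sentences indexed by confl_sents:</p>

```python
import json

def load_implausible_conflicts(path):
    """Map each implausible example_id to its breakpoint and evidence sentences."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    conflicts = {}
    for example_id, story in data.items():
        if not story["plausible"]:
            conflicts[example_id] = {
                # The breakpoint field indexes the sentence where the story breaks.
                "breakpoint": story["sentences"][story["breakpoint"]],
                # confl_sents lists the indices of the evidence sentences.
                "evidence": [story["sentences"][i] for i in story["confl_sents"]],
            }
    return conflicts
```

          <p>For story 0-C0, this pairs the breakpoint "Marco ha preso il latte dal frigo." with the evidence "Marco ha chiuso il frigo.".</p>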
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Metrics</title>
      <p>The metrics involved in our tasks for the
GITA4CALAMITA benchmark are the following
ones:
• Accuracy assesses the traditional measure of end
task accuracy, which quantifies the proportion
of testing examples where plausible stories and
implausible stories are accurately identified.
• Consistency measures the proportion of testing
examples where not only the implausible story is
correctly identified, but also the conflicting
sentence pair for the implausible story is accurately
identified. The aim is to demonstrate the model’s
consistency in recognizing conflicts when
reasoning about plausibility.
• Verifiability evaluates the proportion of testing
examples where not only the implausible story
and the conflicting sentence pair for the
implausible story are correctly identified, but also the
underlying physical states that contribute to the
conflict are accurately identified. This
demonstrates that the detected conflict can be validated
through a correct understanding of the
underlying implausible change of physical states.</p>
      <p>We select some examples from our GITA4CALAMITA dataset to be used as few-shot examples. For some of the tests we randomly select the examples; for others, we base our choice on their variability. We select stories where all possible combinations of conflicting sentences occur; at the same time, within the selected stories we try to include most of the annotated physical states.</p>
      <p>3.6. Detailed data statistics</p>
      <p>The GITA4CALAMITA dataset is an Italian test set composed of a total of 356 stories. The statistics of the GITA4CALAMITA dataset are in Table 2.</p>
      <p>Taking into consideration the three different metrics, in Table 3 we report the results on our test set. We perform experiments using the base and instruct Llama 3.1, Gemma 2 and Mistral models of various sizes. Each metric is obtained from a different task, where models are evaluated on the instances that were guessed correctly in the previous tasks. All tasks are evaluated in a 3-shot setting, using random examples from the test set. For models that support a system prompt (Llama 3.1 models), the description of each task is included there; for models that do not support it (Gemma 2 and Mistral models), the task description is included in the first user input. Each few-shot instance is formatted as a multi-turn conversation between user and assistant. Next, we describe the main findings from these results.</p>
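      <p>The few-shot formatting described above can be sketched as follows. This is an illustrative sketch of the assumed chat-message layout, not the authors' harness:</p>

```python
# Illustrative sketch: packing the task description and few-shot examples
# into a multi-turn conversation. For models with a system role the task
# description goes there; otherwise it is prepended to the first user turn.

def build_conversation(task_description, shots, query, supports_system=True):
    messages = []
    if supports_system:
        messages.append({"role": "system", "content": task_description})
    for story, answer in shots:  # each few-shot example is one user/assistant exchange
        messages.append({"role": "user", "content": story})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": query})
    if not supports_system:
        # No system role (e.g., Gemma 2, Mistral): the first user turn
        # carries the task description.
        first_user = next(m for m in messages if m["role"] == "user")
        first_user["content"] = task_description + "\n\n" + first_user["content"]
    return messages
```

      <p>With three shots this yields a seven-turn conversation for system-prompt models, ending with the user turn that holds the story to be judged.</p>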
      <p>Table 2 reports the statistics of the GITA4CALAMITA dataset: 117 plausible stories, 122 implausible stories (Order) and 117 implausible stories (Cloze), for a total of 356 stories.</p>
      <p>Model Size and Performance: Generally, larger models (e.g., Llama-3.1 70B) outperform smaller models across the metrics. The 70B Llama-3.1 models show improvements over their 8B counterparts, particularly in consistency and verifiability. Gemma 2 models also show improvements when bigger models are used. There are two exceptions in the case of accuracy: Gemma2-Instruct 9B and Llama-3.1-Instruct 8B achieve better results than their bigger counterparts Gemma2 27B and Llama 3.1 70B. They also outperform the base models.</p>
      <p>Table 3 reports Accuracy, Consistency and Verifiability, each broken down into Cloze, Order, Plausible and Overall scores.</p>
      <p>Instruction Tuning Effects: Instruction-tuned versions (e.g., Gemma-2-Instruct, Llama-3.1-Instruct) typically outperform their base counterparts, with exceptions such as Order accuracy for Llama 3.1 70B and Gemma 2 9B. However, Mistral-V0.3-Instruct is very similar to or worse than the base model and is generally more biased: it tends to classify the stories as plausible, and it performs better on Cloze than on Order.</p>
    </sec>
    <sec id="sec-6">
      <title>Cloze, Order and Plausible</title>
      <p>Most models perform generally better on Cloze examples than on Order examples. This is consistent across models and metrics. Models are generally better on Cloze and Order than on Plausible. This could be explained by the bias of the models to answer true or false when they are asked if the story is plausible. Models also see twice as many implausible few-shot examples, which could also cause them to give that answer more frequently.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Limitations</title>
      <p>This study has some limitations that should be acknowledged. Firstly, only one prompt was tested for each task, which may not fully capture the potential variability in performance. Additionally, the models used were multilingual but not specifically tailored for the Italian language, potentially affecting the accuracy of the results for Italian-specific tasks. Furthermore, the dataset used in this study was limited to stories within the household domain, which may not generalize well to other contexts.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Ethical issues</title>
      <p>The dataset contains stories that may prototypically
occur in Italian households. While most of these narratives
are likely to be familiar to a broad audience, people from
diferent cultural backgrounds may find some of the
stories less frequent.
• DeepR3 (TED2021-130295B-C31) funded by</p>
      <p>MCIN/AEI/10.13039/501100011033 and European</p>
      <p>Union NextGeneration EU/PRTR.
• Disargue (TED2021-130810B-C21)</p>
      <p>MCIN/AEI/10.13039/501100011033 and
European Union NextGenerationEU/PRTR.
• DeepKnowledge (PID2021-127777OB-C21)</p>
      <p>MCIN/AEI/10.13039/501100011033 and by</p>
      <p>FEDER, EU.
• Ixa group A type research group (IT1570-22)</p>
      <p>Basque Government
• IKER-GAITU project 11:4711:23:410:23/0808 by</p>
      <p>Basque Government
(2023). URL: https://doi.org/10.1145/3615355. doi:10. 2021, pp. 4902–4918. URL: https://aclanthology.org/
1145/3615355, just Accepted. 2021.findings-emnlp.422. doi:10.18653/v1/2021.
[5] T. Linzen, How Can We Accelerate Progress To- findings- emnlp.422.</p>
      <p>wards Human-like Linguistic Generalization?, in: [12] N. Mostafazadeh, M. Roth, A. Louis, N.
ChamProceedings of the 58th Annual Meeting of the As- bers, J. Allen, LSDSem 2017 Shared Task: The
sociation for Computational Linguistics, Associa- Story Cloze Test, in: Proceedings of the 2nd
tion for Computational Linguistics, Online, 2020, Workshop on Linking Models of Lexical,
Sentenpp. 5210–5217. URL: https://aclanthology.org/2020. tial and Discourse-level Semantics, Association for
acl-main.465. doi:10.18653/v1/2020.acl- main. Computational Linguistics, Valencia, Spain, 2017,
465. pp. 46–51. URL: https://aclanthology.org/W17-0906.
[6] E. M. Bender, A. Koller, Climbing towards NLU: doi:10.18653/v1/W17- 0906.</p>
      <p>On Meaning, Form, and Understanding in the [13] N. Mostafazadeh, N. Chambers, X. He, D. Parikh,
Age of Data, in: Proceedings of the 58th An- D. Batra, L. Vanderwende, P. Kohli, J. Allen, A
Cornual Meeting of the Association for Computa- pus and Cloze Evaluation for Deeper Understanding
tional Linguistics, Association for Computational of Commonsense Stories, in: Proceedings of the
Linguistics, Online, 2020, pp. 5185–5198. URL: 2016 Conference of the North American Chapter
https://aclanthology.org/2020.acl-main.463. doi:10. of the Association for Computational Linguistics:
18653/v1/2020.acl- main.463. Human Language Technologies, Association for
[7] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Fran- Computational Linguistics, San Diego, California,
cis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Ri- 2016, pp. 839–849. URL: https://aclanthology.org/
naldi, D. Scalena, CALAMITA: Challenge the Abili- N16-1098. doi:10.18653/v1/N16- 1098.
ties of LAnguage Models in ITAlian, in:
Proceedings of the 10th Italian Conference on
Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[8] G. Pensa, B. Altuna, I. Gonzalez-Dios, A Multi-layered Approach to Physical Commonsense Understanding: Creation and Evaluation of an Italian Dataset, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 819-831. URL: https://aclanthology.org/2024.lrec-main.74.
[9] Q. Gao, M. Doering, S. Yang, J. Chai, Physical Causality of Action Verbs in Grounded Language Understanding, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1814-1824. URL: https://aclanthology.org/P16-1171. doi:10.18653/v1/P16-1171.
[10] A. Bosselut, O. Levy, A. Holtzman, C. Ennis, D. Fox, Y. Choi, Simulating Action Dynamics with Neural Process Networks, CoRR abs/1711.05313 (2017). URL: http://arxiv.org/abs/1711.05313. arXiv:1711.05313.
[11] S. Storks, Q. Gao, Y. Zhang, J. Chai, Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Annotations in the dataset</title>
      <p>These are the attributes that encode the metadata and linguistic information in the GITA dataset:
• story_id: refers to the number of the story, shared by a plausible story and its implausible counterparts.
• worker_id: refers to the name assigned to a specific worker during the creation of the story.
• type: refers to cloze or order; it is a label used only in implausible stories.
• idx: used in the implausible dataset when there is more than one implausible story for a given story number; for example, if we created several implausible versions of a plausible story (by changing the order of its sentences more than once), the index number indicates which implausible example is meant.
• aug: refers to possible automatic data augmentation techniques that could be applied in future work to mitigate overfitting.
• actor: refers to the human agent of the story.
• location: refers to the room where the story takes place.
• objects: refers to all the inanimate entities found in each story.
• sentences: includes the 5 sentences of the story.
• length: refers to the number of sentences in each story.
• example_id: corresponds to the story number and includes letters for implausible stories.
• plausible: is TRUE when the story is plausible and FALSE when it is implausible.
• breakpoint: refers to the sentence at which the story becomes implausible, i.e. where the conflict becomes evident; in plausible stories the breakpoint is always -1.
• confl_sents: refers to the other sentence in the story that, together with the breakpoint sentence, makes the story implausible; in plausible stories this field is blank.
• confl_pairs: refers to the conflict pair of sentences, gathering the two previous labels; in plausible stories this field is blank.
• states: includes all the physical state annotations for all the stories.</p>
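Assuming a JSON-style serialization (the paper does not specify the exact file format, and the concrete field values here are illustrative), a single plausible story record built from the attributes above might look like the following sketch, together with the sanity checks the annotation scheme implies:

```python
# Hypothetical sketch of one plausible GITA story record; attribute names
# follow the list above, values are illustrative (taken from Appendix B).
record = {
    "story_id": 0,
    "worker_id": "GAP",
    "type": None,          # only cloze/order in implausible stories
    "idx": None,           # only set when several implausible versions exist
    "aug": False,
    "actor": "Marco",
    "location": "cucina",
    "objects": ["frigo", "latte", "tazza", "cucchiaio"],
    "sentences": [
        "Marco ha aperto il frigo.",
        "Marco ha preso il latte.",
        "Marco ha preso la tazza.",
        "Marco ha preso il cucchiaio.",
        "Marco ha messo il cucchiaio nella tazza.",
    ],
    "length": 5,
    "example_id": "0",
    "plausible": True,
    "breakpoint": -1,      # always -1 in plausible stories
    "confl_sents": [],     # blank in plausible stories
    "confl_pairs": [],     # blank in plausible stories
}

def check_record(r):
    """Invariants implied by the annotation scheme described above."""
    assert r["length"] == len(r["sentences"])
    if r["plausible"]:
        assert r["breakpoint"] == -1
        assert not r["confl_sents"] and not r["confl_pairs"]
    else:
        assert r["breakpoint"] >= 0
    return True

check_record(record)
```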
    </sec>
    <sec id="sec-9">
      <title>B. Annotation environment</title>
      <p>story_id (NO quotes, NO letter, only number):
0
worker_id (in quotes):
‘GAP’
type (null for positive, order, or cloze, in quotes):
null
idx (null, or same as NUMBER in story number):
null
aug (false):
false
location (in quotes):
‘cucina’
sentences:
Marco ha aperto il frigo. Marco ha preso il latte. Marco ha preso la tazza. Marco ha preso il cucchiaio. Marco ha messo il cucchiaio nella tazza.
length:
5
example_id (same as story number, in quotes):
‘0’
breakpoint:
-1
confl_sents (type only []):
[]</p>
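The order-type implausible stories described in Appendix A are derived from a plausible record like the one above by reordering its sentences and annotating the resulting conflict. The construction in the paper was done by annotators; the helper below is only a hypothetical sketch of the bookkeeping involved, where the breakpoint and conflicting sentence are supplied by the annotator and the conflict-pair format is an assumption:

```python
import copy

# Minimal plausible record (illustrative; see the full template above).
plausible = {
    "story_id": 0, "type": None, "idx": None, "plausible": True,
    "sentences": [
        "Marco ha aperto il frigo.",
        "Marco ha preso il latte.",
    ],
    "example_id": "0",
    "breakpoint": -1, "confl_sents": [], "confl_pairs": [],
}

def make_order_variant(record, i, j, breakpoint, confl_sent, idx=0):
    """Swap sentences i and j and attach the conflict annotations an
    annotator would supply (sketch; the real annotation is manual)."""
    bad = copy.deepcopy(record)
    bad["sentences"][i], bad["sentences"][j] = bad["sentences"][j], bad["sentences"][i]
    bad.update(plausible=False, type="order", idx=idx,
               breakpoint=breakpoint,
               confl_sents=[confl_sent],
               confl_pairs=[[confl_sent, breakpoint]],   # assumed pair format
               example_id=record["example_id"] + "a")    # letters mark implausible stories
    return bad

# Taking the milk before opening the fridge is physically implausible.
bad = make_order_variant(plausible, 0, 1, breakpoint=1, confl_sent=0)
```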
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>J.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>K. C.-C.</given-names> <surname>Chang</surname></string-name>, <article-title>Towards Reasoning in Large Language Models: A Survey</article-title>, in: <source>Findings of the Association for Computational Linguistics: ACL 2023</source>, Association for Computational Linguistics, Toronto, Canada, <year>2023</year>, pp. <fpage>1049</fpage>-<lpage>1065</lpage>. URL: https://aclanthology.org/2023.findings-acl.67. doi:10.18653/v1/2023.findings-acl.67.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>K.</given-names> <surname>Sakaguchi</surname></string-name>, <string-name><given-names>R. L.</given-names> <surname>Bras</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Bhagavatula</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Choi</surname></string-name>, <article-title>WinoGrande: An Adversarial Winograd Schema Challenge at Scale</article-title>, <source>Commun. ACM</source> <volume>64</volume> (<year>2021</year>) <fpage>99</fpage>-<lpage>106</lpage>. URL: https://doi.org/10.1145/3474381. doi:10.1145/3474381.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>D.</given-names> <surname>Pessach</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Shmueli</surname></string-name>, <article-title>A Review on Fairness in Machine Learning</article-title>, <source>ACM Comput. Surv.</source> <volume>55</volume> (<year>2022</year>). URL: https://doi.org/10.1145/3494672. doi:10.1145/3494672.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>E.</given-names> <surname>Davis</surname></string-name>, <article-title>Benchmarks for Automated Commonsense Reasoning: A Survey</article-title>, <source>ACM Comput. Surv.</source></mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>