<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulia Pensa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ekhi Azurmendi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julen Etxaniz</string-name>
          <email>julen.etxaniz@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Begoña Altuna</string-name>
          <email>begona.altuna@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Itziar Gonzalez-Dios</string-name>
          <email>itziar.gonzalezd@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>In the context of the CALAMITA Challenge, we investigate the physical commonsense reasoning capabilities of large language models (LLMs) and introduce a methodology to assess their understanding of the physical world. To this end, we use a test set designed to evaluate physical commonsense reasoning in LLMs for the Italian language. We present a tiered dataset, named the Graded Italian Annotated dataset (GITA), which is written and annotated by a professional linguist. This dataset enables us to focus on three distinct levels of commonsense understanding. Our benchmark aims to evaluate three specific tasks: identifying plausible and implausible stories within our dataset, identifying the conflict that generates an implausible story, and identifying the physical states that make a story implausible. We perform these tasks using Llama 3, Gemma 2 and Mistral. Our findings reveal that, although the models may excel at high-level classification tasks, their reasoning is inconsistent and unverifiable, as they fail to capture intermediate evidence.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-layered</kwd>
        <kwd>Physical commonsense reasoning</kwd>
        <kwd>large language models</kwd>
        <kwd>Italian benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Challenge: Introduction and Motivation</title>
      <p>Humans have the ability to comprehend the physical world and the events that transpire within it. This capability is a crucial component of human intelligence, enabling us to reason about our environment, anticipate future occurrences, and navigate our surroundings effortlessly. Recently, there has been notable advancement in the development of large language models (LLMs) that can produce human-like language and execute a variety of language-related tasks.</p>
      <p>
        LLMs have exhibited promising outcomes in grasping common sense in particular situations [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Nevertheless, it is widely recognized that the most precise evaluation of their capabilities is attained when assessing their performance on specific end tasks [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The evaluation often emphasizes the capacity of LLMs to replicate relatively straightforward tasks, rather than their authentic understanding.
      </p>
      <p>In this paper, we present GITA4CALAMITA, the Graded Italian Annotated dataset for the CALAMITA challenge [7]. GITA4CALAMITA is an adapted version of the GITA dataset proposed in [8]. In particular, we decided to revise the physical states annotation and adapt it to this challenge. The first version of the GITA dataset is available in our repository under the CC BY-NC-SA 4.0 license. The GITA4CALAMITA dataset is manually compiled by a professional linguist, which allows for this multi-layered evaluation of the reasoning process. With the creation of an Italian dataset we gain the linguistic and cultural perspective of Italian in commonsense research in Natural Language Processing.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Challenge: Description</title>
      <p>Our aim in this challenge is to assess the understanding of
physical commonsense in LLMs for Italian. We configure
our assessment proposal in the following terms:
1. given an original dataset of plausible/implausible
stories related to physical commonsense, systems
must identify the plausible and implausible
stories;
2. systems must recognize the conflicting sentences
that generate the conflict in implausible stories;
3. systems must spot the underlying physical states
that cause conflict in implausible stories.</p>
      <p>Story classification: The end task revolves around
determining the plausibility of two stories. This
determination is based on the conflicts detected within the
two stories. By considering the presence of conflicts,
the model can assess the viability and coherence of each
story, facilitating the classification of the more plausible
one.</p>
      <p>By incorporating physical state classification, conflict
detection, and story classification, we analyze the aspects
of coherent reasoning, supported by evidence-driven
analysis.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Data description</title>
      <p>The recognition of plausible/implausible stories is the end task envisaged in this benchmark, and it must be justified by the second-level and third-level steps. In Figure 1 we present a story pair from the GITA4CALAMITA dataset and the relation between the layers of annotation. Story A is a plausible story; Story B is the corresponding implausible story, in which the first and the second sentences are in conflict: Marco closes the refrigerator and cannot take the milk out of it. In the right part of the figure we can see the reasoning steps that the system must follow and resolve. This example is presented in English for clarity, but our entire dataset is in Italian.</p>
      <p>The GITA4CALAMITA dataset is composed of plausible and implausible stories. To compose the dataset, we focused on concrete actions that could be visualized in the physical world, avoiding mental actions such as “to think” or “to like”. We created 5-sentence stories, giving context and requiring reasoning over multiple sentences. In all the stories, we avoided nonsensical sentences; in fact, each sentence is plausible alone, but could be implausible if associated with another specific sentence in an implausible story. With these characteristics, the task requires reasoning over the entire context.</p>
      <p>An essential part of our evaluation process is constituted by the presence of physical state annotation. Systems must identify the underlying physical states that make a story not plausible in our physical world. During the creation of this dataset, we took into account 14 physical attributes that were included in the annotation phase, and we composed stories that contained those attributes. Following the work of [9] and [10], these are the 14 physical states that we wanted to have in our stories: location, conscious, dressed, wet, exist, clean, power, functional, in pieces, open, temperature, solid, occupied, edible.</p>
      <p>We introduce a series of tasks that constitute a human-interpretable reasoning process, supported by a chain of evidence, reflecting the assessment methodology outlined above. To explain this approach, we present the tasks from the deepest to the shallowest, mirroring human reasoning.</p>
      <p>Physical state classification: Leveraging our physical state annotations, systems must recognize the involved physical states in the conflicting sentences of implausible stories. If we look at the example in Figure 1, we are able to identify the problematic physical state “open” as the cause of implausibility.</p>
      <p>Conflict detection: Next, the task of conflict detection entails identifying sentence pairs of the form Si → Sj. Here, Sj represents the breakpoint, indicating the point at which the story becomes implausible based on the given context. Si serves as the evidence that explains the breakpoint, typically causing a conflicting world state.</p>
      <p>Table 1 presents the plausible story, its Order variant and its Cloze variant, with English translations:
Plausible: (1) Marco ha aperto il frigo. [Marco opened the refrigerator.] (2) Marco ha preso il latte dal frigo. [Marco took the milk from the refrigerator.] (3) Marco ha preso la tazza. [Marco took the cup.] (4) Marco ha versato il latte nella tazza. [Marco poured the milk into the cup.] (5) Marco ha bevuto il latte. [Marco drank the milk.]
Order: (1) Marco ha preso il latte dal frigo. (2) Marco ha aperto il frigo. (3)–(5) as in the plausible story.
Cloze: (1) Marco ha chiuso il frigo. [Marco closed the refrigerator.] (2)–(5) as in the plausible story.</p>
      <sec id="sec-4-15">
        <title>3.1. Dataset creation</title>
        <sec id="sec-4-15-1">
          <p>In the first two rows of Table 1 we can see an example of a plausible story from the GITA4CALAMITA dataset together with the English translation. In this example, the human actor is Marco, and the five sentences are ordered in the required way: the action of opening something, picking something up and using it. We can see that some of the previously listed physical states appear: Marco is conscious because he is doing something, the refrigerator is open because the actor can take something out of it, and the cup is not occupied by anything and can be functional.</p>
          <p>We aimed to minimize subjectivity and limit potential confounding factors from complex language usage. By using simple language, we were able to shift our focus away from linguistic processing and semantic phenomena, allowing us to concentrate on examining machines’ reasoning abilities, particularly their physical commonsense understanding. Consequently, we created our simple sentences in a straightforward declarative structure, typically starting with the agent of the story, followed by a verb, a direct object and, optionally, an indirect object.</p>
          <p>Implausible stories are built upon the plausible ones, preserving the same actor and objects; in doing so we ensured that implausible variations remained coherent and believable, and we avoided nonsensical information. To create implausible stories, we implemented two different methods:
1. we switched the order of two sentences;
2. we substituted a plausible sentence with an implausible one.</p>
          <p>3.1.1. Order implausible stories</p>
          <p>The plausible stories only work in the causal sequence that we created. In the first row of Table 1, there is an example of a plausible story. In the third row, we see the corresponding implausible story for the Order dataset, in which Marco first takes the milk out of the refrigerator and then opens the refrigerator, generating a physically impossible situation: it is not possible to take something out of a closed refrigerator. By switching the first and the second sentences, we created an implausible story. In the entire dataset, we decided to generate implausible stories by changing the order of only two sentences per story.</p>
          <p>3.1.2. Cloze implausible stories</p>
          <p>The second approach involves the substitution of a sentence from the plausible story with a new sentence. Although the new sentence itself is not inherently implausible, its placement within the sequence renders it implausible. In Table 1, the first sentence of line F (Cloze), in the fifth row, was changed: Marco closes the refrigerator before taking out the milk. Again, the action is physically impossible: if the refrigerator is closed, nothing can be taken out of it.</p>
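          <p>The two methods above can be sketched programmatically. The following is an illustrative sketch, not the authors' code; the helper names are our own:</p>

```python
# Illustrative sketch of the two methods for deriving implausible stories
# from a plausible one: swapping two sentences (Order) and substituting
# one sentence (Cloze). Helper names are hypothetical, not from the paper.

def make_order_variant(sentences, i, j):
    """Method 1: swap sentences i and j to build an Order implausible story."""
    variant = list(sentences)
    variant[i], variant[j] = variant[j], variant[i]
    return variant

def make_cloze_variant(sentences, i, new_sentence):
    """Method 2: replace sentence i to build a Cloze implausible story."""
    variant = list(sentences)
    variant[i] = new_sentence
    return variant

plausible = [
    "Marco ha aperto il frigo.",               # Marco opened the refrigerator.
    "Marco ha preso il latte dal frigo.",      # Marco took the milk from the refrigerator.
    "Marco ha preso la tazza.",                # Marco took the cup.
    "Marco ha versato il latte nella tazza.",  # Marco poured the milk into the cup.
    "Marco ha bevuto il latte.",               # Marco drank the milk.
]

# Order: taking the milk now precedes opening the refrigerator.
order_story = make_order_variant(plausible, 0, 1)
# Cloze: the refrigerator is closed before the milk is taken out.
cloze_story = make_cloze_variant(plausible, 0, "Marco ha chiuso il frigo.")
```

          <p>Both variants keep the actor and objects of the plausible story; only the position or identity of one sentence changes.</p>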
          <p>These two methods resulted in two different partitions of our dataset: the Order dataset of implausible stories and the Cloze dataset of implausible stories, respectively.</p>
          <p>3.2. Origin of data</p>
          <p>GITA4CALAMITA is a new version of [8], which is based on [11]. Our main objective was to create a manually annotated Italian dataset to assess a pre-trained language model on tiered physical commonsense tasks. To create the stories, we took inspiration from the Story Cloze Test [12] and the ROCStories Corpora [13]. The Story Cloze Test compiles four-sentence stories with a missing ending so that a system chooses the most appropriate conclusion; the ROCStories Corpora is composed of five-sentence stories about everyday life for story generation.</p>
          <p>3.3. Annotation details</p>
          <p>GITA4CALAMITA is annotated on three levels. On the first level, we annotated the plausibility/implausibility of a story with TRUE or FALSE. On the second level, in implausible stories, we indicated between which sentences the conflict was, and on the third level we labelled the involved physical states in each sentence.</p>
          <p>In the dataset, a plausible story is identified using a
story number, while implausible stories are identified
using the same story number as the plausible version, but
with an additional C or O after the story number, where
the letter C refers to the Cloze dataset, and the letter O
refers to the Order dataset. Each story has been
annotated using these elements: story id, worker id, actor of
the story, objects of the story, physical states, sentences
of the story, as well as number of sentences, and
conflicting sentences, among others. The complete list and the
specific meaning of each element are in Appendix A.</p>
          <p>In each implausible story, we annotated the physical state that caused a conflict between two sentences. We annotated both Order and Cloze implausible stories according to the corresponding physical state involved. If we consider the stories in Table 1, both implausible stories (C and O) are annotated using the physical state “open”. In fact, in both implausible stories the conflict is related to the openness of the refrigerator: in both cases the refrigerator appears closed when Marco tries to take the milk out of it. There are cases where one plausible story has two implausible stories that are implausible for two different reasons; hence, the annotated physical state is different.</p>
          <p>To ensure consistency and reduce human effort, we developed a custom environment and a Python script to streamline the annotation process. This semi-automated annotation process helped us process sentences from different story types, extract entities and actors, and organize them for manual annotation. The script provided a user-friendly terminal interface, and it is available in our repository. In terms of annotation efficiency, manually annotating one plausible story and two implausible ones typically took around 50 minutes. However, using our semi-automated annotation interface, we were able to complete the same task in approximately 20 minutes.</p>
          <p>Consequently, instead of the estimated 100 hours for annotating the entire dataset, we reduced the time to around 40 hours. Additionally, some annotations required review and occasional revisions; hence, we estimated that the overall effort was approximately 50-55 hours. An example of a complete annotation can be found in Appendix B.</p>
          <p>3.4. Data format</p>
          <p>The GITA4CALAMITA dataset was created and annotated in JSON format. The following example is story 0-C0 of our dataset, the first implausible Cloze story.</p>
          <preformat>{
  "0-C0": {
    "story_id": 0,
    "worker_id": "GAP",
    "type": "cloze",
    "idx": 0,
    "aug": false,
    "actor": "Marco",
    "location": "cucina",
    "objects": "frigo, latte, tazza, cucchiaio",
    "sentences": [
      "Marco ha chiuso il frigo.",
      "Marco ha preso il latte dal frigo.",
      "Marco ha preso la tazza.",
      "Marco ha preso il cucchiaio.",
      "Marco ha messo il cucchiaio nella tazza."
    ],
    "length": 5,
    "example_id": "0-C0",
    "plausible": false,
    "breakpoint": 1,
    "confl_sents": [0],
    "confl_pairs": [0, 1]
  }
}</preformat>
          <p>3.5. Example of prompts used for zero and/or few shots</p>
          <p>For each of the three proposed tasks we use a different prompt:</p>
          <p>• Task 1: Please read the following story and answer if the story is plausible taking into account the order of the events. Please answer with true or false.</p>
          <p>• Task 2: The following story is implausible. Identify the breakpoint, and then select the sentence responsible for the implausibility. Please identify the breakpoint sentence and the conflicting sentence.</p>
          <p>• Task 3: The following story is implausible. Identify the physical state that causes the conflict in the story. These are the descriptions of each physical state: Power: Indicates whether an object is powered or not, relevant for electrical devices. Location: Refers to the spatial position of an entity, either human or object. Exist: Denotes whether an object is present or has disappeared. Clean: Refers to the cleanliness of an entity, indicating whether it is clean or dirty. Edible: Identifies whether an object is fit for consumption. Wet: Denotes whether an object or person is in a wet or dry state. Functional: Refers to whether an object is in working condition or broken. Wearing: Applies to humans, indicating whether they are dressed or not. Open: Refers to whether an object (e.g., a door or container) is open or closed. Conscious: Denotes whether a human is conscious or unconscious. Temperature: Refers to the relative temperature of an entity, e.g., hot or cold. Solid: Describes whether an object is in a solid state. Occupied: Indicates whether an object (e.g., a container) is occupied or contains something. In pieces: Refers to whether an object is intact or has been broken into pieces. Select one of them after reading the story.</p>
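          <p>A story entry in the JSON format of Section 3.4 can be read with the standard library. The sketch below is illustrative (the file path is hypothetical); it collects, for each implausible story, the breakpoint sentence and the evidence sentences indexed by confl_sents:</p>

```python
import json

def load_implausible_conflicts(path):
    """Map each implausible example_id to its breakpoint and evidence sentences."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    conflicts = {}
    for example_id, story in data.items():
        if not story["plausible"]:
            conflicts[example_id] = {
                # The breakpoint field indexes the sentence where the story breaks.
                "breakpoint": story["sentences"][story["breakpoint"]],
                # confl_sents lists the indices of the evidence sentences.
                "evidence": [story["sentences"][i] for i in story["confl_sents"]],
            }
    return conflicts
```

          <p>For story 0-C0, this pairs the breakpoint "Marco ha preso il latte dal frigo." with the evidence "Marco ha chiuso il frigo.".</p>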
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Metrics</title>
      <p>The metrics involved in our tasks for the
GITA4CALAMITA benchmark are the following
ones:
• Accuracy assesses the traditional measure of end
task accuracy, which quantifies the proportion
of testing examples where plausible stories and
implausible stories are accurately identified.
• Consistency measures the proportion of testing
examples where not only the implausible story is
correctly identified, but also the conflicting
sentence pair for the implausible story is accurately
identified. The aim is to demonstrate the model’s
consistency in recognizing conflicts when
reasoning about plausibility.
• Verifiability evaluates the proportion of testing
examples where not only the implausible story
and the conflicting sentence pair for the
implausible story are correctly identified, but also the
underlying physical states that contribute to the
conflict are accurately identified. This
demonstrates that the detected conflict can be validated
through a correct understanding of the
underlying implausible change of physical states.</p>
      <p>We select some examples from our GITA4CALAMITA dataset to be used as few-shot examples. For some of the tests we randomly select the examples; for others, we base our choice on their variability. We select stories where all possible combinations of conflicting sentences occur; at the same time, within the selected stories we try to include most of the annotated physical states.</p>
      <p>3.6. Detailed data statistics</p>
      <p>The GITA4CALAMITA dataset is an Italian test set composed of a total of 356 stories. The statistics of the GITA4CALAMITA dataset are in Table 2.</p>
      <p>Taking into consideration the three different metrics, in Table 3 we report the results on our test set. We perform experiments using the base and instruct Llama 3.1, Gemma 2 and Mistral models of various sizes. Each metric is obtained from a different task, where models are evaluated on the instances that were guessed correctly in the previous tasks. All tasks are evaluated in a 3-shot setting, using random examples from the test set. For models that support a system prompt (Llama 3.1 models), the description of each task is included there; for models that do not support it (Gemma 2 and Mistral models), the task description is included in the first user input. Each few-shot instance is formatted as a multi-turn conversation between user and assistant. Next, we describe the main findings from these results.</p>
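      <p>The few-shot formatting described above can be sketched as follows. This is an illustrative sketch of the assumed chat-message layout, not the authors' harness:</p>

```python
# Illustrative sketch: packing the task description and few-shot examples
# into a multi-turn conversation. For models with a system role the task
# description goes there; otherwise it is prepended to the first user turn.

def build_conversation(task_description, shots, query, supports_system=True):
    messages = []
    if supports_system:
        messages.append({"role": "system", "content": task_description})
    for story, answer in shots:  # each few-shot example is one user/assistant exchange
        messages.append({"role": "user", "content": story})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": query})
    if not supports_system:
        # No system role (e.g., Gemma 2, Mistral): the first user turn
        # carries the task description.
        first_user = next(m for m in messages if m["role"] == "user")
        first_user["content"] = task_description + "\n\n" + first_user["content"]
    return messages
```

      <p>With three shots this yields a seven-turn conversation for system-prompt models, ending with the user turn that holds the story to be judged.</p>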
      <p>Table 2 reports the statistics of the GITA4CALAMITA dataset: 117 plausible stories, 122 implausible stories (Order) and 117 implausible stories (Cloze), for a total of 356 stories.</p>
      <p>Model Size and Performance: Generally, larger models (e.g., Llama-3.1 70B) outperform smaller models across the metrics. The 70B Llama-3.1 models show improvements over their 8B counterparts, particularly in consistency and verifiability. Gemma 2 models also show improvements when bigger models are used. There are two exceptions in the case of accuracy: Gemma2-Instruct 9B and Llama-3.1-Instruct 8B achieve better results than their bigger counterparts Gemma2 27B and Llama 3.1 70B. They also outperform the base models.</p>
      <p>Table 3 reports Accuracy, Consistency and Verifiability, each broken down into Cloze, Order, Plausible and Overall scores.</p>
      <p>Instruction Tuning Effects: Instruction-tuned versions (e.g., Gemma-2-Instruct, Llama-3.1-Instruct) typically outperform their base counterparts, with exceptions such as Order accuracy for Llama 3.1 70B and Gemma 2 9B. However, Mistral-V0.3-Instruct is very similar to or worse than the base model and is generally more biased: it tends to classify the stories as plausible, and it performs better on Cloze than on Order.</p>
    </sec>
    <sec id="sec-6">
      <title>Cloze, Order and Plausible</title>
      <p>Most models perform generally better on Cloze examples than on Order examples. This is consistent across models and metrics. Models are generally better on Cloze and Order than on Plausible. This could be explained by the bias of the models to answer true or false when they are asked if the story is plausible. Models also see twice as many implausible few-shot examples, which could also cause them to give that answer more frequently.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Limitations</title>
      <p>This study has some limitations that should be acknowledged. Firstly, only one prompt was tested for each task, which may not fully capture the potential variability in performance. Additionally, the models used were multilingual but not specifically tailored for the Italian language, potentially affecting the accuracy of the results for Italian-specific tasks. Furthermore, the dataset used in this study was limited to stories within the household domain, which may not generalize well to other contexts.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Ethical issues</title>
      <p>The dataset contains stories that may prototypically
occur in Italian households. While most of these narratives
are likely to be familiar to a broad audience, people from
diferent cultural backgrounds may find some of the
stories less frequent.
• DeepR3 (TED2021-130295B-C31) funded by</p>
      <p>MCIN/AEI/10.13039/501100011033 and European</p>
      <p>Union NextGeneration EU/PRTR.
• Disargue (TED2021-130810B-C21)</p>
      <p>MCIN/AEI/10.13039/501100011033 and
European Union NextGenerationEU/PRTR.
• DeepKnowledge (PID2021-127777OB-C21)</p>
      <p>MCIN/AEI/10.13039/501100011033 and by</p>
      <p>FEDER, EU.
• Ixa group A type research group (IT1570-22)</p>
      <p>Basque Government
• IKER-GAITU project 11:4711:23:410:23/0808 by</p>
      <p>Basque Government
(2023). URL: https://doi.org/10.1145/3615355. doi:10. 2021, pp. 4902–4918. URL: https://aclanthology.org/
1145/3615355, just Accepted. 2021.findings-emnlp.422. doi:10.18653/v1/2021.
[5] T. Linzen, How Can We Accelerate Progress To- findings- emnlp.422.</p>
      <p>wards Human-like Linguistic Generalization?, in: [12] N. Mostafazadeh, M. Roth, A. Louis, N.
ChamProceedings of the 58th Annual Meeting of the As- bers, J. Allen, LSDSem 2017 Shared Task: The
sociation for Computational Linguistics, Associa- Story Cloze Test, in: Proceedings of the 2nd
tion for Computational Linguistics, Online, 2020, Workshop on Linking Models of Lexical,
Sentenpp. 5210–5217. URL: https://aclanthology.org/2020. tial and Discourse-level Semantics, Association for
acl-main.465. doi:10.18653/v1/2020.acl- main. Computational Linguistics, Valencia, Spain, 2017,
465. pp. 46–51. URL: https://aclanthology.org/W17-0906.
[6] E. M. Bender, A. Koller, Climbing towards NLU: doi:10.18653/v1/W17- 0906.</p>
      <p>On Meaning, Form, and Understanding in the [13] N. Mostafazadeh, N. Chambers, X. He, D. Parikh,
Age of Data, in: Proceedings of the 58th An- D. Batra, L. Vanderwende, P. Kohli, J. Allen, A
Cornual Meeting of the Association for Computa- pus and Cloze Evaluation for Deeper Understanding
tional Linguistics, Association for Computational of Commonsense Stories, in: Proceedings of the
Linguistics, Online, 2020, pp. 5185–5198. URL: 2016 Conference of the North American Chapter
https://aclanthology.org/2020.acl-main.463. doi:10. of the Association for Computational Linguistics:
18653/v1/2020.acl- main.463. Human Language Technologies, Association for
[7] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Fran- Computational Linguistics, San Diego, California,
cis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Ri- 2016, pp. 839–849. URL: https://aclanthology.org/
naldi, D. Scalena, CALAMITA: Challenge the Abili- N16-1098. doi:10.18653/v1/N16- 1098.
ties of LAnguage Models in ITAlian, in:
Proceedings of the 10th Italian Conference on
Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[8] G. Pensa, B. Altuna, I. Gonzalez-Dios, A Multi-layered Approach to Physical Commonsense Understanding: Creation and Evaluation of an Italian Dataset, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 819-831. URL: https://aclanthology.org/2024.lrec-main.74.
[9] Q. Gao, M. Doering, S. Yang, J. Chai, Physical Causality of Action Verbs in Grounded Language Understanding, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1814-1824. URL: https://aclanthology.org/P16-1171. doi:10.18653/v1/P16-1171.
[10] A. Bosselut, O. Levy, A. Holtzman, C. Ennis, D. Fox, Y. Choi, Simulating Action Dynamics with Neural Process Networks, CoRR abs/1711.05313 (2017). URL: http://arxiv.org/abs/1711.05313. arXiv:1711.05313.
[11] S. Storks, Q. Gao, Y. Zhang, J. Chai, Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Annotations in the dataset</title>
      <p>These are the attributes that encode the metadata and linguistic information in the GITA dataset:
• story_id: refers to the number of the story, shared by a plausible story and its implausible counterparts.
• worker_id: refers to the name assigned to a specific worker during the creation of the story.
• type: refers to cloze or order; it is a label used only in implausible stories.
• idx: used in the implausible dataset when there is more than one implausible story for a given story number; for example, if we created several implausible versions of a plausible story (by changing the order of its sentences more than once), the index number indicates which implausible example is meant.
• aug: refers to possible automatic data augmentation techniques that could be applied in future work to mitigate overfitting.
• actor: refers to the human agent of the story.
• location: refers to the room where the story takes place.
• objects: refers to all the inanimate entities found in each story.
• sentences: includes the 5 sentences of the story.
• length: refers to the number of sentences in each story.
• example_id: corresponds to the story number and includes letters for implausible stories.
• plausible: is TRUE when the story is plausible and FALSE when it is implausible.
• breakpoint: refers to the sentence at which the story becomes implausible, i.e. where the conflict becomes evident; in plausible stories the breakpoint is always -1.
• confl_sents: refers to the other sentence in the story that, together with the breakpoint sentence, makes the story implausible; in plausible stories this field is blank.
• confl_pairs: refers to the conflict pair of sentences, gathering the two previous labels; in plausible stories this field is blank.
• states: includes all the physical state annotations for all the stories.</p>
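Assuming a JSON-style serialization (the paper does not specify the exact file format, and the concrete field values here are illustrative), a single plausible story record built from the attributes above might look like the following sketch, together with the sanity checks the annotation scheme implies:

```python
# Hypothetical sketch of one plausible GITA story record; attribute names
# follow the list above, values are illustrative (taken from Appendix B).
record = {
    "story_id": 0,
    "worker_id": "GAP",
    "type": None,          # only cloze/order in implausible stories
    "idx": None,           # only set when several implausible versions exist
    "aug": False,
    "actor": "Marco",
    "location": "cucina",
    "objects": ["frigo", "latte", "tazza", "cucchiaio"],
    "sentences": [
        "Marco ha aperto il frigo.",
        "Marco ha preso il latte.",
        "Marco ha preso la tazza.",
        "Marco ha preso il cucchiaio.",
        "Marco ha messo il cucchiaio nella tazza.",
    ],
    "length": 5,
    "example_id": "0",
    "plausible": True,
    "breakpoint": -1,      # always -1 in plausible stories
    "confl_sents": [],     # blank in plausible stories
    "confl_pairs": [],     # blank in plausible stories
}

def check_record(r):
    """Invariants implied by the annotation scheme described above."""
    assert r["length"] == len(r["sentences"])
    if r["plausible"]:
        assert r["breakpoint"] == -1
        assert not r["confl_sents"] and not r["confl_pairs"]
    else:
        assert r["breakpoint"] >= 0
    return True

check_record(record)
```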
    </sec>
    <sec id="sec-9">
      <title>B. Annotation environment</title>
      <p>story_id (NO quotes, NO letter, only number):
0
worker_id (in quotes):
‘GAP’
type (null for positive, order, or cloze, in quotes):
null
idx (null, or same as NUMBER in story number):
null
aug (false):
false
location (in quotes):
‘cucina’
sentences:
Marco ha aperto il frigo. Marco ha preso il latte. Marco ha preso la tazza. Marco ha preso il cucchiaio. Marco ha messo il cucchiaio nella tazza.
length:
5
example_id (same as story number, in quotes):
‘0’
breakpoint:
-1
confl_sents (type only []):
[]</p>
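The order-type implausible stories described in Appendix A are derived from a plausible record like the one above by reordering its sentences and annotating the resulting conflict. The construction in the paper was done by annotators; the helper below is only a hypothetical sketch of the bookkeeping involved, where the breakpoint and conflicting sentence are supplied by the annotator and the conflict-pair format is an assumption:

```python
import copy

# Minimal plausible record (illustrative; see the full template above).
plausible = {
    "story_id": 0, "type": None, "idx": None, "plausible": True,
    "sentences": [
        "Marco ha aperto il frigo.",
        "Marco ha preso il latte.",
    ],
    "example_id": "0",
    "breakpoint": -1, "confl_sents": [], "confl_pairs": [],
}

def make_order_variant(record, i, j, breakpoint, confl_sent, idx=0):
    """Swap sentences i and j and attach the conflict annotations an
    annotator would supply (sketch; the real annotation is manual)."""
    bad = copy.deepcopy(record)
    bad["sentences"][i], bad["sentences"][j] = bad["sentences"][j], bad["sentences"][i]
    bad.update(plausible=False, type="order", idx=idx,
               breakpoint=breakpoint,
               confl_sents=[confl_sent],
               confl_pairs=[[confl_sent, breakpoint]],   # assumed pair format
               example_id=record["example_id"] + "a")    # letters mark implausible stories
    return bad

# Taking the milk before opening the fridge is physically implausible.
bad = make_order_variant(plausible, 0, 1, breakpoint=1, confl_sent=0)
```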
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>J.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>K. C.-C.</given-names> <surname>Chang</surname></string-name>, <article-title>Towards Reasoning in Large Language Models: A Survey</article-title>, in: <source>Findings of the Association for Computational Linguistics: ACL 2023</source>, Association for Computational Linguistics, Toronto, Canada, <year>2023</year>, pp. <fpage>1049</fpage>-<lpage>1065</lpage>. URL: https://aclanthology.org/2023.findings-acl.67. doi:10.18653/v1/2023.findings-acl.67.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>K.</given-names> <surname>Sakaguchi</surname></string-name>, <string-name><given-names>R. L.</given-names> <surname>Bras</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Bhagavatula</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Choi</surname></string-name>, <article-title>WinoGrande: An Adversarial Winograd Schema Challenge at Scale</article-title>, <source>Commun. ACM</source> <volume>64</volume> (<year>2021</year>) <fpage>99</fpage>-<lpage>106</lpage>. URL: https://doi.org/10.1145/3474381. doi:10.1145/3474381.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>D.</given-names> <surname>Pessach</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Shmueli</surname></string-name>, <article-title>A Review on Fairness in Machine Learning</article-title>, <source>ACM Comput. Surv.</source> <volume>55</volume> (<year>2022</year>). URL: https://doi.org/10.1145/3494672. doi:10.1145/3494672.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>E.</given-names> <surname>Davis</surname></string-name>, <article-title>Benchmarks for Automated Commonsense Reasoning: A Survey</article-title>, <source>ACM Comput. Surv.</source></mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>