<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Evaluating Models, Prompting Strategies, and Task Formats: a Case Study on the MACID Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Rinaldi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rossella Varvara</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Gregori</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bologna</institution>
          ,
          <addr-line>Via Cartoleria 5, 40124 Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Florence</institution>
          ,
          <addr-line>Via della Pergola 60, 50126 Firenze</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Via Giuseppe Verdi 26, 38122 Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Turin</institution>
          ,
          <addr-line>Corso Svizzera 185, 10149 Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In this study, we test the ability of 8 Large Language Models to discriminate closely related action concepts, based on textual descriptions or on video representations. Our aim is to understand if these models can handle the fine-grained action understanding that humans perform with ease, particularly when there are cases of action-predicate mismatches, i.e., the same verb may describe different actions, or different verbs may refer to the same action. We experiment on the MACID dataset, a dataset of actions representing "pushing" events and manually annotated with action IDs taken from the IMAGACT ontology. We evaluate how prompt complexity and task formats influence models' performance. In particular, we test three different prompts with or without examples, two task formats (binary or multiple-choice task), and two modalities (textual or visual). Results indicate that the binary task is not easier than the multiple-choice one, and that few-shot prompting generally improves models' accuracy. Moreover, LLMs perform better when helped by lexical cues: accuracy increases when actions are expressed by different verbs, whereas it is lower when actions are expressed by the same verb.</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>action concept understanding</kwd>
        <kwd>prompting strategies</kwd>
        <kwd>task definition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Understanding human action is a cornerstone of both linguistic and perceptual intelligence. The close interdependence between language and vision in human cognition is suggested by the Mirror System Hypothesis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which considers language as not merely symbolic but grounded in sensorimotor experience. This cognitive grounding implies that effective language understanding, especially of action-related expressions, requires grasping subtle distinctions between closely related actions. Recent advances in large language models (LLMs) and the emergence of multimodal LLMs, which are capable of jointly processing textual and visual inputs, allow the integration of perceptual and linguistic reasoning in artificial models. However, it remains unclear to what extent these models can handle the fine-grained action understanding that humans perform with ease, particularly when linguistic descriptions are ambiguous or semantically close.</p>
      <p>To address this gap, we investigate the performance of both textual and multimodal LLMs on the MACID dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a benchmark specifically designed to evaluate the capacity of models to distinguish between subtly different human actions described using similar or identical linguistic expressions. The MACID dataset provides both natural language descriptions and corresponding video clips of the actions, enabling an evaluation of how visual grounding can support or enhance linguistic disambiguation. In this paper, we aim to test the strengths and limitations of current LLMs in grounded language understanding by analyzing their ability to resolve action ambiguities from linguistic or visual input. We experiment with 8 LLMs, two task formats, three prompts of increasing complexity, and two modalities (visual or textual). We compare models' results to random baselines, and we evaluate the role of the lexical component in the disambiguation of actions.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related work</title>
      <sec id="sec-related-1">
        <title>2.1. Action concepts definition</title>
        <p>IMAGACT is a multimodal and multilingual ontology of actions that provides a fine-grained categorization of action concepts, each represented by one or more visual prototypes in the form of recorded videos and 3D animations. IMAGACT currently contains 1,010 scenes that encompass the action concepts most commonly referred to in everyday language usage. Scenes belonging to the same action concept are grouped together and labeled with a unique identification number. The relation between verbs and action concepts is not one-to-one: a single verb may express different concepts, and a concept may be lexicalized by multiple verbs. IMAGACT's multimodal approach supports cross-linguistic comparison and enables accurate mapping between verbs in multiple languages and their underlying event structures, independent of syntactic realization or argument structure. These form-meaning mismatches make action concepts foundational for modeling verb semantics in both theoretical and computational settings.</p>
        <p>The categorization of action concepts proposed in the theoretical framework behind IMAGACT has been validated in a series of experiments with a high inter-annotator agreement [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], confirming that the theoretical framework can be considered well-founded and reproducible.</p>
      </sec>
      <sec id="sec-related-2">
        <title>2.2. LLM benchmarking</title>
        <p>Large Language Models are usually introduced to the community by showcasing their very high performance on classic benchmarks. They are very good at solving complex math problems, writing and debugging code, or answering multiple-choice questions about common knowledge. However, this kind of evaluation does not tell the full story. When LLMs are tested on more realistic tasks, i.e., tasks closer to what a normal person might do, they often lose their super-human performance. These models still struggle with tasks that truly require human-like understanding, such as subtle semantic variations, pragmatic understanding, and so on. So, even if they do very well on traditional benchmarks, their performance in real-life, everyday tasks is still limited.</p>
        <p>Moreover, most of the research and effort in this field is on the English language. The CALAMITA benchmark [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] represents the first of its kind as an Italian-focused collection of tasks that really pose a challenge for commonsense, factual, and linguistic knowledge in Italian.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Experimental Setting</title>
      <sec id="sec-2-1">
        <title>3.1. Data</title>
        <p>
          The data used in this study is taken from the CALAMITA benchmark [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], specifically from the MACID challenge
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This dataset is based on a portion of the LSMDC
dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], a collection of short video clips extracted
from movies along with transcriptions of English DVS
(descriptive video services) for visually impaired people.
The LSMDC dataset is the result of the merging of two
previous datasets, both built upon DVS from movies: the
Max Planck Institut für Informatik Movie Description
Dataset (MPII-MD) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and the Montreal Video
Annotation Dataset (M-VAD) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The textual captions were
manually translated into Italian and modified to depict the
action in the corresponding video and to avoid vague
references (e.g., pronouns substituted with common nouns).
        </p>
        <p>The MACID dataset includes video-caption pairs restricted to a set of similar actions, i.e. to the variation of actions and action verbs linked to "pushing" events. This choice was made to define a challenging task, in which subtle semantic differences occur among the different items. Data have been manually filtered and annotated [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] using the action conceptualization derived from the IMAGACT Multilingual and Multimodal Ontology of Actions [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].</p>
      </sec>
      <sec id="sec-task-formats">
        <title>3.2. Task formats</title>
        <p>Models are evaluated on two distinct versions of the MACID dataset. Initially, models are assessed on an intruder detection task in sets of four sentences: three sentences are related to the same action concept while one is related to a different action concept. The goal of the model is to correctly identify the intruder sentence within each set, that is, the only one referring to an action concept different from the remaining three. The second experiment is performed on the binarized version of the MACID dataset: models were required to compare sentence pairs and classify them as either "different" or "equivalent" with respect to the action concept expressed by the sentence.</p>
        <sec id="sec-2-1-1">
          <title>3.2.1. Multiple choice</title>
          <p>
            The dataset in the original MACID challenge [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] was structured in groups of 4 captions, three of which were annotated as belonging to the same action concept, and one describing a different action type. Each entry in the
dataset is structured as follows:
• id: the quadruple id;
• s1-4: the 4 caption sentences describing the actions;
• v1-4: the reference IDs of the 4 videos depicting the actions;
• intruder: the number (1-4) of the sentence (and video) which is the intruder in the group.
          </p>
          <p>Video files are provided in an additional folder, named with a unique reference ID. An example of the textual data follows.</p>
          <p>(1) I due ragazzi spingono il carrello verso la colonna (The two boys push the cart toward the column) [action id: 65431186]
(2) La donna spinge la signora anziana sulla sedia a rotelle (The woman pushes the elderly lady in the wheelchair) [action id: 65431186]
(3) L'uomo spinge a terra l'aggressore (The man pushes the attacker to the ground) [action id: 18ad2fa9]
(4) L'infermiere spinge la barella (The nurse pushes the gurney) [action id: 65431186]</p>
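          <p>For illustration, a single entry of the quadruple dataset can be pictured as a JSONL record along the following lines (a sketch: the video IDs are invented and the exact field spelling in the released files may differ):</p>
          <preformat>
{"id": "q_0012",
 "s1": "I due ragazzi spingono il carrello verso la colonna",
 "s2": "La donna spinge la signora anziana sulla sedia a rotelle",
 "s3": "L'uomo spinge a terra l'aggressore",
 "s4": "L'infermiere spinge la barella",
 "v1": "vid_0045", "v2": "vid_0046", "v3": "vid_0047", "v4": "vid_0048",
 "intruder": 3}
          </preformat>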
        </sec>
        <sec id="sec-binary">
          <title>3.2.2. Binary choice</title>
          <p>In order to verify the impact of the task format on this challenge, we converted the dataset (as well as the task) into a binary format. This second dataset consists of video-caption pairs, together with their action concept IDs and the information about whether they correspond to the same action type or not. We kept the information about the quadruple ID to allow comparison between the results from the two formats. The columns in the new version of the dataset describe the following information:
• id: the quadruple id;
• s1-2: the 2 caption sentences describing the actions;
• v1-2: the reference IDs of the 2 videos depicting the actions;
• id1-id2: the action concept IDs of the 2 actions;
• different: information about the actions being different (1) or the same (0).</p>
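          <p>Analogously, a row of the binarized dataset can be sketched as follows (again with invented IDs; field names follow the list above):</p>
          <preformat>
{"id": "q_0012",
 "s1": "I due ragazzi spingono il carrello verso la colonna",
 "s2": "L'uomo spinge a terra l'aggressore",
 "v1": "vid_0045", "v2": "vid_0047",
 "id1": "65431186", "id2": "18ad2fa9",
 "different": 1}
          </preformat>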
      </sec>
      <sec id="sec-2-4">
        <title>The empirical investigation with diferent prompting</title>
        <p>strategies aimed at finding the optimal balance between
3.2.2. Binary choice instructions given in a concise form and instructions
In order to verify the impact of the task format on this given using a long and verbose language. This
explochallenge, we converted the dataset (as well as the task) ration involved developing three distinct prompts for
into a binary format. This second dataset consists of each dataset variant, alongside an additional experiment
video-caption pairs, together with their action concept utilizing few-shot examples without explicit instructions.
IDs and the information about whether they correspond To expand the analysis on how the instruction given
to the same action type or not. We kept the information in the prompt influences the outcomes, each prompt was
about the quadruple ID to allow comparison between the tested under both zero-shot and few-shot conditions. Five
results from the two formats. The columns in the new examples were selected from the quadruple dataset and
version of the dataset describe the following information: four from the paired dataset, with consistent example
sets maintained throughout the evaluation process. The
selection of five examples from the quadruple dataset
was strategically designed to encompass all possible verb
relationship combinations: one example featuring four
distinct verbs, one with three diferent verbs, one
containing two diferent and two identical verbs, one with
verbs paired identically, and one where all verbs were
identical.
4.2. Textual and visual settings
3.3. Models</p>
      </sec>
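        <p>As a sketch of how the conditions combine (the function and variable names here are illustrative, not taken from the released code), a complete prompt is simply the concatenation of an optional task description, optional few-shot examples, and the target item:</p>
        <preformat>
def build_prompt(description, examples, item):
    """Compose a prompt: an optional task description (empty for the NONE
    variant), optional few-shot examples (empty for zero-shot), then the
    item to be judged."""
    parts = []
    if description:
        parts.append(description)   # SHORT, MEDIUM, or LONG text
    parts.extend(examples)           # [] in the zero-shot setting
    parts.append(item)               # the quadruple or the sentence pair
    return "\n\n".join(parts)
        </preformat>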
      <sec id="sec-2-5">
        <title>In order to test the models on the diferent settings pro</title>
        <p>For this experiment, we tested a bunch of textual models: posed in the MACID’s experiments, we wrote a Python
ifve small models with 7/8/9 billion parameters (Llama3.1, script that interrogates an OpenAI API compatible
backQwen2.5, Aya-expanse, Mistral, Minerva, Gemma2), one end to perform interrogation and evaluation of the
modmedium native-Italian model with 14 billion parameters els. The script loads the data from JSONL files and
formu(Velvet), and one big model with 72 billion parameters lates the diferent complete prompts for each datapoint.
(Qwen2.5). To evaluate the results, the scripts only consider the first
sampled token and check if it corresponds to the expected
outcome. For the experiment on quadruples, only the
4. Prompting strategies ifrst character of the first token is considered and checked
against the number identifying the intruder sentence. In
In both scenarios (multiple or binary choice), we tested the experiment of couples, considered that the model
three prompts, built with incremental information. The was asked to answer either “yes” (sì) or “no” (no), the
ifrst prompt (SHORT) is the same proposed for the origi- ifrst sampled token was converted in lower case and
acnal MACID Challenge, and it is a baseline with just the cents were removed, so that it was possible to check it
necessary information to execute the task. The second regardless of the case or the use of the accent on the word
sì, required in formally correct Italian but that may be Outline3), because the requested output format is
straightomitted without changing the sentence’s meaning even forward and we considered a good adherence to it as part
by native speakers. As a backend, we employed vLLM of the task. Restricting the amount of output tokens to
with Flash Attention 2.7 for optimal performance for all 4 also allowed for a great saving of resources, given the
the 7B, 8B and 14B models. Qwen 2.5 72b was instead high computational costs of autoregressive generation.
accessed using the “OpenRouter” API and loaded with Some models were not able to perfectly adhere to the
BF16 weights. All the models were set to a temperature instructions, but this behavior seems related to some
of 0.0 and a random seed of "27" in order to obtain re- task formats. Aya-expanse-8b does not follow the
reproducible results. All the results were then saved in a quired format with all three prompts when tested for
SQLite database for easy access.1 binary response without examples. Gemma-2-9b
pro</p>
        <p>We decided to purposely opt for a strict evaluation vides unacceptable responses for all the binary task’s
strategy: answers where the model wrote any kind of conditions.4 Minerva-7B-instruct-v1.0, with no diference
text before the actual task’s answer - such as chattering, between prompts and binary/multiple choice tasks, does
boilerplate text, reasoning traces, or unwanted answer’s not adhere in the zero-shot setting, with the exception
formatting - were automatically discarded by the eval- of the short prompt in the binary task.
uation script, that expected the correct answer to be in
the very first characters of the model’s response. This
decision is motivated by the fact that we also wanted to
test the models’ capabilities to strictly adhere to the given
instructions: a model that talks too much or return the
answer in an unwanted format is a model that may pose
problems in production scenario, such as higher costs,
due to the generation of more tokens, or the need to add
post-processing strategies.</p>
        <p>Binary choice task Among the small models
(ranging between 7 and 14 billion parameters),
llama-3.1-8binstruct reaches the best results, with a .696 accuracy
when instructed with the long prompt in a few-shot
setting. This model reaches high accuracy (.689) even with
the short prompt with examples and with the examples
alone, showing generally a preference for the few-shot
setting with respect to the zero one (with a .133 diference
in accuracy between the few and zero-shot setting with
5. Results the long prompt, Table 1).</p>
        <p>Qwen-2.5-72b reaches the highest accuracy (.725)
In this section, we discuss the results obtained across among all models, with the long prompt and the
fewall the experimental scenarios (i.e., prompting strategies, shot setting. However, despite the huge diference in
zero/few-shot, multiple/binary choice). On both task for- parameters, it is outperformed in short_zero setting by
mats, we defined a majority class baseline. The baseline Llama-3.1-8b. As noted above, some models (i.e.,
Minervaaccuracy for the multiple choice task is 28% , while for 7b and aya-expanse-8b) do not provide satisfying replies
the binary choice task it is 50%. in some conditions (marked as ND in Table 1).</p>
        <p>In general, the few-shot setting improves the results
in the binary task, even if in some cases the diference is</p>
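        <p>The evaluation loop for the binary task can be sketched as follows (a minimal illustration, not the released script: the base URL, model name, file path and instruction text are placeholders; the normalization mirrors the lowercasing and accent stripping described above):</p>
        <preformat>
import json
import unicodedata
from openai import OpenAI

# Any OpenAI-API-compatible backend works, e.g. a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

INSTRUCTIONS = "..."  # placeholder for one of the SHORT/MEDIUM/LONG prompts

def normalize(text):
    """Lowercase and strip accents, so that 'Sì', 'SI' and 'si' all match."""
    text = unicodedata.normalize("NFKD", text.strip().lower())
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def ask(prompt):
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic decoding
        seed=27,          # fixed seed for reproducibility
        max_tokens=4,     # only a very short answer is allowed
    )
    return resp.choices[0].message.content or ""

correct = total = 0
with open("binary_pairs.jsonl", encoding="utf-8") as f:  # placeholder path
    for line in f:
        item = json.loads(line)
        prompt = f"{INSTRUCTIONS}\n1) {item['s1']}\n2) {item['s2']}\nRisposta:"
        answer = normalize(ask(prompt))
        expected = "no" if item["different"] else "si"
        # strict check: the answer must start with the expected token
        correct += answer.startswith(expected)
        total += 1
print(f"accuracy: {correct / total:.3f}")
        </preformat>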
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5.1. Results with textual LLMs small.</title>
        <p>Figure 1 reports the performance of the models tested in both the multiple-choice (1a) and the binary-choice (1b) task. Before illustrating the results, we present an evaluation of the ability of the models to follow the instructions and to provide the answer in the required format. Indeed, we forced the model to reply with only 4 tokens, since we expected a yes/no answer for the binary task and a number identifying the intruder sentence in the multiple-choice task. The desired output format has been unambiguously specified in the prompts (see Appendix A), although we decided not to be strict in accepting answers: upper/lower case, accents, or additional spacing have been tolerated whenever the "yes/no" or "1/2/3/4" strings were present in the answer. We did not use any additional tool to constrain the output (e.g., Guidance, https://github.com/guidance-ai/guidance, or Outlines, https://github.com/dottxt-ai/outlines), because the requested output format is straightforward and we considered a good adherence to it as part of the task. Restricting the number of output tokens to 4 also allowed for a great saving of resources, given the high computational cost of autoregressive generation. Some models were not able to perfectly adhere to the instructions, but this behavior seems related to specific task formats. Aya-expanse-8b does not follow the required format with any of the three prompts when tested for binary response without examples. Gemma-2-9b provides unacceptable responses in all the binary task's conditions (given this behavior, we excluded Gemma-2-9b from the summary tables reported in the Appendix). Minerva-7B-instruct-v1.0, with no difference between prompts and binary/multiple-choice tasks, does not adhere in the zero-shot setting, with the exception of the short prompt in the binary task.</p>
        <p>[Figure 1: performance of the tested models in (a) the multiple-choice task format and (b) the binary-choice task format.]</p>
        <p>Binary choice task. Among the small models (ranging between 7 and 14 billion parameters), llama-3.1-8b-instruct reaches the best results, with a .696 accuracy when instructed with the long prompt in a few-shot setting. This model reaches high accuracy (.689) even with the short prompt with examples and with the examples alone, showing generally a preference for the few-shot setting with respect to the zero-shot one (with a .133 difference in accuracy between the few-shot and zero-shot settings with the long prompt, Table 1).</p>
        <p>Qwen-2.5-72b reaches the highest accuracy (.725) among all models, with the long prompt and the few-shot setting. However, despite the huge difference in parameters, it is outperformed in the short_zero setting by Llama-3.1-8b. As noted above, some models (i.e., Minerva-7b and aya-expanse-8b) do not provide satisfying replies in some conditions (marked as ND in Table 1).</p>
        <p>[Tables 1 and 2: per-model accuracy under each prompt and zero/few-shot condition, for the binary and the multiple-choice task respectively; rows cover minerva-7b-instruct-v1.0, mistral-7b-instruct-v0.3, qwen2.5-7b-instruct, aya-expanse-8b, llama-3.1-8b-instruct, gemma-2-9b, velvet-14b, qwen-2.5-72b-instruct, and the baseline.]</p>
        <p>In general, the few-shot setting improves the results in the binary task, even if in some cases the difference is small. With regard to the prompt type, 5 models out of 7 show a preference for the long prompt. Aya-expanse-8b does slightly better with the medium prompt (.647) than with the detailed prompt (.640), whereas Velvet-14B achieves the same accuracy with both (.507). Native Italian models do not perform better than the others: the results of Velvet-14b are close to chance, whereas Minerva-7b performs better in the long few-shot setting.</p>
        <p>We additionally analyze the impact of the lexical component on models' performance, i.e., we look at whether and how models are facilitated when actions are expressed by different verbs (Table 5, Appendix B) and when they are expressed by the same one (Table 4, Appendix B). Most models achieve higher accuracy when actions are expressed by different verbs: it is easier to discriminate whether two sentences express the same action if their lexical description is different as well. When the verbs are equal, accuracy decreases. This difference is smoother when examples are added to the prompts, and it increases with the short prompt. A notable exception is given by llama-3.1-8b-instruct, which achieves higher accuracy for actions expressed by the same verbs rather than by different verbs (reaching a value of .933 in the long-zero format). When looking in more detail at its behavior, we note that this happens with the two most detailed prompts, and we hypothesize that it may be due to the specification, included in these prompts, that there is no one-to-one matching between action concepts and verbs.</p>
        <p>Multiple choice task. Among the small models, qwen2.5-7b reaches the best results, with a .568 accuracy when instructed with the examples. However, differently from the binary task, the gap with the larger model (qwen-2.5-72b) is notable, with the latter performing very well across all conditions and reaching an accuracy of 0.737 in three of them (few-shot with medium, long, and no prompt). Even if it has been noted frequently that LLMs do not perform well with multiple-choice tasks, in this challenge they do better than in the binary-choice one, considering the random baseline for each task (Table 2). As noted for the binary task, providing a few examples increases accuracy. Exceptions, however, are found for the short prompt: velvet-14b and aya-expanse-8b have a slightly higher accuracy with the zero-shot setting than with the few-shot one. The zero/few-shot setting also has an influence on the ability of Minerva-7b to comply with the required output format: when provided with examples, it follows the instructions, whereas it does not in the zero-shot prompt.</p>
        <p>Contrary to what we observed above for the binary task, the prompt type does not widely influence the results: accuracy values for most models (minerva-7b, mistral-7b, qwen2.5-7b, llama-3.1-8b, qwen-2.5-72b) are equal across the different prompts.</p>
        <p>As for the binary task, the verb used to describe the intruder has an impact: if it is the same as (at least one of) the other sentences, models' performance drops, even if less strongly (Tables 6 and 7 in Appendix B).</p>
      </sec>
      <sec id="sec-results-2">
        <title>5.2. Results with visual LLMs</title>
        <p>The MACID dataset includes all the original videos referred to by the sentences. This setting enabled us to conduct an exploratory experiment with multimodal models, particularly those capable of processing video inputs. At the time of writing, video models are in their early developmental stages. A great effort is ongoing to understand optimal methods for integrating video information into language models, as video data presents challenges for transformer architectures due to the quadratic computational cost of self-attention over long sequences. Moreover, different research groups are experimenting with different architectural choices to ensure an effective alignment between video and language latent spaces [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We conducted experiments with two state-of-the-art video models: Qwen 2.5 VL 8B [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and VideoLLama3 7B [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The models were executed on a local machine using the configurations recommended in the official documentation. Both Qwen and VideoLLama rely on Hugging Face's "transformers" library, which includes the necessary code for running these video models. Both models handle videos of arbitrary resolution sampled at user-defined framerates. To keep memory usage manageable, we resized the original videos to 360x288 resolution. While this resolution is lower than that of the original files, often in FullHD (1920x1080) or PAL DVD (720x576) format, it remains perfectly intelligible to human viewers, being comparable to VideoCD (352x288) and VHS tape quality (240 horizontal TV lines). The framerate was set to 8 fps because we decided to avoid very low framerates, given that video samples are brief (&lt;4s) and consistently represent live action.</p>
        <p>Following the text-only experiments, we selected the prompt that performed best on average and adapted it for video model testing. Specifically, we modified the medium prompt to accommodate the video experiment, substituting sentences with video clips. Due to memory constraints, we executed the experiment exclusively on the binary task. Neither Qwen VL nor VideoLLama successfully handled the task: both models always returned "No" for every tested video pair. Interestingly, Qwen VL also provided brief video descriptions. We speculate that the poor performance of video models on this task relates to difficulties in coherently processing temporal sequences and performing cross-domain inferences between visual and textual features. Moreover, the prompt being written in Italian and the presentation of two videos simultaneously, rather than the single-video setting usually employed during pre-training, further deviated the experimental conditions from the training distribution, substantially increasing task complexity. Testing multimodal and, in particular, video models poses significant challenges, and we believe that the MACID task can become a useful benchmark to assess models' ability to correctly identify complex actions.</p>
        <p>For this reason, we leave to future work a more extensive experimentation with video models, including prompt formulation modifications, testing new models, as well as trying fine-tuning operations.</p>
      </sec>
      <sec id="sec-results-3">
        <title>5.3. Discussion</title>
      <sec id="sec-3-1">
        <title>This study evaluates LLMs on the action concepts discrim</title>
        <p>ination task: we present the results for 7 LLMs evaluated
5.3. Discussion on the MACID dataset.</p>
        <p>Results show a wide variation in models’ performances,
Model Average error rate depending on the model type, the number of model
paminerva-7b-instruct-v1.0 0.166 rameters, the prompt used, and the task format.
mistral-7b-instruct-v0.3 0.0 Qwen-2.5-72b obtained the highest average accuracy
qwen2.5-7b-instruct 0.0 both on the binary and the multiple-choice task,
confirmaya-expanse-8b 0.306 ing that the number of parameters is a core factor in this
llama-3.1-8b-instruct 0.0 type of semantically complex task.
gemma-2-9b 0.867 Italian models (Minerva and Velvet) perform poorly in
vqewlveent--21.45B-72b-instruct 00..00 both task formats. This is an unexpected result,
considering the task requires fine-grained semantic abilities.</p>
        <p>Table 3 Among 7B/8B models, top results are achieved by
Average error rate for each model, grouped and averaged for Qwen-2.5, in multiple-choice format (acc. 0.568), and
all tasks. Llama-3.1 in binary format (acc. 0.696). The latter
obtains an accuracy comparable with Qwen-2.5-72b (0.725),</p>
        <p>Table 3 reports the average values of unacceptable re- despite the diference in the number of parameters.
sponses per model, in each task, i.e. responses where the On average, few-shot prompting works better than
models did not adhere to the requested output format. zero-shot, both in binary and in multiple-choice task
forAs already stated, beside the objective of testing the abil- mats. In general, we don’t find strong performance
diferity of LLMs to interpret and discriminate descriptions ences among the three versions of the task description in
of physical actions, we also want them to show their the prompt (SHORT, MEDIUM, and LONG), while there
ability to follow the instructions given to them. One of is a consistent accuracy improvement with the few-shot
the main problem we faced with our experiments is that prompting. Even the few-shot without task description
responses from models tend to be overly verbose, as mod- (none_few) has a good accuracy on the top models.
els need to explain their choices every time. While this Finally, the lexical components have a strong influence
may be considered a useful and interesting behavior in on models’ behavior in this task: the accuracy varies a
chat models, it is definitely not ideal in instruct models, lot if the two sentences use the same verb or diferent
as those tested in our experiments. As it is specified in verbs (in the binary task) or if the intruder has the same
all our prompts, we explicitly ask to answer with the id verb as the other sentences or not (in the multiple-choice
of the intruder for the multiple-choice and with "sì" or task). The accuracy gap between these two cases is huge
"no" (yes or no) for the binary-choice task (see Appendix with Qwen, which seems to be more sensitive to lexical
A), thus the request is clear. Nevertheless, sometimes diferences than Llama. For example, Qwen-2.5-72b on a
models tend to elude the requested response format (i.e., binary task reaches 0.975 accuracy with diferent verbs
the answer does not start with a valid id number for the and 0.579 with the same verb.
multiple-choice task, or it does not start with "sì/no" for Further experiments need to be done with video LLMs,
the binary-choice task), while others apply absolutely un- which did not provide satisfactory results in this first
necessary markup (e.g., aya-expanse-8b). Our evaluation experimentation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Prompts</title>
      <p>Prompts used for the experiments in binary and multiple-choice
tasks.</p>
      <p>Binary task</p>
      <sec id="sec-4-1">
        <title>Zero-shot prompts</title>
        <p>Three variants have been used, with increasing description details.
scatola’ e ’spingere una scatola’) e, viceversa, un verbo può
rappresentare più concetti azionali distinti (ad es. ’aprire
una porta’ vs. ’aprire una noce’). Nell’individuare un
concetto azionale, è importante concentrare l’attenzione su
quali cambiamenti vengono compiuti dall’azione
rappresentata, non sul verbo. Rispondi ’Sì’ se ritieni che entrambe
le frasi si riferiscano allo stesso concetto azionale, rispondi
’No’ se ritieni che descrivano due concetti azionali diversi.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Few-shot prompts</title>
        <p>Few-shot prompts are created by appending 4 examples to the three
variants of zero-shot prompts; additionally, a fourth prompt with
only examples and no description is provided. The following
examples have been used.</p>
        <p>1) I ragazzi spingono i carrelli lungo il binario del treno
2) La donna con gli occhiali da sole spinge l’anziana signora sulla
sedia a rotelle</p>
        <p>Risposta: Sì
1) L’uomo spinge una carriola nel cortile della fattoria mentre
parla con la donna
2) Il veterinario spinge lo stantuffo della siringa
Risposta: No
1) La donna preme sul posacenere al centro del tavolo
2) Il ragazzo spinge le scope nel ripostiglio
Risposta: No
1) La donna sposta leggermente la tenda di perline
2) La donna spinge in alto il pannello di vetro</p>
        <p>Risposta: Sì
Multiple-choice task
Zero-shot prompts
1. In questo task ti verranno proposte quattro frasi che
descrivono azioni fisiche. Tre di queste azioni sono dello
stesso tipo, mentre una è di un tipo diverso. Individua
la frase che descrive l’azione di tipo diverso. Esiste solo
una risposta esatta, rispondi utilizzando esclusivamente il
numero di riferimento della frase e nient’altro.
2. In questo task ti verranno proposte quattro frasi che
descrivono azioni fisiche. Tre di queste azioni sono
dello stesso tipo, ovvero rappresentano lo stesso concetto
azionale, mentre una è di un tipo diverso. Un concetto
azionale è un’entità linguistico-cognitiva corrispondente a
un pattern di modifiche del mondo compiute da un agente,
ed è generalizzabile a vari oggetti (o azioni). Un concetto
azionale può essere realizzato linguisticamente con più
verbi e, viceversa, un verbo può rappresentare più concetti
azionali distinti. Tra le seguenti quattro frasi, individua
la frase che descrive l’azione di tipo diverso dalle altre tre.
Esiste solo una risposta esatta, rispondi utilizzando
esclusivamente il numero di riferimento della frase e nient’altro.
3. In questo task ti verranno proposte quattro frasi che
descrivono azioni fisiche. Tre di queste azioni sono
dello stesso tipo, ovvero rappresentano lo stesso concetto
azionale, mentre una è di un tipo diverso. Un concetto
azionale è un’entità linguistico-cognitiva corrispondente a
un pattern di modifiche del mondo compiute da un agente
(umano, animale o macchina), ed è generalizzabile a vari
oggetti (o azioni). Si tratta di una rappresentazione
cognitiva di un evento o di un processo che coinvolge,
prototipicamente, un agente (chi compie l’azione), un tema o paziente
(sul quale si esercita l’azione) e, talvolta, uno strumento, un
destinatario o una destinazione. Un concetto azionale è
produttivo, ovvero può applicarsi a un’ampia varietà di oggetti
e si presenta in contesti diversi. L’associazione tra concetto
azionale e verbo che lo descrive non è un rapporto di tipo
uno-a-uno. Infatti, un concetto azionale può essere
realizzato linguisticamente con più verbi (ad es. ’spostare una
scatola’ e ’spingere una scatola’) e, viceversa, un verbo può
rappresentare più concetti azionali distinti (ad es. ’aprire
una porta’ vs. ’aprire una noce’). Nell’individuare un
concetto azionale, è importante concentrare l’attenzione su
quali cambiamenti vengono compiuti dall’azione
rappresentata, non sul verbo. Tra le seguenti quattro frasi,
individua la frase che descrive l’azione di tipo diverso dalle
altre tre. Esiste solo una risposta esatta, rispondi
utilizzando esclusivamente il numero di riferimento della frase
e nient’altro.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Few-shot prompts</title>
        <p>Few-shot prompts are created by appending 4 examples to the three
variants of zero-shot prompts; additionally, a fourth prompt with
only examples and no description is provided.</p>
        <p>1) I ragazzi spingono i carrelli lungo il binario del treno
2) La donna con gli occhiali da sole spinge l’anziana signora sulla
sedia a rotelle</p>
        <p>3) L’uomo spinge una carriola nel cortile della fattoria mentre
parla con la donna
4) Il veterinario spinge lo stantuffo della siringa
Intruso: 4
1) Il ragazzo si tira su in ginocchio
2) L’uomo si spinge sulle braccia per alzarsi in piedi
3) Il ragazzo ferito si spinge sui gomiti
4) L’operatore spinge in basso la leva dell’ascensore
Intruso: 4
1) La donna spinge l’uomo sul letto per farlo sdraiare
2) Il veterinario spinge lo stantuffo della siringa
3) L’uomo armato sposta il compagno dietro di lui
4) Il marinaio sposta i corpi galleggianti con le mani
Intruso: 2
1) La donna sposta leggermente la tenda di perline
2) La ragazza abbassa la mano del ragazzo con la pistola
3) La donna spinge in alto il pannello di vetro
4) La donna preme un pulsante del suo orologio
Intruso: 4
1) La donna preme sul posacenere al centro del tavolo
2) Il ragazzo spinge le scope nel ripostiglio
3) Il ragazzo spinge il pulsante di rilascio della cintura di sicurezza
4) L’uomo di scatto chiama l’ascensore
Intruso: 2</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>B. Complete results</title>
      <p>[Table: accuracy values for quadruples where the intruder is expressed by a different verb.]</p>
      <p>Declaration on Generative AI: During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arbib</surname>
          </string-name>
          , G. Rizzolatti,
          <article-title>Neural expectations: A possible evolutionary path from manual skills to language</article-title>
          ,
          <source>Communication and Cognition</source>
          <volume>29</volume>
          (
          <year>1996</year>
          )
          <fpage>393</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Varvara</surname>
          </string-name>
          , L. Gregori, MACID - multimodal
          <article-title>ACtion IDentification: A CALAMITA challenge</article-title>
          , in: F.
          <string-name>
            <surname>Dell'Orletta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Montemagni</surname>
          </string-name>
          , R. Sprugnoli (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics (CLiCit</source>
          <year>2024</year>
          ), CEUR Workshop Proceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>1234</fpage>
          -
          <lpage>1238</lpage>
          . URL: https://aclanthology.org/2024.clicit-1.137/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moneglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Frontini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gagliardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Monachini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panunzi</surname>
          </string-name>
          , et al.,
          <article-title>The imagact visual ontology. an extendable multilingual infrastructure for the representation of lexical encoding of action</article-title>
          ,
          <source>in: Proceedings of the Ninth International Conference on Language Resources and Evaluation-LREC'14</source>
          ,
          <string-name>
            <surname>European Language Resources Association</surname>
          </string-name>
          (ELRA),
          <year>2014</year>
          , pp.
          <fpage>3425</fpage>
          -
          <lpage>3432</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA - Challenge the Abilities of LAnguage Models in ITAlian: Overview</article-title>
          , in
          <source>: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          , Movie description,
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>
          (
          <year>2017</year>
          )
          <fpage>94</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>A dataset for movie description</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>3202</fpage>
          -
          <lpage>3212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Torabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <article-title>Using descriptive video services to create a large data source for video annotation research</article-title>
          ,
          <source>arXiv preprint arXiv:1503.01070</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          ,
          <article-title>Annotation of linguistically derived action concepts in computer vision datasets</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Florence,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gagliardi</surname>
          </string-name>
          ,
          <article-title>Rappresentazione dei concetti azionali attraverso prototipi e accordo nella categorizzazione dei verbi generali. una validazione statistica</article-title>
          ,
          <source>in: Proceedings of the First Italian Conference on Computational Linguistics-CLiC-it</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakamizo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yanai</surname>
          </string-name>
          ,
          <article-title>Act-ChatGPT: Introducing Action Features into Multi-modal Large Language Models for Video Understanding</article-title>
          ,
          <source>in: Pattern Recognition (ICPR 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xie</surname>
          </string-name>
          , Z. Cheng, H. Zhang,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Qwen2.5-VL technical report</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.13923. arXiv:2502.13923.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , G. Chen,
          <string-name>
            <given-names>S.</given-names>
            <surname>Leng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Videollama 3: Frontier multimodal foundation models for image and video understanding</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2501.13106. arXiv:2501.13106.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>