1. Introduction

M. Rinaldi);

Evaluating Models, Prompting Strategies, and Task Formats: a Case Study on the MACID Challenge

Matteo Rinaldi

Rossella Varvara

Lorenzo Gregori

Andrea Amelio Ravelli

0 2 0 University of Bologna , Via Cartoleria 5, 40124 Bologna , Italy 1 University of Florence , Via della Pergola 60, 50126 Firenze , Italy 2 University of Trento , Via Giuseppe Verdi 26, 38122 Trento , Italy 3 University of Turin , Corso Svizzera 185, 10149 Torino , Italy

2025

000 0 0001

In this study, we test the ability of 8 Large Language Models to discriminate closely related action concepts, based on textual descriptions or on video representations. Our aim is to understand if these models can handle the fine-grained action understanding that humans perform with ease, particularly when there are cases of action-predicate mismatches, i.e., the same verb may describe diferent actions, or diferent verbs may refer to the same action. We experiment on the MACID dataset, a dataset of actions representing "pushing" events and manually annotated for action IDs taken from the IMAGACT ontology. We evaluate how prompt complexity and task formats influence models' performance. Particularly, we test three diferent prompts with or without examples, two task formats (binary or multiple choice task), and two modalities (textual or visual). Results indicate that the binary task is not easier than the multiple-choice one, and that few-shot prompting generally improves models' accuracy. Moreover, LLMs perform better when helped by lexical cues: accuracy increases when actions are expressed by diferent verbs, whereas it is lower when actions are expressed by the same verb.

eol>large language models action concept understanding prompting strategies task definition

1. Introduction

[ 2 ], a benchmark specifically designed to evaluate the capacity of models to distinguish between subtly diferUnderstanding human action is a cornerstone of both lin- ent human actions described using similar or identical guistic and perceptual intelligence. The close interdepen- linguistic expressions. The MACID dataset provides both dence between language and vision in human cognition natural language descriptions and corresponding video is suggested by the Mirror System Hypothesis [ 1 ], which clips of the actions, enabling an evaluation of how visual considers language as not merely symbolic but grounded grounding can support or enhance linguistic disambiguain sensorimotor experience. This cognitive grounding tion. In this paper, we aim to test the strengths and implies that efective language understanding, especially limitations of current LLMs in grounded language unof action-related expressions, requires grasping subtle derstanding by analyzing the ability of LLMs to resolve distinctions between closely related actions. Recent ad- action ambiguities from linguistic or visual input. We vances in large language models (LLMs) and the emer- experiment considering 8 LLMs, two task formats, three gence of multimodal LLMs, which are capable of jointly prompts of increasing complexity, and two modalities processing textual and visual inputs, allow the integra- (visual or textual). We compare models’ results to rantion of perceptual and linguistic reasoning in artificial dom baselines, and we evaluate the role of the lexical models. However, it remains unclear to what extent these component in the disambiguation of actions. models can handle the fine-grained action understanding that humans perform with ease, particularly when linguistic descriptions are ambiguous or semantically close. 2. Related work To address this gap, we investigate the performance of both textual and multimodal LLMs on the MACID dataset 2.1. Action concepts definition relation between verbs and action concepts is not one- subtle semantic diferences occur among the diferent to-one: a single verb may express diferent concepts, and items. Data have been manually filtered and annotated a concept may be lexicalized by multiple verbs. IMA- [ 8 ] using the action conceptualization derived from the GACT’s multimodal approach supports cross-linguistic IMAGACT Multilingual and Multimodal Ontology of Accomparison and enables accurate mapping between verbs tions [ 3 ]. IMAGACT is a multimodal and multilingual in multiple languages and their underlying event struc- ontology of actions that provides a fine-grained categotures, independent of syntactic realization or argument rization of action concepts, each represented by one or structure. These form-meaning mismatches make action more visual prototypes in the form of recorded videos concepts foundational for modeling verb semantics in and 3D animations. IMAGACT currently contains 1,010 both theoretical and computational settings. scenes that encompass the action concepts most commonly referred to in everyday language usage. Scenes 2.2. LLM benchmarking belonging to the same action concept are grouped together and labeled with a unique identification number.

The categorization of action concepts proposed in the theoretical framework behind IMAGACT has been validated in a series of experiments with a high inter-annotator agreement [ 9 ], confirming that the theoretical framework can be considered well-founded and reproducible.

Large Language Models are usually introduced to the community by showcasing their very high performance on classic benchmarks. They are very good at solving complex math problems, writing and debugging code, or answering multiple-choice questions about common knowledge. However, this kind of evaluation does not tell the full story. When LLMs are tested on more realistic 3.2. Task formats tasks, i.e., closer to what a normal person might do, they often lose their super-human performance. These models Models are evaluated on two distinct versions of the still struggle with tasks that truly require human-like MACID dataset. Initially, models are assessed on an inunderstanding, such as subtle semantic variations, prag- truder detection task in sets of four sentences: three matic understanding, and so on. So, even if they do very sentences are related to the same action concept while well on traditional benchmarks, their performance in one is related to a diferent action concept. The goal of real-life or more everyday human life tasks is still limited. the model is to correctly identify the intruder sentence

Moreover, most of the research and efort in this field within each set, that is, the only one referring to an action is on the English language. The CALAMITA benchmark concept diferent than the remaining three. [ 4 ] represents the first of its kind as an Italian-focused The second experiment is performed on the binarized collection of tasks that really pose a challenge for com- version of the MACID dataset: models were required to monsense, factual, and linguistic knowledge in Italian. compare sentence pairs and classify them as either "different" or "equivalent" with respect to the action concept expressed by the sentence.

3. Experimental Setting

3.1. Data

The data used in this study is taken from the CALAMITA

benchmark [ 4 ], specifically from the MACID challenge [ 2 ]. This dataset is based on a portion of the LSMDC dataset [ 5 ], a collection of short video clips extracted from movies along with transcriptions of English DVS (descriptive video services) for visually impaired people. The LSMDC dataset is the result of the merging of two previous datasets, both built upon DVS from movies: the Max Planck Institute für Informatik Movie Description Dataset (MPII-MD) [ 6 ], and the Montreal Video Annotation Dataset (M-VAD) [ 7 ]. The textual captions were manually translated into Italian and modified to depict the action in the corresponding video and to avoid vague references (e.g., pronouns substituted with common nouns).

The MACID dataset includes video-caption pairs restricted to a set of similar actions, i.e. to the variation of actions and action verbs linked to "pushing" events. This choice was made to define a challenging task, in which

3.2.1. Multiple choice

The dataset in the original MACID challenge [ 2 ] was structured on groups of 4 captions, three of which were annotated as belonging to the same action concept, and one describing a diferent action type. Each entry in the dataset is structured as follows: • id: the quadruple id; • s1-4: the 4 caption sentences describing the actions; • v1-4: the reference ID of the 4 videos depicting the actions; • intruder: the number (1-4) of the sentence (and video) which is the intruder in the group.

Video files are provided in an additional folder, named with a unique reference ID. An example of the textual data follows.

(1) I due ragazzi spingono il carrello verso la colonna (The two boys push the cart toward the column) [action id: 65431186] (2) La donna spinge la signora anziana sulla sedia a rotelle (The woman pushes the elderly lady in the wheelchair) [action id: 65431186] (3) L’uomo spinge a terra l’aggressore (The man pushes the attacker to the ground) [action id: 18ad2fa9] (4) L’infermiere spinge la barella (The nurse pushes the gurney) [action id: 65431186] prompt (MEDIUM) adds to the first more details about what an action concept is, and what are the main features which discriminate between close but diferent actions. The third prompt (LONG) elaborates more on the theoretical distinction between actions and is enriched with some explanation about the possible mismatch between actions and verbs. Finally, we added to the experimental setting a version of the task without any explanation (NONE), but with only some examples. All prompts were formulated in Italian to assess both the models’ sentence processing capabilities and their ability to correctly interpret instructions given in the Italian language. All prompts are reported in the Appendix A. 4.1. Zero or few-shot settings • id: the quadruple id; • s1-2: the 2 caption sentences describing the actions; • v1-2: the reference IDs of the 2 videos depicting the actions; • id1-id2: the action concept IDs of the 2 actions; • diferent: information about the actions being diferent (1) or the same (0).

The empirical investigation with diferent prompting

strategies aimed at finding the optimal balance between 3.2.2. Binary choice instructions given in a concise form and instructions In order to verify the impact of the task format on this given using a long and verbose language. This explochallenge, we converted the dataset (as well as the task) ration involved developing three distinct prompts for into a binary format. This second dataset consists of each dataset variant, alongside an additional experiment video-caption pairs, together with their action concept utilizing few-shot examples without explicit instructions. IDs and the information about whether they correspond To expand the analysis on how the instruction given to the same action type or not. We kept the information in the prompt influences the outcomes, each prompt was about the quadruple ID to allow comparison between the tested under both zero-shot and few-shot conditions. Five results from the two formats. The columns in the new examples were selected from the quadruple dataset and version of the dataset describe the following information: four from the paired dataset, with consistent example sets maintained throughout the evaluation process. The selection of five examples from the quadruple dataset was strategically designed to encompass all possible verb relationship combinations: one example featuring four distinct verbs, one with three diferent verbs, one containing two diferent and two identical verbs, one with verbs paired identically, and one where all verbs were identical. 4.2. Textual and visual settings 3.3. Models

In order to test the models on the diferent settings pro

For this experiment, we tested a bunch of textual models: posed in the MACID’s experiments, we wrote a Python ifve small models with 7/8/9 billion parameters (Llama3.1, script that interrogates an OpenAI API compatible backQwen2.5, Aya-expanse, Mistral, Minerva, Gemma2), one end to perform interrogation and evaluation of the modmedium native-Italian model with 14 billion parameters els. The script loads the data from JSONL files and formu(Velvet), and one big model with 72 billion parameters lates the diferent complete prompts for each datapoint. (Qwen2.5). To evaluate the results, the scripts only consider the first sampled token and check if it corresponds to the expected outcome. For the experiment on quadruples, only the 4. Prompting strategies ifrst character of the first token is considered and checked against the number identifying the intruder sentence. In In both scenarios (multiple or binary choice), we tested the experiment of couples, considered that the model three prompts, built with incremental information. The was asked to answer either “yes” (sì) or “no” (no), the ifrst prompt (SHORT) is the same proposed for the origi- ifrst sampled token was converted in lower case and acnal MACID Challenge, and it is a baseline with just the cents were removed, so that it was possible to check it necessary information to execute the task. The second regardless of the case or the use of the accent on the word sì, required in formally correct Italian but that may be Outline3), because the requested output format is straightomitted without changing the sentence’s meaning even forward and we considered a good adherence to it as part by native speakers. As a backend, we employed vLLM of the task. Restricting the amount of output tokens to with Flash Attention 2.7 for optimal performance for all 4 also allowed for a great saving of resources, given the the 7B, 8B and 14B models. Qwen 2.5 72b was instead high computational costs of autoregressive generation. accessed using the “OpenRouter” API and loaded with Some models were not able to perfectly adhere to the BF16 weights. All the models were set to a temperature instructions, but this behavior seems related to some of 0.0 and a random seed of "27" in order to obtain re- task formats. Aya-expanse-8b does not follow the reproducible results. All the results were then saved in a quired format with all three prompts when tested for SQLite database for easy access.1 binary response without examples. Gemma-2-9b pro

We decided to purposely opt for a strict evaluation vides unacceptable responses for all the binary task’s strategy: answers where the model wrote any kind of conditions.4 Minerva-7B-instruct-v1.0, with no diference text before the actual task’s answer - such as chattering, between prompts and binary/multiple choice tasks, does boilerplate text, reasoning traces, or unwanted answer’s not adhere in the zero-shot setting, with the exception formatting - were automatically discarded by the eval- of the short prompt in the binary task. uation script, that expected the correct answer to be in the very first characters of the model’s response. This decision is motivated by the fact that we also wanted to test the models’ capabilities to strictly adhere to the given instructions: a model that talks too much or return the answer in an unwanted format is a model that may pose problems in production scenario, such as higher costs, due to the generation of more tokens, or the need to add post-processing strategies.

Binary choice task Among the small models (ranging between 7 and 14 billion parameters), llama-3.1-8binstruct reaches the best results, with a .696 accuracy when instructed with the long prompt in a few-shot setting. This model reaches high accuracy (.689) even with the short prompt with examples and with the examples alone, showing generally a preference for the few-shot setting with respect to the zero one (with a .133 diference in accuracy between the few and zero-shot setting with 5. Results the long prompt, Table 1).

Qwen-2.5-72b reaches the highest accuracy (.725) In this section, we discuss the results obtained across among all models, with the long prompt and the fewall the experimental scenarios (i.e., prompting strategies, shot setting. However, despite the huge diference in zero/few-shot, multiple/binary choice). On both task for- parameters, it is outperformed in short_zero setting by mats, we defined a majority class baseline. The baseline Llama-3.1-8b. As noted above, some models (i.e., Minervaaccuracy for the multiple choice task is 28% , while for 7b and aya-expanse-8b) do not provide satisfying replies the binary choice task it is 50%. in some conditions (marked as ND in Table 1).

In general, the few-shot setting improves the results in the binary task, even if in some cases the diference is

5.1. Results with textual LLMs small.

Figure 1 reports the performance of the models tested in With regard to the prompt type, 5 models out of 7 both multiple-choice (1a) and binary-choice (1b) tasks. show a preference for the long prompt. Aya-expanse-8b Before illustrating the results, we present an evaluation does slightly better with the medium prompt (.647) with of the ability of the models to follow the instructions respect to the detailed prompt (.640), whereas Velvet-14B and to provide the answer in the required format. In- achieves the same accuracy with both (.507). deed, we forced the model to reply with only 4 tokens, Native Italian models do not perform better than the since we expected a yes/no answer for the binary task others: the results from Velvet-13b are close to chance, and a number to identify the intruder sentence in the whereas Minerva-7b achieves better in the long-few shot multiple-choice task. The desired output format has been setting. unambiguously specified in the prompts (see Appendix We additionally analyze the impact of the lexical comA), although we decided not to be strict in accepting an- ponent on models’ performance, i.e., we look at if and swers: upper/lower case, accents, or additional spacing, how models are facilitated when actions are expressed have been tolerated whenever the "yes/no" or "1/2/3/4" by diferent verbs (Table 5, Appendix B) and when they strings were present in the answer. We didn’t use any are expressed by the same one (Table 4, Appendix B). additional tool to constrain the output (e.g, Guidance2, Most models achieve higher accuracy when actions are 1All data and scripts https://github.com/mrinaldi97/MACID/ 2https://github.com/guidance-ai/guidance are available at 3https://github.com/dottxt-ai/outlines 4Given this behavior, we excluded Gemma-2-9b from the summary tables reported in Appendix. (a) Multiple Choice task format (b) Binary choice task format expressed by diferent verbs: it is easier to discriminate Multiple choice task Among the small models, if two sentences express the same action if their lexical qwen2.5-7b reaches the best results, with a .568 accuracy description is diferent as well. When the verbs are equal, when instructed with the examples. However, diferently accuracy decreases. This diference is smoother when ex- from the binary task, the gap with the larger model (qwenamples are added in the prompts, and it increases with the 2.5-72b) is notable, with the latter performing very well short prompt. A notable exception is given by llama-3.1- among all conditions and reaching an accuracy of 0.737 8b-instruct, which achieves higher accuracy for actions in three of them (few-shot with medium, long, and no expressed by the same verbs rather than with diferent prompt). Even if it has been noted frequently that LLMs verbs (reaching a value of .933 in the long-zero format). do not perform well with multiple-choice tasks, in this When looking in more detail at its behavior, we note that challenge, they do better than in the binary choice one, this happens with the two most detailed prompts, and we considering the random baseline for each task (Table 2. hypothesize that it may be due to specification that there As noted for the binary task, providing a few examples is no one-to-one matching between action concepts and increases accuracy. Exceptions, however, are found for verbs included in these prompts. the short prompt: velvet-14b and aya-expanse-8b have a slightly higher accuracy with the zero-shot setting with respect to the few-shot. The zero/few shot setting also minerva-7b-instruct-v1.0 mistral-7b-instruct-v0.3 qwen2.5-7b-instruct aya-expanse-8b llama-3.1-8b-instruct gemma-2-9b velvet-14b qwen-2.5-72b-instruct BASELINE Model minerva-7b-instruct-v1.0 mistral-7b-instruct-v0.3 qwen2.5-7b-instruct aya-expanse-8b llama-3.1-8b-instruct gemma-2-9b velvet-14b qwen-2.5-72b-instruct BASELINE has an influence also in the ability of Minerva-7b to com- tive alignment between video and language latent spaces ply with the required output format: when provided with [ 10 ]. We conducted experiments with two state-of-theexamples, it follows the instructions, whereas it does not art video models: Qwen 2.5 VL 8B [ 11 ] and VideoLLama3 in the zero-shot prompt. 7B [ 12 ]. The models were executed on a local machine

Contrary to what we observed above for the binary using configurations recommended in the oficial doctask, the prompt type does not widely influence the umentation. Both Qwen and VideoLLama utilize Hugresults: accuracy values for most models (minerva-7b, ging Face’s "transformers" library, which includes the mistral-7b, qwen2.5-7b, llama-3.1-8b, qwen-2.5-72b) are necessary code for running these video models. Both equal among diferent prompts. models handle videos of arbitrary resolution sampled at

As for the binary task, the verb used to describe the user-defined framerates. To keep memory usage manintruder has an impact: if it is the same as (at least one ageable, we resized the original videos to 360x288 resoof) the other sentences, models’ performance drops, even lution. While this resolution is lower than the original if less strongly (Table 6 and 7 in Appendix B). ifles, often in FullHD (1920x1080) or PAL DVD (720x576) format, it remains perfectly intelligible to human view5.2. Results with visual LLMs ers, being comparable to VideoCD (352x288) and VHS tape quality (240 horizontal TV lines). The framerate was The MACID dataset includes all the original videos re- set to 8fps because we decided to avoid very low framferred to by the sentences. This setting enabled us to con- erates, given that video samples are brief (<4s) and conduct an exploratory experiment with multimodal models, sistently represent live action. Following the text-only particularly those capable of processing video inputs. experiments, we selected the best-performing prompt on At the time of writing, video models are in their early average and adapted it for video model testing. Specifidevelopmental stages. A great efort is going on to un- cally, we modified the medium prompt to accommodate derstand optimal methods for integrating video informa- the video experiment, substituting sentences with video tion into language models, as video data presents chal- clips. Due to memory constraints, we executed the exlenges for transformer architectures due to the quadratic periment exclusively on the binary task. Neither Qwen computational cost of self-attention over long sequences. VL nor VideoLLama successfully handled the task: both Moreover, diferent research groups are experimenting models always returned "No" for every tested video pair. with diferent architectural choices to ensure an efec- Interestingly, Qwen VL also provided brief video descriptions. We speculate that the poor performance of video framework (i.e., string matching) might appear at first models on this task relates to dificulties in coherently glance to be simplistic, lazy, and excessively punitive for processing temporal sequences and performing cross- the models. As we already mentioned in Section 5.1, we domain inferences between visual and textual features. could have used specific libraries to parse the responses Moreover, the prompt being written in Italian and the in search of the correct result, but the point is that, given presentation of two videos simultaneously, rather than these models’ reputation as “intelligent” (as promoted the single-video setting usually employed during pre- by the developers), one expects these models to be able training, further deviated the experimental conditions to follow very simple instructions, regardless of their from the training distribution, substantially increasing ability to efectively solve a task. Even in few-shot scetask complexity. Testing multimodal and, in particular, narios, where the requested answer format it is more than video models poses significant challenges, and we believe explicit, some models consistently fail in following the that the Macid task can become a useful task to assess instructions. Models with super-human abilities might the models’ abilities to correctly identify complex actions. not need to be hand-guided.

For this reason, we leave to future work a more extensive experimentation with video models, including prompt/- 6. Conclusions formulation modifications, testing new models, as well as trying fine-tuning operations.

This study evaluates LLMs on the action concepts discrim

ination task: we present the results for 7 LLMs evaluated 5.3. Discussion on the MACID dataset.

Results show a wide variation in models’ performances, Model Average error rate depending on the model type, the number of model paminerva-7b-instruct-v1.0 0.166 rameters, the prompt used, and the task format. mistral-7b-instruct-v0.3 0.0 Qwen-2.5-72b obtained the highest average accuracy qwen2.5-7b-instruct 0.0 both on the binary and the multiple-choice task, confirmaya-expanse-8b 0.306 ing that the number of parameters is a core factor in this llama-3.1-8b-instruct 0.0 type of semantically complex task. gemma-2-9b 0.867 Italian models (Minerva and Velvet) perform poorly in vqewlveent--21.45B-72b-instruct 00..00 both task formats. This is an unexpected result, considering the task requires fine-grained semantic abilities.

Table 3 Among 7B/8B models, top results are achieved by Average error rate for each model, grouped and averaged for Qwen-2.5, in multiple-choice format (acc. 0.568), and all tasks. Llama-3.1 in binary format (acc. 0.696). The latter obtains an accuracy comparable with Qwen-2.5-72b (0.725),

Table 3 reports the average values of unacceptable re- despite the diference in the number of parameters. sponses per model, in each task, i.e. responses where the On average, few-shot prompting works better than models did not adhere to the requested output format. zero-shot, both in binary and in multiple-choice task forAs already stated, beside the objective of testing the abil- mats. In general, we don’t find strong performance diferity of LLMs to interpret and discriminate descriptions ences among the three versions of the task description in of physical actions, we also want them to show their the prompt (SHORT, MEDIUM, and LONG), while there ability to follow the instructions given to them. One of is a consistent accuracy improvement with the few-shot the main problem we faced with our experiments is that prompting. Even the few-shot without task description responses from models tend to be overly verbose, as mod- (none_few) has a good accuracy on the top models. els need to explain their choices every time. While this Finally, the lexical components have a strong influence may be considered a useful and interesting behavior in on models’ behavior in this task: the accuracy varies a chat models, it is definitely not ideal in instruct models, lot if the two sentences use the same verb or diferent as those tested in our experiments. As it is specified in verbs (in the binary task) or if the intruder has the same all our prompts, we explicitly ask to answer with the id verb as the other sentences or not (in the multiple-choice of the intruder for the multiple-choice and with "sì" or task). The accuracy gap between these two cases is huge "no" (yes or no) for the binary-choice task (see Appendix with Qwen, which seems to be more sensitive to lexical A), thus the request is clear. Nevertheless, sometimes diferences than Llama. For example, Qwen-2.5-72b on a models tend to elude the requested response format (i.e., binary task reaches 0.975 accuracy with diferent verbs the answer does not start with a valid id number for the and 0.579 with the same verb. multiple-choice task, or it does not start with "sì/no" for Further experiments need to be done with video LLMs, the binary-choice task), while others apply absolutely un- which did not provide satisfactory results in this first necessary markup (e.g., aya-expanse-8b). Our evaluation experimentation.

A. Prompts

Prompts used for the experiments in binary and multiple-choice tasks.

Binary task

Zero-shot prompts

Three variants have been used, with increasing description details. scatola’ e ’spingere una scatola’) e, viceversa, un verbo può rappresentare più concetti azionali distinti (ad es. ’aprire una porta’ vs. ’aprire una noce’). Nell’individuare un concetto azionale, è importante concentrare l’attenzione su quali cambiamenti vengono compiuti dall’azione rappresentata, non sul verbo. Rispondi ’Sì’ se ritieni che entrambe le frasi si riferiscano allo stesso concetto azionale, rispondi ’No’ se ritieni che descrivano due concetti azionali diversi.

Few-shot prompts

Few-shot prompts are created by appending 4 examples to the three variants of zero-shot prompts; additionally, a fourth prompt with only examples and no description is provided. The following examples have been used.

1) I ragazzi spingono i carrelli lungo il binario del treno 2) La donna con gli occhiali da sole spinge l’anziana signora sulla sedia a rotelle

Risposta: Sì 1) L’uomo spinge una carriola nel cortile della fattoria mentre parla con la donna 2) Il veterinario spinge lo stantufo della siringa Risposta: No 1) La donna preme sul posacenere al centro del tavolo 2) Il ragazzo spinge le scope nel ripostiglio Risposta: No 1) La donna sposta leggermente la tenda di perline 2) La donna spinge in alto il pannello di vetro

Risposta: Sì Multiple-choiche task Zero-shot prompts 1. In questo task ti verranno proposte quattro frasi che descrivono azioni fisiche. Tre di queste azioni sono dello stesso tipo, mentre una è di un tipo diverso. Individua la frase che descrive l’azione di tipo diverso. Esiste solo una risposta esatta, rispondi utilizzando esclusivamente il numero di riferimento della frase e nient’altro. 2. In questo task ti verranno proposte quattro frasi che descrivono azioni fisiche. Tre di queste azioni sono dello stesso tipo, ovvero rappresentano lo stesso concetto azionale, mentre una è di un tipo diverso. Un concetto azionale è un’entità linguistico-cognitiva corrispondente a un pattern di modifiche del mondo compiute da un agente, ed è generalizzabile a vari oggetti (o azioni). Un concetto azionale può essere realizzato linguisticamente con più verbi e, viceversa, un verbo può rappresentare più concetti azionali distinti. Tra le seguenti quattro frasi, individua la frase che descrive l’azione di tipo diverso dalle altre tre. Esiste solo una risposta esatta, rispondi utilizzando esclusivamente il numero di riferimento della frase e nient’altro. 3. In questo task ti verranno proposte quattro frasi che descrivono azioni fisiche. Tre di queste azioni sono dello stesso tipo, ovvero rappresentano lo stesso concetto azionale, mentre una è di un tipo diverso. Un concetto azionale è un’entità linguistico-cognitiva corrispondente a un pattern di modifiche del mondo compiute da un agente (umano, animale o macchina), ed è generalizzabile a vari oggetti (o azioni). Si tratta di una rappresentazione cognitiva di un evento o di un processo che coinvolge, prototipicamente, un agente (chi compie l’azione), un tema o paziente (sul quale si esercita l’azione) e, talvolta, uno strumento, un destinatario o una destinazione. Un concetto azionale è produttivo, ovvero può applicarsi a un’ampia varietà di oggetti e si presenta in contesti diversi. L’associazione tra concetto azionale e verbo che lo descrive non è un rapporto di tipo uno-a-uno. Infatti, un concetto azionale può essere realizzato linguisticamente con più verbi (ad es. ’spostare una scatola’ e ’spingere una scatola’) e, viceversa, un verbo può rappresentare più concetti azionali distinti (ad es. ’aprire una porta’ vs. ’aprire una noce’). Nell’individuare un concetto azionale, è importante concentrare l’attenzione su quali cambiamenti vengono compiuti dall’azione rappresentata, non sul verbo. Tra le seguenti quattro frasi, individua la frase che descrive l’azione di tipo diverso dalle altre tre. Esiste solo una risposta esatta, rispondi utilizzando esclusivamente il numero di riferimento della frase e nient’altro.

Few-shot prompts

Few-shot prompts are created by appending 4 examples to the three variants of zero-shot prompts; additionally, a fourth prompt with only examples and no description is provided.

1) I ragazzi spingono i carrelli lungo il binario del treno 2) La donna con gli occhiali da sole spinge l’anziana signora sulla sedia a rotelle

3) L’uomo spinge una carriola nel cortile della fattoria mentre parla con la donna 4) Il veterinario spinge lo stantufo della siringa Intruso: 4 1) Il ragazzo si tira su in ginocchio 2) L’uomo si spinge sulle braccia per alzarsi in piedi 3) Il ragazzo ferito si spinge sui gomiti 4) L’operatore spinge in basso la leva dell’ascensore Intruso: 4 1) La donna spinge l’uomo sul letto per farlo sdraiare 2) Il veterinario spinge lo stantufo della siringa 3) L’uomo armato sposta il compagno dietro di lui 4) Il marinaio sposta i corpi galleggianti con le mani Intruso: 2 1) La donna sposta leggermente la tenda di perline 2) La ragazza abbassa la mano del ragazzo con la pistola 3) La donna spinge in alto il pannello di vetro 4) La donna preme un pulsante del suo orologio Intruso: 4 1) La donna preme sul posacenere al centro del tavolo 2) Il ragazzo spinge le scope nel ripostiglio 3) Il ragazzo spinge il pulsante di rilascio della cintura di sicurezza 4) L’uomo di scatto chiama l’ascensore Intruso: 2

B. Complete results

Model Model Model short zero 0.000 0.126 0.105 0.151 0.604 0.298 0.004 0.158 short zero 1.000 0.986 0.972 0.965 0.716 0.846 1.000 0.982

Accuracy values for quadruples where the intruder is expressed by a different verb Declaration on Generative AI During the preparation of this work, the author(s) did not use any generative AI tools or services.

[1]

Arbib , G. Rizzolatti, Neural expectations: A possible evolutionary path from manual skills to language , Communication and Cognition 29 ( 1996 ) 393 - 424 .

[2]

A. A.

Ravelli ,

Varvara , L. Gregori, MACID - multimodal ACtion IDentification: A CALAMITA challenge , in: F. Dell'Orletta , A.

Lenci , S.

Montemagni , R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiCit 2024 ), CEUR Workshop Proceedings, Pisa, Italy, 2024 , pp. 1234 - 1238 . URL: https://aclanthology.org/ 2024 .clicit- 1 .137/.

[3]

Moneglia ,

S. W.

Brown ,

Frontini ,

Gagliardi ,

Khan ,

Monachini ,

Panunzi , et al., The imagact visual ontology. an extendable multilingual infrastructure for the representation of lexical encoding of action , in: Proceedings of the Ninth International Conference on Language Resources and Evaluation-LREC'14 , European Language Resources Association (ELRA), 2014 , pp. 3425 - 3432 .

[4]

Attanasio ,

Basile ,

Borazio ,

Croce ,

Francis ,

Gili , E. Musacchio,

Nissim ,

Patti ,

Rinaldi , D. Scalena, CALAMITA - Challenge the Abilities of LAnguage Models in ITAlian: Overview , in : Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024 ), 2024 .

[5]

Rohrbach ,

Torabi ,

Rohrbach ,

Tandon ,

Pal ,

Larochelle ,

Courville ,

Schiele , Movie description, International Journal of Computer Vision 123 ( 2017 ) 94 - 120 .

[6]

Rohrbach ,

Tandon ,

Schiele , A dataset for movie description , in: Proceedings of the IEEE conference on computer vision and pattern recognition , 2015 , pp. 3202 - 3212 .

[7]

Torabi ,

Pal ,

Larochelle ,

Courville , Using descriptive video services to create a large data source for video annotation research , arXiv preprint arXiv:1503.01070 ( 2015 ).

[8]

A. A.

Ravelli , Annotation of linguistically derived action concepts in computer vision datasets , Ph.D. thesis , University of Florence, 2020 .

[9]

Gagliardi , Rappresentazione dei concetti azionali attraverso prototipi e accordo nella categorizzazione dei verbi generali. una validazione statistica , in: Proceedings of the First Italian Conference on Computational Linguistics-CLiC-it , 2014 , pp. 180 - 185 .

[10] K. Y. Y. Nakamizo , Act-ChatGPT: Introducing Action Features into Multi-modal Large Language Models for Video Understanding, Pattern Recognition(ICPR 2024 ) ( 2024 ).

[11]

Bai ,

Chen ,

Liu ,

Wang ,

Ge ,

Song ,

Dang ,

Wang ,

Tang ,

Zhong ,

Zhu ,

Yang ,

Li ,

Wan ,

Wang ,

Ding ,

Fu ,

Xu ,

Ye ,

Zhang ,

Xie , Z. Cheng, H. Zhang,

Yang ,

Xu ,

Lin , Qwen2 .5-vl technical report , 2025 . URL: https://arxiv.org/abs/2502. 13923. arXiv: 2502 . 13923 .

[12]

Zhang ,

Li ,

Cheng ,

Hu ,

Yuan , G. Chen,

Leng ,

Jiang ,

Zhang ,

Li ,

Jin ,

Zhang ,

Wang ,

Bing ,

Zhao , Videollama 3: Frontier multimodal foundation models for image and video understanding , 2025 . URL: https://arxiv.org/ abs/2501.13106. arXiv: 2501 . 13106 .