<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MAIA: a Benchmark for Multimodal AI Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Testa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Bonetta</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafaella Bernardi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fondazione Bruno Kessler (FBK)</institution>
          ,
          <addr-line>Trento</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>Bolzano</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Istituto di Linguistica Computazionale "A. Zampolli" (CNR-ILC), ItaliaNLP Lab</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Università di Roma La Sapienza</institution>
          ,
          <addr-line>Roma</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We introduce MAIA (Multimodal AI Assessment), a multimodal dataset developed as a core component of a competence-oriented benchmark designed for fine-grained investigation of the reasoning abilities of Visual Language Models (VLMs) on videos. The MAIA benchmark is characterized by several distinctive features. First, to the best of our knowledge, MAIA is the first Italian-native benchmark addressing video understanding: videos were carefully selected to reflect Italian culture, and the language data (i.e., questions and reference answers) were produced by native Italian speakers. Second, MAIA explicitly includes twelve reasoning categories that are specifically designed to assess the reasoning abilities of VLMs on videos. Third, we structured the dataset to support two aligned tasks (i.e., statement verification and open-ended visual question answering) built on the same datapoints, thus allowing us to assess VLM coherence across task formats. Finally, MAIA integrates, by design, state-of-the-art LLMs in the development process of the benchmark, taking advantage of their linguistic and reasoning capabilities both for data augmentation and for assessing and improving the overall quality of the data. In this paper we focus on the design principles and the data collection methodology, highlighting how MAIA provides a significant advancement with respect to other available datasets for VLM benchmarking. Data are available on GitHub.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodality</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Vision-Language Models</kwd>
        <kwd>Multimodal Reasoning</kwd>
        <kwd>Language Resources</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, mainly following the success of large language models (LLMs), there has been growing interest in large pre-trained models able to manage both texts and images. Such Vision and Language Models (VLMs) have been investigated both from a theoretical perspective (e.g., Baroni [<xref ref-type="bibr" rid="ref10">1</xref>]) and for their application-oriented interest (e.g., Bigham et al. [2]). Today, there are dozens of available VLMs, and the most popular families of generative AI models (e.g., Llama, Gemma, Qwen, GPT) include several VLMs, which can address a number of question answering tasks on both images and videos. As a consequence of the fast and increasing power of VLMs, assessing their performance on standardized tasks and metrics is becoming more and more challenging.</p>
      <p>First of all, evaluating VLM understanding in real-world scenarios requires moving beyond single-frame settings. Unlike static images, videos offer a rich temporal structure: they capture dynamic scenes, evolving actions, interactions, and causal dependencies that unfold over time, making them one of the most faithful and closest approximations to real-world complex scenarios. In this context, the role of evaluation becomes critical: to truly assess a model's ability to understand, reason, and ground meaning across modalities, we need benchmarks that do not merely test task performance, but probe the underlying competences of the model [3].</p>
      <p>With this purpose in mind, we introduce MAIA (Multimodal AI Assessment), a multimodal dataset developed as a core component of a broader competence-oriented evaluation framework for VLMs. MAIA is designed to challenge models on multimodal reasoning grounded in real-world scenarios from different linguistic perspectives. To the best of our knowledge, it is the first native Italian evaluation dataset of its kind based on video content. MAIA provides a linguistically rich and semantically diverse resource for exploring vision and language understanding in realistic contexts, with a particular focus on Italian culture, by covering distinct reasoning categories, each targeting specific semantic phenomena. This structure allows for a fine-grained analysis of the contribution of both the language and the visual modality across different types of reasoning. A key feature of MAIA is its cascading data collection approach, which enables the same source data to be reused across multiple task formats (e.g., generative tasks, classification tasks, etc.), supporting fully comparable evaluations and paving the way for an all-in-one benchmarking strategy. The efficacy of this approach and of the MAIA benchmark as a severe and robust evaluation framework has been demonstrated in Testa et al. [4], in which we evaluate models against a classification and a generative task, namely visual statement verification and open-ended question answering. While the second task turns out to be more challenging even for the best-performing models, models also exhibit significant inconsistencies both within and across the two tasks, with some categories relying more heavily on either the visual or the linguistic component to solve the task. In this paper, however, we focus on how the dataset was collected. Finally, an additional innovative aspect of the MAIA data creation pipeline lies in the integration of human annotation with targeted data augmentation using powerful LLMs (GPT-4o [5]), combined with a multi-stage semi-automatic validation process conducted with the same model at different levels. This dual use of a generative model (i.e., GPT-4o) not only enhances the diversity and coverage of the dataset but also ensures high-quality and semantically consistent data throughout the pipeline.</p>
      <p>The paper is organized as follows. Section 2 reviews the most relevant prior work in the research area. In Section 3, we detail the design choices behind the creation of the dataset and, more broadly, the development of the entire MAIA benchmark. Finally, Sections 4 and 5 describe the specific steps followed for dataset construction: the former focuses on the selection and collection of video material, while the latter addresses the collection and validation of all the linguistic data that constitute MAIA. Both sections are complemented by dedicated analyses of the collected data.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>Multimodal datasets combining vision and language have played a crucial role in the development and evaluation of VLMs. Early image-based resources such as the VQA [6], GQA [7], DVD [8], and HL [9] datasets have provided controlled environments to assess visual reasoning and natural language understanding through several tasks, like image captioning or Visual Question Answering, thereby reinforcing the role of vision as a fundamental component in the evaluation of multimodal models [6]. Over time, contributions of this kind have been instrumental in shaping the foundations of multimodal evaluation, where language understanding is assessed in conjunction with perceptual grounding. Simultaneously, these efforts have revealed critical weaknesses in early multimodal architectures, highlighting their reliance on dataset biases or shallow heuristics rather than genuine visual reasoning [<xref ref-type="bibr" rid="ref13 ref19">10, 11</xref>]. Such challenges have later been framed within the broader phenomenon of Unimodal Collapse, where a VLM disproportionately depends on its language component, resulting in text-only models performing comparably to their multimodal counterparts [12]. In contrast to earlier stages [13, 14, 15], the growing awareness of these issues has prompted the emergence of diagnostic evaluation frameworks, such as those in Parcalabescu et al. [12], Thrush et al. [16], Chen et al. [17], and Bianchi et al. [18], and of carefully curated benchmarks, such as those in Xiao et al. [19] and Tong et al. [20], designed to expose the true capabilities and limitations of VLMs. These methodological insights strongly motivate the design of MAIA as a robust, controlled multimodal dataset, aimed at ensuring that models genuinely integrate both linguistic and visual information, rather than relying solely on the priors embedded in their language backbones.</p>
      <p>Building on this tradition, video-language datasets have lately extended the challenge to temporal understanding and dynamic scene interpretation, both essential components for complex real-world understanding. Several resources, including the TVQA [21] and HowToVQA [22] datasets or the AGQA [23] and MVBench [24] benchmarks, shifted their focus from static perception to actions and entities, challenging VLMs to identify the relationships between them. As in the case of image-based evaluation, early surveys have already stressed the need for careful and systematic assessment (Zhong et al. [25]). While task-oriented benchmarks often report strong performance [26, 27], more fine-grained evaluations have revealed critical limitations [28], and competence-based analyses continue to highlight the substantial gap in the video understanding capabilities of VLMs [29]. In this context, MAIA contributes as a new video-language dataset aimed at evaluating VLMs not only on videos featuring temporal dynamics and meaningful content, but also through a competence-oriented design that explores the interplay between language and vision, a dimension largely neglected in prior Video QA benchmarks.</p>
      <p>Italian Multimodal Datasets. Most multimodal datasets are available in English, with only limited multilingual or other native-language resources, and Italian is consistently underrepresented. In the image domain, the GQA-it dataset [30] is a notable attempt to adapt a visual question answering dataset into Italian. More recent benchmarks like XGQA [31] and EXAMS-V [32] include translated Italian multiple-choice questions, but lack original content and do not target high-level reasoning. MAIA fills this gap as the first Italian-native video-language dataset specifically designed to assess complex visual reasoning and grounding.</p>
      <p>[Figure 1: Overview of the MAIA benchmark. Q&amp;A pairs (1 question : 8 answers) are expanded into True Statements (TS) and paired True-False Statements (TS-FS), which feed the Statement Verification task and the Open-ended Q&amp;A task used to evaluate VLMs.]</p>
    </sec>
    <sec id="sec-2">
      <title>3. MAIA: Benchmark Design</title>
      <p>This section presents the design principles, structure, and construction pipeline of both the MAIA dataset and the benchmark built upon it. In line with this, Figure 1 illustrates the overall workflow adopted for dataset creation, embedding it within the broader architectural framework of the benchmark, which also includes the downstream tasks the data are designed to support.</p>
      <p>As shown, the dataset creation begins with the collection of short videos, each associated with twelve high-level reasoning categories. These categories reflect different semantic phenomena and were chosen to ensure a rich and controlled testing environment for visual and linguistic reasoning. Based on these categories, we constructed our multimodal dataset by first collecting a set of questions that served as the conceptual backbone for the creation of the linguistic data, both manually collected (i.e., a set of answers) and automatically generated (i.e., True and False statements), as described in detail in Section 5. Figure 2 illustrates an example of a MAIA item (although all source data are in Italian, examples are presented in English to enhance readability) and highlights the cascading logic behind the data creation process. This architecture supports the development of two aligned evaluation tasks: a Visual Statement Verification task, using paired true/false statements to assess the model's ability to distinguish accurate from misleading content in a multiple-choice format, and an open-ended Visual Question Answering task, where each question is matched with eight different human answers serving as a reference set to evaluate the quality of the response generated by the VLM. Each task tests different aspects of visual understanding and reasoning, all grounded in the same set of videos and categories.</p>
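      <p>To make this cascading structure concrete, the sketch below shows one possible way a single MAIA datapoint could be represented in code; the field names and the container itself are illustrative assumptions, not the released data schema.</p>
      <preformat>
from dataclasses import dataclass, field

@dataclass
class MAIAItem:
    """Illustrative container for one MAIA datapoint (field names are assumptions)."""
    video_id: str                       # one of the 100 short videos
    category: str                       # one of the 12 reasoning categories
    question: str                       # open-ended question in Italian
    answers: list[str] = field(default_factory=list)           # 8 human reference answers
    true_statements: list[str] = field(default_factory=list)   # 8 TSs derived from the question-answer pairs
    false_statements: list[str] = field(default_factory=list)  # 8 FSs, each minimally edited from a TS

# The same item feeds both tasks: TS/FS pairs for statement verification,
# and the question plus its 8 reference answers for open-ended QA.
      </preformat>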
      <p>Table 1 presents the structure of the MAIA dataset
after the data creation and validation process.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Structure of the MAIA dataset after the data creation and validation process.</p></caption>
        <table>
          <thead>
            <tr><th>Feature</th><th>n</th></tr>
          </thead>
          <tbody>
            <tr><td>Videos</td><td>100</td></tr>
            <tr><td>Semantic Categories</td><td>12</td></tr>
            <tr><td>Questions (Q)</td><td>2,400</td></tr>
            <tr><td>Answers (A)</td><td>19,200</td></tr>
            <tr><td>True Statements (TS)</td><td>19,200</td></tr>
            <tr><td>False Statements (FS)</td><td>19,200</td></tr>
          </tbody>
        </table>
      </table-wrap>
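      <p>The counts in Table 1 follow directly from the collection design described in Section 5 (100 videos, 12 categories, 2 questions per category per video, 8 answers per question); the short check below simply restates that arithmetic.</p>
      <preformat>
videos, categories, questions_per_category = 100, 12, 2
answers_per_question = 8

questions = videos * categories * questions_per_category   # 2,400 questions
answers = questions * answers_per_question                  # 19,200 answers
assert (questions, answers) == (2400, 19200)
# True and False Statements mirror the answers one-to-one: 19,200 each.
      </preformat>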
      <sec id="sec-2-1">
        <title>3.1. Reasoning Categories</title>
        <p>[Figure 2: Example of a MAIA item for a pizza-making video. For each reasoning category (Causal, Counterfactual, Implicit, Uncertainty, Out-of-scope, Planning, Sentiment, Spatial, Temporal, with Partial/Total/Duration subtypes where applicable), the figure reports a question, a human answer, a true statement, and a false statement; for example, for the Causal category: question "Why is mozzarella melted?", answer "The heat from the wood oven has melted it", TS "Mozzarella is melted by the heat of the wood oven", FS "Mozzarella is melted by the heat generated by the sun."]</p>
        <p>We defined 12 reasoning categories as the outcome of two pilot studies conducted with a group of expert volunteer annotators. These pilots aimed to identify the optimal number, type, and specificity of the categories needed to effectively probe the cognitive and linguistic abilities of VLMs on our videos. Based on the feedback received, some initially proposed categories were merged due to content overlap or redundancy. Conversely, other categories were added to enhance the granularity of reasoning assessment (e.g., we introduced a Planning category, as we consider it a meaningful expression of reasoning skills). These refinements allowed us to design a more robust and informative framework to explore the interplay between language and vision in multimodal processing.</p>
        <p>The following paragraphs introduce the final macro-categories, including their definitions and any associated sub-categories.</p>
        <p>Causal focuses on reasoning about the causes or effects of events depicted in the video. It includes two subtypes, namely Implicit and Explicit (unlike the following cases, these are not treated as distinct sub-categories but as two equally represented subtypes of the same category), offering a comprehensive test of a model's ability to describe causality within events. The former involves inferring unobservable causes from visible effects in the scene, requiring logical reasoning beyond what is directly shown. The latter concerns clearly observable cause-and-effect dynamics, where either the cause or the effect is directly identifiable from the video content.</p>
        <p>Counterfactual focuses on questions about hypothetical scenarios that do not actually occur in the video but could take place under specific conditions. These questions are based on entities or events visible in the video and explore the consequences of an event or situation that might happen in the video if a certain condition were met. This category tests the ability of a model to reason about hypothetical scenarios grounded in the context of the video while deriving logical and plausible outcomes from such scenarios.</p>
        <p>Implicit investigates entities, events, or their attributes that are not explicitly visible in the video while their presence or properties can be reasonably inferred from the context. It evaluates the ability of a model to infer implicit details based on context, whether the target information was never shown or was previously visible but later obscured.</p>
        <p>Total Implicit: involves entities or events that are never directly visible in the video but can be inferred from observable details. A typical answer provides the requested information based on logical inference.</p>
        <p>Partial Implicit: involves entities or events that were visible earlier in the video but are no longer visible due to a shift in the scene or because they have moved out of the frame.</p>
        <p>Out-of-scope refers to entities or events entirely absent from the video, focusing on properties or details of these non-existent elements. Typical answers to this question type involve a negation, signaling that the referenced entity or event is not present in the scene. This category indirectly tests the ability of a model to detect multimodal hallucinations and an assertiveness tendency in its responses.</p>
        <p>Planning asks for the actions needed to achieve a specific goal related to the video. The typical response to a planning question is a sequence of actions that someone should perform in order to reach the desired outcome. This category assesses the ability of the model to infer and plan the necessary steps to accomplish a goal based on the visual cues provided in the video.</p>
        <p>Sentiment assesses the sentiment, mood, attitude, or emotion displayed by characters in the video toward other entities or events in the scene, throughout the entire video. A typical response to a sentiment question may describe a specific sentiment, attitude, or emotion, or it may reflect a neutral stance. This category evaluates the ability of the model to recognize and identify the emotional state or attitude of characters based on visual cues.</p>
        <p>Spatial investigates the spatial relationships between entities, objects, or events depicted in the video. It aims at assessing the model's ability to infer both stable and time-dependent spatial relationships, as well as the ability to determine relative positioning in space and to rely on grounding competencies.</p>
        <p>Total Spatial: focuses on the position of entities in space (including their relation to other entities) that remains constant throughout the whole video, disregarding any temporal variations or minimal movements of the entity at different moments in the video. A typical response to this type of question provides general spatial information valid for the entire duration of the video.</p>
        <p>Partial Spatial: focuses on time-related positions of entities in space, taking into account events occurring in the scene. A typical answer to this question provides spatial information that is valid only for the requested time range in the video.</p>
        <p>Temporal focuses on temporal information and studies the ability of a model to infer temporal relationships, sequences of events, and durations from visual content in a coherent manner.</p>
        <p>Partial Temporal: focuses on the temporal properties and relationships between events in the video, excluding their duration. Questions target aspects such as when something happens or whether it occurs before or after another event. Typical answers specify the event along with the requested temporal detail.</p>
        <p>Duration Temporal: focuses on a specific property of events in the video: their duration. A typical answer to a question of this kind can express the duration of the event in several ways.</p>
        <p>Uncertainty refers to entities or events present in the video but lacking sufficient information to answer the question precisely. Questions are inherently ambiguous, as the visual content does not fully support a definitive response. Answers may offer plausible options, acknowledge uncertainty, or signal that the reply is a guess. This category tests a VLM in handling ambiguity and incomplete evidence, and in assessing its tendency to respond assertively.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Curated Video Dataset</title>
      <sec id="sec-3-a">
        <title>4.1. Video Selection</title>
        <p>A key design choice for the MAIA benchmark was to reflect Italian culture in real-world scenarios through a carefully curated selection of video clips. To ensure richness and variety, the selection process was based on the following thematic areas: Locations, Food, Sport, Job, Nature, Activities. These topics allowed us to collect a dataset showing iconic Italian cities and locations, daily activities (e.g., enjoying breakfast at a café, cooking pasta, attending a soccer match), and typical events (e.g., Italian local festivals or weddings). This cultural focus was not intended to limit the generalizability of the benchmark, but rather to offer a valuable opportunity to assess model performance on culturally grounded data, an aspect often underrepresented in existing multimodal resources.</p>
      </sec>
      <sec id="sec-3-b">
        <title>4.2. Video Collection</title>
        <p>We collected a culturally representative set of 100 short videos (~30 seconds each) sourced from YouTube Italy. Following the criteria described in Section 4.1, videos were retrieved using keyword-based queries across the selected thematic areas. Only Creative Commons licensed content was included to ensure reproducibility. When necessary, longer videos were manually checked and cut to extract the most relevant 30-second segments, resulting in a uniform and culturally grounded video set.</p>
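        <p>Where longer source videos had to be trimmed, the extraction of a 30-second segment can be reproduced with a standard ffmpeg call; the Python wrapper below is only a sketch under the assumption that ffmpeg is installed, with an illustrative start offset, and does not describe the exact tooling used.</p>
        <preformat>
import subprocess

def cut_segment(src: str, dst: str, start: str = "00:00:00", duration: int = 30) -> None:
    """Extract a ~30-second clip starting at `start` (hh:mm:ss) using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", start, "-i", src, "-t", str(duration), "-c", "copy", dst],
        check=True,
    )

# cut_segment("full_video.mp4", "clip_30s.mp4", start="00:01:05")
        </preformat>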
      </sec>
      <sec id="sec-3-c">
        <title>4.3. Analysis of Videos</title>
        <p>To better understand the visual content present in the MAIA benchmark, we conducted an object detection and classification analysis over the full set of videos using a YOLOv11 detection pipeline (https://docs.ultralytics.com/it/models/yolo11/). For each video, we sampled 32 uniformly spaced frames and ran object detection on them. This analysis provides a high-level view of the typical object types in MAIA.</p>
        <p>Figure 3 shows the frequency distribution of detected object labels across all annotated frames. Person is by far the most common object class, reflecting the human-centered nature of most videos in the benchmark. However, the dataset also includes a wide variety of everyday objects, suggesting a rich and diverse set of visual elements.</p>
        <p>Figure 4 shows the distribution of the number of detected objects per frame. Most frames contain a moderate number of objects, typically between two and six. This indicates that the videos offer a balance between visual simplicity and complexity, making them suitable for testing both low-level perception and high-level reasoning in VLMs.</p>
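        <p>A minimal sketch of this frame-sampling and detection step is shown below, assuming the ultralytics and opencv-python packages and a YOLOv11 checkpoint (the file name yolo11n.pt is illustrative); it is not the exact pipeline used for MAIA.</p>
        <preformat>
from collections import Counter

import cv2
import numpy as np
from ultralytics import YOLO

def detect_objects(video_path: str, n_frames: int = 32) -> Counter:
    """Sample n_frames uniformly spaced frames and count the detected object labels."""
    model = YOLO("yolo11n.pt")  # hypothetical checkpoint name
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    labels = Counter()
    for idx in np.linspace(0, total - 1, n_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        result = model(frame, verbose=False)[0]
        labels.update(result.names[int(c)] for c in result.boxes.cls)
    cap.release()
    return labels

# Aggregating label counts over the whole video set gives a distribution like Figure 3:
# counts = sum((detect_objects(p) for p in video_paths), Counter())
        </preformat>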
      <sec id="sec-3-1">
        <title>ANSWER 1</title>
      </sec>
      <sec id="sec-3-2">
        <title>ANSWER 2</title>
      </sec>
      <sec id="sec-3-3">
        <title>ANSWER 3</title>
      </sec>
      <sec id="sec-3-4">
        <title>ANSWER 4</title>
      </sec>
      <sec id="sec-3-5">
        <title>ANSWER 5</title>
      </sec>
      <sec id="sec-3-6">
        <title>ANSWER 6</title>
      </sec>
      <sec id="sec-3-7">
        <title>ANSWER 7</title>
      </sec>
      <sec id="sec-3-8">
        <title>ANSWER 8</title>
        <p>What role do the men in white shirts play?
Che ruolo svolgono gli uomini con le maglie bianche?</p>
        <p>The men in white shirts are the competition judges
Gli uomini con le maglie bianche sono i giudici di gara</p>
        <p>They observe who scores a point</p>
        <p>Osservano chi fa punto
Men in white give judgements on the competition</p>
        <p>Gli uomini in bianco danno giudizi sulla gara</p>
        <p>They seem to be the referees of this bocce game</p>
        <p>Sembra che siano gli arbitri di questa partita a bocce
They measure the distance of the thrown ball from the little one and determine the winner of the set</p>
        <p>Misurano la distanza della boccia tirata dal boccino e decretano il vincitore del set</p>
        <p>The men in white shirts are the referees of the match
Gli uomini con le maglie bianche sono gli arbitri dellla partita</p>
        <p>The men in white are the jury</p>
        <p>Gli uomini in bianco sono i giudici</p>
        <p>Men in white shirts play the role of refereeing the match</p>
        <p>Gli uomini con le maglie bianche svolgono il compito di arbitrare la partita
and/or events involved in both of them. Each provided
form contained both the definition of the assigned
semantic category with examples, and also general rules to be
followed (see Appendix, Figure 8 for an example of the
form used). Each question had to be generated naturally
and as an open-ended question. Questions involving a
‘Yes/No’ answer (e.g. Is there a car in the video?) were not
allowed. Finally, for the correct execution of the task, the
audio of the video had to be ignored, as the VLMs to be
tested could only work on the visual part. Subsequently,
questions were manually reviewed to ensure quality and
category alignment.
5.2. Answers Collection
Italian as their first language, and had spent the majority
of their first 18 years of life in Italy. As with the
question collection step, we used Google Forms to provide the
task6. Each form included 10 videos, and for each video,
the annotators were asked to answer 12 questions, one
per reasoning category (see Appendix, Figure 9 for an
example of the form used). Annotators were encouraged
to use their own world knowledge when interpreting the
visual content of the video.</p>
        <p>To guarantee high quality of the collected answers,
we employed rigid control mechanisms based on sanity
check questions. Answers were accepted only if the
annotators correctly answered at least 90% of these control
questions, otherwise their submissions were rejected and
the task was reassigned to another annotator. In total,
2, 400 questions were paired with 8 answers each,
resulting in 19, 200 responses. They were then further checked
by a semi-automated two-step validation process based
on GPT-4o with few-shot prompting:
The goal of this phase was to collect 8 diferent answers
for each question to ensure not only accuracy but also
variability in responses. This choice is also supported by
ifndings from Mañas et al. [33], who empirically show
that using up to 8 demonstrations provides an efective
trade-of between diversity, accuracy, and computational Semantic Consistency Check. Each response was
eficiency in in-context learning with LLMs for VQA eval- evaluated for semantic consistency with the
corresponduation. We used the Prolific platform5 and selected an- ing question. In cases where inconsistencies were
denotators aged 25 to 80 who were born in Italy, spoke tected, the answers were manually reviewed to assess
5https://www.prolific.com
6Annotators were paid £7 per hour for answering questions</p>
        <p>Semantic Consistency Check. Each response was evaluated for semantic consistency with the corresponding question. In cases where inconsistencies were detected, the answers were manually reviewed to assess whether the question should be re-answered by another annotator or the responses could still be accepted. Inconsistent answers turned out to be minimal (about 100 out of 19,200 responses).</p>
        <p>Contradiction Test. We checked whether, within each pool of 8 responses to the same question, any of the responses contradicted the others. We found that 90.25% of the 8-answer pools exhibit full agreement, as they do not contain any contradictions. The remaining 9.75% (234 cases) were manually reviewed by an additional annotator to resolve inconsistencies.</p>
        <p>[Figure: Prompt used for the NLI-based contradiction check: "Your task is to determine the natural language inference (NLI) relationship between S1 and S2. The possible labels are: Entailment: S2 logically follows from S1. Contradiction: S2 contradicts S1. Neutral: S2 and S1 are related but do not entail or contradict each other. Provide only one label as output (Entailment, Contradiction, or Neutral)."]</p>
        <p>A post-processing phase of the responses was then implemented to ensure a sufficient degree of variability and reduce potential redundancy within each of the 2,400 pools of 8 answers (see Section 5.6). Figure 5 shows an example of one 8-answer pool associated with a video and a question, after the refinement procedure described above.</p>
        <p>[Figure 5: Example of an 8-answer pool for the question "What role do the men in white shirts play?" ("Che ruolo svolgono gli uomini con le maglie bianche?"), with eight different human answers describing the men in white shirts as the judges/referees of a bocce game.]</p>
        <p>[Figure: Prompt used to judge a candidate answer against the reference pool: "Given a question (Q), a candidate answer (A), and a set of 8 reference answers (R1–R8), your task is to determine whether A is correct. A is considered correct if it aligns with at least one of the reference answers. Return only one label as output: 'Correct' or 'Incorrect'."]</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. True Statement Generation</title>
        <p>At this step we automatically generate a true statement (TS) for each question-answer pair collected in the previous phases. A TS consists of a descriptive declarative sentence aligned with the visual content of the video. For example, if a video shows a boy who is initially in a kitchen and who hears a loud noise and runs away, a TS for the Spatial category could be: In the video, the boy is in the kitchen before running away.</p>
        <p>[Figure 6A: Prompt used for TS generation: "Given an Italian question Q and an answer A concerning a video, you must create a statement S based on A. While generating S, try not to alter the words composing A. If A includes first-person verbs or phrases (e.g., 'I think,' 'I believe'), rephrase S to be impersonal, avoiding a first-person perspective. The statement should be a concise, declarative sentence."]</p>
        <p>To create the TSs we used GPT-4o, with the prompt in Figure 6A, leveraging the combination of each question and its answer to automatically generate 19,200 true statements. As with the answers, the TSs are organised into 2,400 pools of 8 items, each expressing the same event with different wording. Following the same procedure used for the pools of 8 responses, we performed a quality check to ensure lexical variability within the 2,400 pools of true statements (see Section 5.6).</p>
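        <p>As a rough illustration of this step, the sketch below wraps the Figure 6A prompt in a GPT-4o call through the OpenAI Python client; the model name, temperature, and helper function are assumptions rather than the exact generation script.</p>
        <preformat>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TS_PROMPT = (
    "Given an Italian question Q and an answer A concerning a video, you must create a "
    "statement S based on A. While generating S, try not to alter the words composing A. "
    "If A includes first-person verbs or phrases (e.g., 'I think,' 'I believe'), rephrase S "
    "to be impersonal, avoiding a first-person perspective. "
    "The statement should be a concise, declarative sentence."
)

def generate_true_statement(question: str, answer: str) -> str:
    """Turn one question-answer pair into a true statement (TS); illustrative only."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": TS_PROMPT},
            {"role": "user", "content": f"Q: {question}\nA: {answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
        </preformat>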
      </sec>
      <sec id="sec-5-4">
        <title>5.4. False Statement Generation</title>
        <p>The goal of this phase is to create a false statement (FS) for each TS already collected, in order to form a minimal TS-FS pair, enabling controlled experiments and a precise analysis of a model's behavior with respect to the reasoning categories. As for the TSs, the FSs were automatically generated using GPT-4o, editing only the elements of the sentence related to the relevant semantic category, an approach inspired by the caption-foil method [14]. Figure 6B shows one of the prompts used for FS generation (due to space constraints, we could not include all the 12 prompts used for generating FSs specific to each reasoning category; however, the prompt shown is representative of the adopted methodology). For instance, taking into account the previous example in Section 5.3, a corresponding FS is: In the video the boy is in the bathroom before running away.</p>
        <p>[Figure 6B: Prompt used for FS generation (Spatial category): "Given an Italian caption (TS) regarding the position or location of someone or something, your task is to create its foil (FS) by changing only the spatial information. Don't add other information with respect to what is stated in TS. Here is an example to guide you: TS: La donna nel video è in un campo di papaveri. FS: La donna nel video è in una classe."]</p>
        <p>Finally, we implemented two quality checks for the FSs using GPT-4o.</p>
        <p>Structural Check: aims at automatically verifying that each FS aligns correctly with its corresponding TS according to its category (e.g., for the Temporal category: "Given an Italian caption (C) dealing with temporal information about events and its foil (F), your task is to assess the correctness of F based on C. To be valid, F should express different temporal information with respect to the one expressed in C. If F is a valid foil, generate 'correct', otherwise 'not correct'."). While the GPT-4o evaluation initially flagged 864 out of 19,200 cases as incorrect, only 2.5% were ultimately confirmed as truly problematic and subsequently corrected through manual revision.</p>
        <p>Contradiction Test: performed by assuming that a correct FS must be in contradiction with the relevant TS. We ran an NLI task to classify TS-FS pairs as Entailment, Contradiction, or Neutral.</p>
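        <p>A minimal sketch of how such an NLI check over TS-FS pairs could be run with GPT-4o, reusing the label set quoted above, is given below; the prompt wrapping and the helper are illustrative assumptions.</p>
        <preformat>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

NLI_PROMPT = (
    "Your task is to determine the natural language inference (NLI) relationship between "
    "S1 and S2. The possible labels are: Entailment: S2 logically follows from S1. "
    "Contradiction: S2 contradicts S1. Neutral: S2 and S1 are related but do not entail or "
    "contradict each other. Provide only one label as output (Entailment, Contradiction, or Neutral)."
)

def nli_label(true_statement: str, false_statement: str) -> str:
    """Classify a TS-FS pair; a valid foil is expected to be labelled 'Contradiction'."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": NLI_PROMPT},
            {"role": "user", "content": f"S1: {true_statement}\nS2: {false_statement}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
        </preformat>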
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Lexical Variability</title>
        <p>As said in Section 5.2, we opted for a pool-based structure with 8 items per question in order to balance semantic consistency with lexical diversity, both across answers and across statements. To meet this requirement, we assessed and enhanced lexical richness within our data. This phase was carried out in several incremental steps (i.e., a string-based test, lexical overlap, and Type-Token Ratio (TTR) analysis), relying on spaCy (https://spacy.io). Since the TSs are generated from an automatic rephrasing of the Q&amp;A pairs, we checked and improved their lexical diversity; this indirectly benefits the corresponding FSs, which differ from the TSs by a single term.</p>
        <p>[Figure: Nouns in Q&amp;A across all videos. Nouns from TS and FS were excluded, as those sentences are derived from the Q&amp;A pairs and would result in redundant repetitions.]</p>
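        <p>For the Type-Token Ratio part of this check, a minimal sketch is given below, assuming spaCy with an Italian model such as it_core_news_sm; the lemma-based counting is an illustrative assumption, not the exact procedure.</p>
        <preformat>
import spacy

# Assumes the Italian model is installed: python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")

def type_token_ratio(texts: list[str]) -> float:
    """TTR over a pool of sentences: distinct lemmas divided by total tokens (punctuation excluded)."""
    tokens = [tok.lemma_.lower() for doc in nlp.pipe(texts) for tok in doc if not tok.is_punct]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# type_token_ratio(ts_pool) returns a value between 0 and 1; lower values signal repetitive pools.
        </preformat>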
      </sec>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This work has been carried out while Davide Testa was enrolled in the Italian National Doctorate on Artificial Intelligence run by Sapienza University of Rome in collaboration with Fondazione Bruno Kessler (FBK). Giovanni Bonetta and Bernardo Magnini were supported by the PNRR MUR project PE0000013-FAIR (Spoke 2). Alessandro Lenci and Alessandro Bondielli were supported by the PNRR MUR project PE0000013-FAIR (Spoke 1). Alessio Miaschi was supported by the PNRR MUR project PE0000013-FAIR (Spoke 5). Lucia Passaro was supported by the EU EIC project EMERGE (Grant No. 101070918).</p>
    </sec>
    <sec id="sec-4">
      <title>A. Additional Materials</title>
      <p>The following figures show examples of the forms
adopted for collecting the questions (Figure 8) and the
corresponding answers (Figure 9).</p>
      <p>[Figure 8 outline: General Task, Privacy Policy and Research Purposes, Category-specific task, Example, 2-Questions generation.]</p>
      <p>Declaration on Generative AI. During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>8253</fpage>
          -
          <lpage>8280</lpage>
          . URL:
          <article-title>pirical Methods in Natural Language Processing</article-title>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          https://aclanthology.org/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>567</volume>
          . doi:10. Association for Computational Linguistics, Brus-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2022</year>
          .
          <article-title>acl-long.567. sels</article-title>
          , Belgium,
          <year>2018</year>
          , pp.
          <fpage>1369</fpage>
          -
          <lpage>1379</lpage>
          . URL: https: [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          , L. van der Maaten, L. Fei- //aclanthology.org/D18-1167/. doi:
          <volume>10</volume>
          .18653/v1/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Fei</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Clevr: A diagnostic D18-1167.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>dataset for compositional language</article-title>
          and elementary [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          , I. Laptev,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          , Just
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>visual reasoning</article-title>
          , in: CVPR,
          <year>2017</year>
          .
          <article-title>ask: Learning to answer questions from</article-title>
          millions of [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pezzelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Klimovich</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Herbe- narrated videos,
          <year>2021</year>
          . URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>lot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Nabi</surname>
            , E. Sangineto,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Bernardi</surname>
            ,
            <given-names>FOIL</given-names>
          </string-name>
          <year>2012</year>
          .
          <volume>00451</volume>
          . arXiv:
          <year>2012</year>
          .00451.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>it! find one mismatch between image</article-title>
          and lan- [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grunde-McLaughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          , M. Agrawala,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>Proceedings of the 55th Annual Meeting of the temporal reasoning</article-title>
          , in: Proceedings of the
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>ume 1: Long Papers), Association for Computa-</article-title>
          tern
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>11287</fpage>
          -
          <lpage>11297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>tional Linguistics</source>
          , Vancouver, Canada,
          <year>2017</year>
          , pp. [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          255-
          <fpage>265</fpage>
          . URL: https://aclanthology.org/P17-1024/. J.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Qiao</surname>
          </string-name>
          , Mvbench:
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>doi:10</source>
          .18653/v1/
          <fpage>P17</fpage>
          -1024.
          <article-title>A comprehensive multi-modal video understanding [15]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Suhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>A corpus of natu- benchmark</article-title>
          ,
          <source>CVPR</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>ral language for visual reasoning</article-title>
          , in: R. Barzilay, M.-
          <volume>48550</volume>
          /arXiv.2311.17005.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 55th Annual</source>
          Meet- [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          , T.-S. Chua,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>tics (Volume</source>
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <article-title>Association for Com-</article-title>
          and challenges, in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>putational Linguistics</source>
          , Vancouver, Canada,
          <year>2017</year>
          , pp.
          <source>Y</source>
          . Zhang (Eds.),
          <source>Proceedings of the 2022</source>
          Con-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          217-
          <fpage>223</fpage>
          . URL: https://aclanthology.org/P17-2034/. ference on Empirical Methods in Natural Lan-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>doi:10</source>
          .18653/v1/
          <fpage>P17</fpage>
          -2034. guage Processing, Association for Computational [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Thrush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bartolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          , Linguistics, Abu Dhabi, United Arab Emirates,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ross</surname>
          </string-name>
          , Winoground: Prob- 2022, pp.
          <fpage>6439</fpage>
          -
          <lpage>6455</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <article-title>ing vision and language models for visio-linguistic 2022</article-title>
          .
          <article-title>emnlp-main</article-title>
          .
          <volume>432</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          compositionality,
          <source>in: CVPR</source>
          <year>2022</year>
          ,
          <year>2022</year>
          . emnlp-main.
          <volume>432</volume>
          . [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pezzelle</surname>
          </string-name>
          , The BLA bench- [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grunde-McLaughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          , M. Agrawala,
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2023 Conference IEEE/CVF Conference on Computer Vision</source>
          and Pat-
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>on Empirical Methods in Natural Language Pro- tern Recognition</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>cessing</surname>
            , Association for Computational Linguis- [27]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , K. Ren,
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>5817</fpage>
          -
          <lpage>5830</lpage>
          . URL: https: J.
          <string-name>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>ANetQA: A Large-scale Benchmark for Fine-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          //aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>356</volume>
          /. doi:10. grained Compositional Reasoning over Untrimmed
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2023</year>
          .emnlp-main.
          <volume>356</volume>
          . Videos , in: 2023 IEEE/CVF Conference on Com[18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Carrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Messina</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Gennaro, puter Vision and Pattern Recognition (CVPR), IEEE</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <article-title>The devil is in the fine-grained details</article-title>
          :
          <source>Computer Society</source>
          , Los Alamitos, CA, USA,
          <year>2023</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Evaluating</surname>
          </string-name>
          open
          <article-title>-vocabulary object detectors for 23191-23200</article-title>
          . URL: https://doi.ieeecomputersociety.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <article-title>ifne-grained understanding</article-title>
          ,
          <source>in: Proceedings of the org/10.1109/CVPR52729</source>
          .
          <year>2023</year>
          .
          <volume>02221</volume>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>IEEE/CVF Conference on Computer Vision and Pat- CVPR52729</source>
          .
          <year>2023</year>
          .
          <volume>02221</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <source>tern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>22520</fpage>
          -
          <lpage>22529</lpage>
          . [28]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kesen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pedrotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. C.</surname>
          </string-name>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          , T.-S. Chua, Can i trust Acikgoz,
          <string-name>
            <given-names>L.</given-names>
            <surname>Parcalabescu</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Calixto</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Frank,
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          answering, in: CVPR,
          <year>2024</year>
          , pp.
          <fpage>13204</fpage>
          -
          <lpage>13214</lpage>
          . URL:
          <article-title>benchmark for linguistic and temporal ground-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          https://doi.org/10.1109/CVPR52733.
          <year>2024</year>
          .
          <volume>01254</volume>
          .
          <article-title>ing in video-language models</article-title>
          ,
          <year>2023</year>
          . URL: https: [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          , Y. LeCun, S. Xie, //arxiv.org/abs/2311.07022. arXiv:
          <volume>2311</volume>
          .
          <fpage>07022</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>Eyes wide shut? exploring the visual shortcomings</article-title>
          [29]
          <string-name>
            <given-names>V.</given-names>
            <surname>Patraucean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Smaira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . R. Con-
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <article-title>of multimodal llms</article-title>
          ,
          <source>in: CVPR</source>
          <year>2024</year>
          ,
          <year>2024</year>
          . tinente, L. Markeeva,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Banarse</surname>
          </string-name>
          , S. Koppula, [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          , T. Berg, TVQA:
          <string-name>
            <surname>Lo- J. Heyward</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Malinowski</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doersch</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 2018 Conference on Em- tar, S. Osindero</source>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Damen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , J. Car-
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>tems Datasets and Benchmarks Track</source>
          ,
          <year>2023</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          https://openreview.net/forum?id=HYEGXFnPoq. [30]
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Basili</surname>
          </string-name>
          , Gqa-
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          ,
          <year>2021</year>
          . URL: https://api.semanticscholar.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          org/CorpusID:245125448. [31]
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Shafique</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vayani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rasheed</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <source>benchmark model</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>abs/2506</source>
          .07032. arXiv:
          <volume>2506</volume>
          .
          <fpage>07032</fpage>
          . [32]
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
          </string-name>
          , I. Koy-
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <article-title>the 62nd Annual Meeting of the Association for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          (Volume
          <volume>1</volume>
          : Long Pa-
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>Bangkok</surname>
          </string-name>
          , Thailand,
          <year>2024</year>
          , pp.
          <fpage>7768</fpage>
          -
          <lpage>7791</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>420</volume>
          /. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>420</volume>
          . [33]
          <string-name>
            <given-names>O.</given-names>
            <surname>Mañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krojer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , Improving auto-
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>els</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2310.02567.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <source>arXiv:2310</source>
          .
          <fpage>02567</fpage>
          . [34]
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Gul</surname>
          </string-name>
          , Y. Artzi, CoGen: Learning from feed-
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <source>Proceedings of the 2024 Conference on Empiri-</source>
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <string-name>
            <surname>Florida</surname>
          </string-name>
          , USA,
          <year>2024</year>
          , pp.
          <fpage>12966</fpage>
          -
          <lpage>12982</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          //aclanthology.org/
          <year>2024</year>
          .emnlp-main.
          <volume>721</volume>
          /. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2024</year>
          .emnlp-main.
          <volume>721</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>