<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MAIA: a Benchmark for Multimodal AI Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Testa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Bonetta</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafaella Bernardi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Passaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Philology</institution>
          ,
          <addr-line>Literature and Linguistics</addr-line>
          ,
          <institution>University of Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fondazione Bruno Kessler (FBK)</institution>
          ,
          <addr-line>Trento</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>Bolzano</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Istituto di Linguistica Computazionale "A. Zampolli" (CNR-ILC), ItaliaNLP Lab</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Università di Roma La Sapienza</institution>
          ,
          <addr-line>Roma</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We introduce MAIA (Multimodal AI Assessment), a multimodal dataset developed as a core component of a competence-oriented benchmark designed for fine-grained investigation of the reasoning abilities of Visual Language Models (VLMs) on videos. The MAIA benchmark is characterized by several distinctive features. First, to the best of our knowledge, MAIA is the first Italian-native benchmark addressing video understanding: videos were carefully selected to reflect Italian culture, and the language data (i.e., questions and reference answers) were produced by native Italian speakers. Second, MAIA explicitly includes twelve reasoning categories that are specifically designed to assess the reasoning abilities of VLMs on videos. Third, we structured the dataset to support two aligned tasks (i.e., statement verification and open-ended visual question answering) built on the same datapoints, thus allowing us to assess VLM coherence across task formats. Finally, MAIA integrates, by design, state-of-the-art LLMs in the development process of the benchmark, taking advantage of their linguistic and reasoning capabilities both for data augmentation and for assessing and improving the overall quality of the data. In this paper we focus on the design principles and the data collection methodology, highlighting how MAIA provides a significant advancement with respect to other available datasets for VLM benchmarking. Data are available on GitHub.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodality</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Vision-Language Models</kwd>
        <kwd>Multimodal Reasoning</kwd>
        <kwd>Language Resources</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, mainly following the success of large language models (LLMs), there has been growing interest in large pre-trained models able to manage both texts and images. Such Vision and Language Models (VLMs) have been investigated both from a theoretical perspective (e.g., Baroni [<xref ref-type="bibr" rid="ref10">1</xref>]) and for their application-oriented interest (e.g., Bigham et al. [2]). Today, there are dozens of available VLMs, and the most popular families of generative AI models (e.g., Llama, Gemma, Qwen, GPT) include several VLMs, which can address a number of question answering tasks on both images and videos. As a consequence of the fast and increasing power of VLMs, assessing their performance on standardized tasks and metrics is becoming more and more challenging.</p>
      <p>First of all, evaluating VLM understanding in real-world scenarios requires moving beyond single-frame settings. Unlike static images, videos offer a rich temporal structure: they capture dynamic scenes, evolving actions, interactions, and causal dependencies that unfold over time, making them one of the most faithful and closest approximations to real-world complex scenarios. In this context, the role of evaluation becomes critical: to truly assess a model's ability to understand, reason, and ground meaning across modalities, we need benchmarks that do not merely test task performance, but probe the underlying competences of the model [3].</p>
      <p>With this purpose in mind, we introduce MAIA (Multimodal AI Assessment), a multimodal dataset developed as a core component of a broader competence-oriented evaluation framework for VLMs. MAIA is designed to challenge models on multimodal reasoning grounded in real-world scenarios from different linguistic perspectives. To the best of our knowledge, it is the first native Italian evaluation dataset of its kind based on video content. MAIA provides a linguistically rich and semantically diverse resource for exploring vision and language understanding in realistic contexts, with a particular focus on Italian culture, by covering distinct reasoning categories, each targeting specific semantic phenomena. This structure allows for a fine-grained analysis of the contribution of both the language and the visual modality across different types of reasoning. A key feature of MAIA is its cascading data collection approach, which enables the same source data to be reused across multiple task formats (e.g., generative tasks, classification tasks, etc.), supporting fully comparable evaluations and paving the way for an all-in-one benchmarking strategy. The efficacy of this approach and of the MAIA benchmark as a severe and robust evaluation framework has been demonstrated in Testa et al. [4], in which we evaluate models against a classification and a generative task, namely visual statement verification and open-ended question answering. While the second task turns out to be more challenging even for the best-performing models, models also exhibit significant inconsistencies both within and across the two tasks, with some categories relying more heavily on either the visual or the linguistic component to solve the task. In this paper, however, we focus on how the dataset was collected. Finally, an additional innovative aspect of the MAIA data creation pipeline lies in the integration of human annotation with targeted data augmentation using powerful LLMs (GPT-4o [5]), combined with a multi-stage semi-automatic validation process conducted with the same model at different levels. This dual use of a generative model (i.e., GPT-4o) not only enhances the diversity and coverage of the dataset but also ensures high-quality and semantically consistent data throughout the pipeline.</p>
      <p>The paper is organized as follows. Section 2 reviews the most relevant prior work in the research area. In Section 3, we detail the design choices behind the creation of the dataset and, more broadly, the development of the entire MAIA benchmark. Finally, Sections 4 and 5 describe the specific steps followed for dataset construction: the former focuses on the selection and collection of video material, while the latter addresses the collection and validation of all the linguistic data that constitute MAIA. Both sections are complemented by dedicated analyses of the collected data.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>Multimodal datasets combining vision and language have played a crucial role in the development and evaluation of VLMs. Early image-based resources such as the VQA [6], GQA [7], DVD [8], and HL [9] datasets have provided controlled environments to assess visual reasoning and natural language understanding through several tasks, like image captioning or Visual Question Answering, thereby reinforcing the role of vision as a fundamental component in the evaluation of multimodal models [6]. Over time, contributions of this kind have been instrumental in shaping the foundations of multimodal evaluation, where language understanding is assessed in conjunction with perceptual grounding. Simultaneously, these efforts have revealed critical weaknesses in early multimodal architectures, highlighting their reliance on dataset biases or shallow heuristics rather than genuine visual reasoning [<xref ref-type="bibr" rid="ref13 ref19">10, 11</xref>]. Such challenges have later been framed within the broader phenomenon of Unimodal Collapse, where a VLM disproportionately depends on its language component, resulting in text-only models performing comparably to their multimodal counterparts [12]. In contrast to earlier stages [13, 14, 15], the growing awareness of these issues has prompted the emergence of diagnostic evaluation frameworks, such as those in Parcalabescu et al. [12], Thrush et al. [16], Chen et al. [17], and Bianchi et al. [18], and of carefully curated benchmarks, such as those in Xiao et al. [19] and Tong et al. [20], designed to expose the true capabilities and limitations of VLMs. These methodological insights strongly motivate the design of MAIA as a robust, controlled multimodal dataset, aimed at ensuring that models genuinely integrate both linguistic and visual information, rather than relying solely on the priors embedded in their language backbones.</p>
      <p>Building on this tradition, video-language datasets have lately extended the challenge to temporal understanding and dynamic scene interpretation, both essential components for complex real-world understanding. Several resources, including the TVQA [21] and HowToVQA [22] datasets or the AGQA [23] and MVBench [24] benchmarks, shifted their focus from static perception to actions and entities, challenging VLMs to identify the relationships between them. As in the case of image-based evaluation, early surveys have already stressed the need for careful and systematic assessment (Zhong et al. [25]). While task-oriented benchmarks often report strong performance [26, 27], more fine-grained evaluations have revealed critical limitations [28], and competence-based analyses continue to highlight the substantial gap in the video understanding capabilities of VLMs [29]. In this context, MAIA contributes as a new video-language dataset aimed at evaluating VLMs not only on videos featuring temporal dynamics and meaningful content, but also through a competence-oriented design that explores the interplay between language and vision, a dimension largely neglected in prior Video QA benchmarks.</p>
      <p>Italian Multimodal Datasets. Most multimodal datasets are available in English, with only limited multilingual or other native-language resources, and Italian is consistently underrepresented. In the image domain, the GQA-it dataset [30] is a notable attempt to adapt a visual question answering dataset into Italian. More recent benchmarks like XGQA [31] and EXAMS-V [32] include translated Italian multiple-choice questions, but lack original content and do not target high-level reasoning. MAIA fills this gap as the first Italian-native video-language dataset specifically designed to assess complex visual reasoning and grounding.</p>
      <p>[Figure 1: Overview of the MAIA benchmark. Q&amp;A pairs (1 question : 8 answers) are expanded into True Statements (TS) and paired True-False Statements (TS-FS), which feed the Statement Verification task and the Open-ended Q&amp;A task used to evaluate VLMs.]</p>
    </sec>
    <sec id="sec-2">
      <title>3. MAIA: Benchmark Design</title>
      <p>This section presents the design principles, structure, and construction pipeline of both the MAIA dataset and the benchmark built upon it. In line with this, Figure 1 illustrates the overall workflow adopted for dataset creation, embedding it within the broader architectural framework of the benchmark, which also includes the downstream tasks the data are designed to support.</p>
      <p>As shown, the dataset creation begins with the collection of short videos, each associated with twelve high-level reasoning categories. These categories reflect different semantic phenomena and were chosen to ensure a rich and controlled testing environment for visual and linguistic reasoning. Based on these categories, we constructed our multimodal dataset by first collecting a set of questions that served as the conceptual backbone for the creation of the linguistic data, both manually collected (i.e., a set of answers) and automatically generated (i.e., True and False statements), as described in detail in Section 5. Figure 2 illustrates an example of a MAIA item (although all source data are in Italian, examples are presented in English to enhance readability) and highlights the cascading logic behind the data creation process. This architecture supports the development of two aligned evaluation tasks: a Visual Statement Verification task, using paired true/false statements to assess the model's ability to distinguish accurate from misleading content in a multiple-choice format, and an open-ended Visual Question Answering task, where each question is matched with eight different human answers serving as a reference set to evaluate the quality of the response generated by the VLM. Each task tests different aspects of visual understanding and reasoning, all grounded in the same set of videos and categories.</p>
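      <p>To make this cascading structure concrete, the sketch below shows one possible way a single MAIA datapoint could be represented in code; the field names and the container itself are illustrative assumptions, not the released data schema.</p>
      <preformat>
from dataclasses import dataclass, field

@dataclass
class MAIAItem:
    """Illustrative container for one MAIA datapoint (field names are assumptions)."""
    video_id: str                       # one of the 100 short videos
    category: str                       # one of the 12 reasoning categories
    question: str                       # open-ended question in Italian
    answers: list[str] = field(default_factory=list)           # 8 human reference answers
    true_statements: list[str] = field(default_factory=list)   # 8 TSs derived from the question-answer pairs
    false_statements: list[str] = field(default_factory=list)  # 8 FSs, each minimally edited from a TS

# The same item feeds both tasks: TS/FS pairs for statement verification,
# and the question plus its 8 reference answers for open-ended QA.
      </preformat>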
      <p>Table 1 presents the structure of the MAIA dataset
after the data creation and validation process.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Structure of the MAIA dataset after the data creation and validation process.</p></caption>
        <table>
          <thead>
            <tr><th>Feature</th><th>n</th></tr>
          </thead>
          <tbody>
            <tr><td>Videos</td><td>100</td></tr>
            <tr><td>Semantic Categories</td><td>12</td></tr>
            <tr><td>Questions (Q)</td><td>2,400</td></tr>
            <tr><td>Answers (A)</td><td>19,200</td></tr>
            <tr><td>True Statements (TS)</td><td>19,200</td></tr>
            <tr><td>False Statements (FS)</td><td>19,200</td></tr>
          </tbody>
        </table>
      </table-wrap>
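      <p>The counts in Table 1 follow directly from the collection design described in Section 5 (100 videos, 12 categories, 2 questions per category per video, 8 answers per question); the short check below simply restates that arithmetic.</p>
      <preformat>
videos, categories, questions_per_category = 100, 12, 2
answers_per_question = 8

questions = videos * categories * questions_per_category   # 2,400 questions
answers = questions * answers_per_question                  # 19,200 answers
assert (questions, answers) == (2400, 19200)
# True and False Statements mirror the answers one-to-one: 19,200 each.
      </preformat>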
      <sec id="sec-2-1">
        <title>3.1. Reasoning Categories</title>
        <p>[Figure 2: Example of a MAIA item for a pizza-making video. For each reasoning category (Causal, Counterfactual, Implicit, Uncertainty, Out-of-scope, Planning, Sentiment, Spatial, Temporal, with Partial/Total/Duration subtypes where applicable), the figure reports a question, a human answer, a true statement, and a false statement; for example, for the Causal category: question "Why is mozzarella melted?", answer "The heat from the wood oven has melted it", TS "Mozzarella is melted by the heat of the wood oven", FS "Mozzarella is melted by the heat generated by the sun."]</p>
        <p>We defined 12 reasoning categories as the outcome of two pilot studies conducted with a group of expert volunteer annotators. These pilots aimed to identify the optimal number, type, and specificity of the categories needed to effectively probe the cognitive and linguistic abilities of VLMs on our videos. Based on the feedback received, some initially proposed categories were merged due to content overlap or redundancy. Conversely, other categories were added to enhance the granularity of reasoning assessment (e.g., we introduced a Planning category, as we consider it a meaningful expression of reasoning skills). These refinements allowed us to design a more robust and informative framework to explore the interplay between language and vision in multimodal processing.</p>
        <p>The following paragraphs introduce the final macro-categories, including their definitions and any associated sub-categories.</p>
        <p>Causal focuses on reasoning about the causes or effects of events depicted in the video. It includes two subtypes, namely Implicit and Explicit (unlike the following cases, these are not treated as distinct sub-categories but as two equally represented subtypes of the same category), offering a comprehensive test of a model's ability to describe causality within events. The former involves inferring unobservable causes from visible effects in the scene, requiring logical reasoning beyond what is directly shown. The latter concerns clearly observable cause-and-effect dynamics, where either the cause or the effect is directly identifiable from the video content.</p>
        <p>Counterfactual focuses on questions about hypothetical scenarios that do not actually occur in the video but could take place under specific conditions. These questions are based on entities or events visible in the video and explore the consequences of an event or situation that might happen in the video if a certain condition were met. This category tests the ability of a model to reason about hypothetical scenarios grounded in the context of the video while deriving logical and plausible outcomes from such scenarios.</p>
        <p>Implicit investigates entities, events, or their attributes that are not explicitly visible in the video while their presence or properties can be reasonably inferred from the context. It evaluates the ability of a model to infer implicit details based on context, whether the target information was never shown or was previously visible but later obscured.</p>
        <p>Total Implicit: involves entities or events that are never directly visible in the video but can be inferred from observable details. A typical answer provides the requested information based on logical inference.</p>
        <p>Partial Implicit: involves entities or events that were visible earlier in the video but are no longer visible due to a shift in the scene or because they have moved out of the frame.</p>
        <p>Out-of-scope refers to entities or events entirely absent from the video, focusing on properties or details of these non-existent elements. Typical answers to this question type involve a negation, signaling that the referenced entity or event is not present in the scene. This category indirectly tests the ability of a model to detect multimodal hallucinations and an assertiveness tendency in its responses.</p>
        <p>Planning asks for the actions needed to achieve a specific goal related to the video. The typical response to a planning question is a sequence of actions that someone should perform in order to reach the desired outcome. This category assesses the ability of the model to infer and plan the necessary steps to accomplish a goal based on the visual cues provided in the video.</p>
        <p>Sentiment assesses the sentiment, mood, attitude, or emotion displayed by characters in the video toward other entities or events in the scene, throughout the entire video. A typical response to a sentiment question may describe a specific sentiment, attitude, or emotion, or it may reflect a neutral stance. This category evaluates the ability of the model to recognize and identify the emotional state or attitude of characters based on visual cues.</p>
        <p>Spatial investigates the spatial relationships between entities, objects, or events depicted in the video. It aims at assessing the model's ability to infer both stable and time-dependent spatial relationships, as well as the ability to determine relative positioning in space and to rely on grounding competencies.</p>
        <p>Total Spatial: focuses on the position of entities in space (including their relation to other entities) that remains constant throughout the whole video, disregarding any temporal variations or minimal movements of the entity at different moments in the video. A typical response to this type of question provides general spatial information valid for the entire duration of the video.</p>
        <p>Partial Spatial: focuses on time-related positions of entities in space, taking into account events occurring in the scene. A typical answer to this question provides spatial information that is valid only for the requested time range in the video.</p>
        <p>Temporal focuses on temporal information and studies the ability of a model to infer temporal relationships, sequences of events, and durations from visual content in a coherent manner.</p>
        <p>Partial Temporal: focuses on the temporal properties and relationships between events in the video, excluding their duration. Questions target aspects such as when something happens or whether it occurs before or after another event. Typical answers specify the event along with the requested temporal detail.</p>
        <p>Duration Temporal: focuses on a specific property of events in the video: their duration. A typical answer to a question of this kind can express the duration of the event in several ways.</p>
        <p>Uncertainty refers to entities or events present in the video but lacking sufficient information to answer the question precisely. Questions are inherently ambiguous, as the visual content does not fully support a definitive response. Answers may offer plausible options, acknowledge uncertainty, or signal that the reply is a guess. This category tests a VLM in handling ambiguity and incomplete evidence, and in assessing its tendency to respond assertively.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Curated Video Dataset</title>
      <sec id="sec-3-a">
        <title>4.1. Video Selection</title>
        <p>A key design choice for the MAIA benchmark was to reflect Italian culture in real-world scenarios through a carefully curated selection of video clips. To ensure richness and variety, the selection process was based on the following thematic areas: Locations, Food, Sport, Job, Nature, Activities. These topics allowed us to collect a dataset showing iconic Italian cities and locations, daily activities (e.g., enjoying breakfast at a café, cooking pasta, attending a soccer match), and typical events (e.g., Italian local festivals or weddings). This cultural focus was not intended to limit the generalizability of the benchmark, but rather to offer a valuable opportunity to assess model performance on culturally grounded data, an aspect often underrepresented in existing multimodal resources.</p>
      </sec>
      <sec id="sec-3-b">
        <title>4.2. Video Collection</title>
        <p>We collected a culturally representative set of 100 short videos (~30 seconds each) sourced from YouTube Italy. Following the criteria described in Section 4.1, videos were retrieved using keyword-based queries across the selected thematic areas. Only Creative Commons licensed content was included to ensure reproducibility. When necessary, longer videos were manually checked and cut to extract the most relevant 30-second segments, resulting in a uniform and culturally grounded video set.</p>
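        <p>Where longer source videos had to be trimmed, the extraction of a 30-second segment can be reproduced with a standard ffmpeg call; the Python wrapper below is only a sketch under the assumption that ffmpeg is installed, with an illustrative start offset, and does not describe the exact tooling used.</p>
        <preformat>
import subprocess

def cut_segment(src: str, dst: str, start: str = "00:00:00", duration: int = 30) -> None:
    """Extract a ~30-second clip starting at `start` (hh:mm:ss) using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", start, "-i", src, "-t", str(duration), "-c", "copy", dst],
        check=True,
    )

# cut_segment("full_video.mp4", "clip_30s.mp4", start="00:01:05")
        </preformat>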
      </sec>
      <sec id="sec-3-c">
        <title>4.3. Analysis of Videos</title>
        <p>To better understand the visual content present in the MAIA benchmark, we conducted an object detection and classification analysis over the full set of videos using a YOLOv11 detection pipeline (https://docs.ultralytics.com/it/models/yolo11/). For each video, we sampled 32 uniformly spaced frames and ran object detection on them. This analysis provides a high-level view of the typical object types in MAIA.</p>
        <p>Figure 3 shows the frequency distribution of detected object labels across all annotated frames. Person is by far the most common object class, reflecting the human-centered nature of most videos in the benchmark. However, the dataset also includes a wide variety of everyday objects, suggesting a rich and diverse set of visual elements.</p>
        <p>Figure 4 shows the distribution of the number of detected objects per frame. Most frames contain a moderate number of objects, typically between two and six. This indicates that the videos offer a balance between visual simplicity and complexity, making them suitable for testing both low-level perception and high-level reasoning in VLMs.</p>
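        <p>A minimal sketch of this frame-sampling and detection step is shown below, assuming the ultralytics and opencv-python packages and a YOLOv11 checkpoint (the file name yolo11n.pt is illustrative); it is not the exact pipeline used for MAIA.</p>
        <preformat>
from collections import Counter

import cv2
import numpy as np
from ultralytics import YOLO

def detect_objects(video_path: str, n_frames: int = 32) -> Counter:
    """Sample n_frames uniformly spaced frames and count the detected object labels."""
    model = YOLO("yolo11n.pt")  # hypothetical checkpoint name
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    labels = Counter()
    for idx in np.linspace(0, total - 1, n_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        result = model(frame, verbose=False)[0]
        labels.update(result.names[int(c)] for c in result.boxes.cls)
    cap.release()
    return labels

# Aggregating label counts over the whole video set gives a distribution like Figure 3:
# counts = sum((detect_objects(p) for p in video_paths), Counter())
        </preformat>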
      <sec id="sec-3-1">
        <title>ANSWER 1</title>
      </sec>
      <sec id="sec-3-2">
        <title>ANSWER 2</title>
      </sec>
      <sec id="sec-3-3">
        <title>ANSWER 3</title>
      </sec>
      <sec id="sec-3-4">
        <title>ANSWER 4</title>
      </sec>
      <sec id="sec-3-5">
        <title>ANSWER 5</title>
      </sec>
      <sec id="sec-3-6">
        <title>ANSWER 6</title>
      </sec>
      <sec id="sec-3-7">
        <title>ANSWER 7</title>
      </sec>
      <sec id="sec-3-8">
        <title>ANSWER 8</title>
        <p>What role do the men in white shirts play?
Che ruolo svolgono gli uomini con le maglie bianche?</p>
        <p>The men in white shirts are the competition judges
Gli uomini con le maglie bianche sono i giudici di gara</p>
        <p>They observe who scores a point</p>
        <p>Osservano chi fa punto
Men in white give judgements on the competition</p>
        <p>Gli uomini in bianco danno giudizi sulla gara</p>
        <p>They seem to be the referees of this bocce game</p>
        <p>Sembra che siano gli arbitri di questa partita a bocce
They measure the distance of the thrown ball from the little one and determine the winner of the set</p>
        <p>Misurano la distanza della boccia tirata dal boccino e decretano il vincitore del set</p>
        <p>The men in white shirts are the referees of the match
Gli uomini con le maglie bianche sono gli arbitri dellla partita</p>
        <p>The men in white are the jury</p>
        <p>Gli uomini in bianco sono i giudici</p>
        <p>Men in white shirts play the role of refereeing the match</p>
        <p>Gli uomini con le maglie bianche svolgono il compito di arbitrare la partita
and/or events involved in both of them. Each provided
form contained both the definition of the assigned
semantic category with examples, and also general rules to be
followed (see Appendix, Figure 8 for an example of the
form used). Each question had to be generated naturally
and as an open-ended question. Questions involving a
‘Yes/No’ answer (e.g. Is there a car in the video?) were not
allowed. Finally, for the correct execution of the task, the
audio of the video had to be ignored, as the VLMs to be
tested could only work on the visual part. Subsequently,
questions were manually reviewed to ensure quality and
category alignment.
5.2. Answers Collection
Italian as their first language, and had spent the majority
of their first 18 years of life in Italy. As with the
question collection step, we used Google Forms to provide the
task6. Each form included 10 videos, and for each video,
the annotators were asked to answer 12 questions, one
per reasoning category (see Appendix, Figure 9 for an
example of the form used). Annotators were encouraged
to use their own world knowledge when interpreting the
visual content of the video.</p>
        <p>To guarantee high quality of the collected answers,
we employed rigid control mechanisms based on sanity
check questions. Answers were accepted only if the
annotators correctly answered at least 90% of these control
questions, otherwise their submissions were rejected and
the task was reassigned to another annotator. In total,
2, 400 questions were paired with 8 answers each,
resulting in 19, 200 responses. They were then further checked
by a semi-automated two-step validation process based
on GPT-4o with few-shot prompting:
The goal of this phase was to collect 8 diferent answers
for each question to ensure not only accuracy but also
variability in responses. This choice is also supported by
ifndings from Mañas et al. [33], who empirically show
that using up to 8 demonstrations provides an efective
trade-of between diversity, accuracy, and computational Semantic Consistency Check. Each response was
eficiency in in-context learning with LLMs for VQA eval- evaluated for semantic consistency with the
corresponduation. We used the Prolific platform5 and selected an- ing question. In cases where inconsistencies were
denotators aged 25 to 80 who were born in Italy, spoke tected, the answers were manually reviewed to assess
5https://www.prolific.com
6Annotators were paid £7 per hour for answering questions</p>
        <p>Semantic Consistency Check. Each response was evaluated for semantic consistency with the corresponding question. In cases where inconsistencies were detected, the answers were manually reviewed to assess whether the question should be re-answered by another annotator or the responses could still be accepted. Inconsistent answers turned out to be minimal (about 100 out of 19,200 responses).</p>
        <p>Contradiction Test. We checked whether, within each pool of 8 responses to the same question, any of the responses contradicted the others. We found that 90.25% of the 8-answer pools exhibit full agreement, as they do not contain any contradictions. The remaining 9.75% (234 cases) were manually reviewed by an additional annotator to resolve inconsistencies.</p>
        <p>[Figure: Prompt used for the NLI-based contradiction check: "Your task is to determine the natural language inference (NLI) relationship between S1 and S2. The possible labels are: Entailment: S2 logically follows from S1. Contradiction: S2 contradicts S1. Neutral: S2 and S1 are related but do not entail or contradict each other. Provide only one label as output (Entailment, Contradiction, or Neutral)."]</p>
        <p>A post-processing phase of the responses was then implemented to ensure a sufficient degree of variability and reduce potential redundancy within each of the 2,400 pools of 8 answers (see Section 5.6). Figure 5 shows an example of one 8-answer pool associated with a video and a question, after the refinement procedure described above.</p>
        <p>[Figure 5: Example of an 8-answer pool for the question "What role do the men in white shirts play?" ("Che ruolo svolgono gli uomini con le maglie bianche?"), with eight different human answers describing the men in white shirts as the judges/referees of a bocce game.]</p>
        <p>[Figure: Prompt used to judge a candidate answer against the reference pool: "Given a question (Q), a candidate answer (A), and a set of 8 reference answers (R1–R8), your task is to determine whether A is correct. A is considered correct if it aligns with at least one of the reference answers. Return only one label as output: 'Correct' or 'Incorrect'."]</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. True Statement Generation</title>
        <p>At this step we automatically generate a true statement (TS) for each question-answer pair collected in the previous phases. A TS consists of a descriptive declarative sentence aligned with the visual content of the video. For example, if a video shows a boy who is initially in a kitchen and who hears a loud noise and runs away, a TS for the Spatial category could be: In the video, the boy is in the kitchen before running away.</p>
        <p>[Figure 6A: Prompt used for TS generation: "Given an Italian question Q and an answer A concerning a video, you must create a statement S based on A. While generating S, try not to alter the words composing A. If A includes first-person verbs or phrases (e.g., 'I think,' 'I believe'), rephrase S to be impersonal, avoiding a first-person perspective. The statement should be a concise, declarative sentence."]</p>
        <p>To create the TSs we used GPT-4o, with the prompt in Figure 6A, leveraging the combination of each question and its answer to automatically generate 19,200 true statements. As with the answers, the TSs are organised into 2,400 pools of 8 items, each expressing the same event with different wording. Following the same procedure used for the pools of 8 responses, we performed a quality check to ensure lexical variability within the 2,400 pools of true statements (see Section 5.6).</p>
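        <p>As a rough illustration of this step, the sketch below wraps the Figure 6A prompt in a GPT-4o call through the OpenAI Python client; the model name, temperature, and helper function are assumptions rather than the exact generation script.</p>
        <preformat>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TS_PROMPT = (
    "Given an Italian question Q and an answer A concerning a video, you must create a "
    "statement S based on A. While generating S, try not to alter the words composing A. "
    "If A includes first-person verbs or phrases (e.g., 'I think,' 'I believe'), rephrase S "
    "to be impersonal, avoiding a first-person perspective. "
    "The statement should be a concise, declarative sentence."
)

def generate_true_statement(question: str, answer: str) -> str:
    """Turn one question-answer pair into a true statement (TS); illustrative only."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": TS_PROMPT},
            {"role": "user", "content": f"Q: {question}\nA: {answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
        </preformat>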
      </sec>
      <sec id="sec-5-4">
        <title>5.4. False Statement Generation</title>
        <p>The goal of this phase is to create a false statement (FS) for each TS already collected, in order to form a minimal TS-FS pair, enabling controlled experiments and a precise analysis of a model's behavior with respect to the reasoning categories. As for the TSs, the FSs were automatically generated using GPT-4o, editing only the elements of the sentence related to the relevant semantic category, an approach inspired by the caption-foil method [14]. Figure 6B shows one of the prompts used for FS generation (due to space constraints, we could not include all the 12 prompts used for generating FSs specific to each reasoning category; however, the prompt shown is representative of the adopted methodology). For instance, taking into account the previous example in Section 5.3, a corresponding FS is: In the video the boy is in the bathroom before running away.</p>
        <p>[Figure 6B: Prompt used for FS generation (Spatial category): "Given an Italian caption (TS) regarding the position or location of someone or something, your task is to create its foil (FS) by changing only the spatial information. Don't add other information with respect to what is stated in TS. Here is an example to guide you: TS: La donna nel video è in un campo di papaveri. FS: La donna nel video è in una classe."]</p>
        <p>Finally, we implemented two quality checks for the FSs using GPT-4o.</p>
        <p>Structural Check: aims at automatically verifying that each FS aligns correctly with its corresponding TS according to its category (e.g., for the Temporal category: "Given an Italian caption (C) dealing with temporal information about events and its foil (F), your task is to assess the correctness of F based on C. To be valid, F should express different temporal information with respect to the one expressed in C. If F is a valid foil, generate 'correct', otherwise 'not correct'."). While the GPT-4o evaluation initially flagged 864 out of 19,200 cases as incorrect, only 2.5% were ultimately confirmed as truly problematic and subsequently corrected through manual revision.</p>
        <p>Contradiction Test: performed by assuming that a correct FS must be in contradiction with the relevant TS. We ran an NLI task to classify TS-FS pairs as Entailment, Contradiction, or Neutral.</p>
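        <p>A minimal sketch of how such an NLI check over TS-FS pairs could be run with GPT-4o, reusing the label set quoted above, is given below; the prompt wrapping and the helper are illustrative assumptions.</p>
        <preformat>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

NLI_PROMPT = (
    "Your task is to determine the natural language inference (NLI) relationship between "
    "S1 and S2. The possible labels are: Entailment: S2 logically follows from S1. "
    "Contradiction: S2 contradicts S1. Neutral: S2 and S1 are related but do not entail or "
    "contradict each other. Provide only one label as output (Entailment, Contradiction, or Neutral)."
)

def nli_label(true_statement: str, false_statement: str) -> str:
    """Classify a TS-FS pair; a valid foil is expected to be labelled 'Contradiction'."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": NLI_PROMPT},
            {"role": "user", "content": f"S1: {true_statement}\nS2: {false_statement}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
        </preformat>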
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Lexical Variability</title>
        <p>As said in Section 5.2, we opted for a pool-based structure with 8 items per question in order to balance semantic consistency with lexical diversity, both across answers and across statements. To meet this requirement, we assessed and enhanced lexical richness within our data. This phase was carried out in several incremental steps (i.e., a string-based test, lexical overlap, and Type-Token Ratio (TTR) analysis), relying on spaCy (https://spacy.io). Since the TSs are generated from an automatic rephrasing of the Q&amp;A pairs, we checked and improved their lexical diversity; this indirectly benefits the corresponding FSs, which differ from the TSs by a single term.</p>
        <p>[Figure: Nouns in Q&amp;A across all videos. Nouns from TS and FS were excluded, as those sentences are derived from the Q&amp;A pairs and would result in redundant repetitions.]</p>
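        <p>For the Type-Token Ratio part of this check, a minimal sketch is given below, assuming spaCy with an Italian model such as it_core_news_sm; the lemma-based counting is an illustrative assumption, not the exact procedure.</p>
        <preformat>
import spacy

# Assumes the Italian model is installed: python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")

def type_token_ratio(texts: list[str]) -> float:
    """TTR over a pool of sentences: distinct lemmas divided by total tokens (punctuation excluded)."""
    tokens = [tok.lemma_.lower() for doc in nlp.pipe(texts) for tok in doc if not tok.is_punct]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# type_token_ratio(ts_pool) returns a value between 0 and 1; lower values signal repetitive pools.
        </preformat>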
      </sec>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This work has been carried out while Davide Testa was enrolled in the Italian National Doctorate on Artificial Intelligence run by Sapienza University of Rome in collaboration with Fondazione Bruno Kessler (FBK). Giovanni Bonetta and Bernardo Magnini were supported by the PNRR MUR project PE0000013-FAIR (Spoke 2). Alessandro Lenci and Alessandro Bondielli were supported by the PNRR MUR project PE0000013-FAIR (Spoke 1). Alessio Miaschi was supported by the PNRR MUR project PE0000013-FAIR (Spoke 5). Lucia Passaro was supported by the EU EIC project EMERGE (Grant No. 101070918).</p>
    </sec>
    <sec id="sec-4">
      <title>A. Additional Materials</title>
      <p>The following figures show examples of the forms
adopted for collecting the questions (Figure 8) and the
corresponding answers (Figure 9).</p>
      <p>[Figure 8 outline: General Task, Privacy Policy and Research Purposes, Category-specific task, Example, 2-Questions generation.]</p>
      <p>Declaration on Generative AI. During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>8253</fpage>
          -
          <lpage>8280</lpage>
          . URL:
          <article-title>pirical Methods in Natural Language Processing</article-title>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          https://aclanthology.org/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>567</volume>
          . doi:10. Association for Computational Linguistics, Brus-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2022</year>
          .
          <article-title>acl-long.567. sels</article-title>
          , Belgium,
          <year>2018</year>
          , pp.
          <fpage>1369</fpage>
          -
          <lpage>1379</lpage>
          . URL: https: [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          , L. van der Maaten, L. Fei- //aclanthology.org/D18-1167/. doi:
          <volume>10</volume>
          .18653/v1/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Fei</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Clevr: A diagnostic D18-1167.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>dataset for compositional language</article-title>
          and elementary [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          , I. Laptev,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          , Just
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>visual reasoning</article-title>
          , in: CVPR,
          <year>2017</year>
          .
          <article-title>ask: Learning to answer questions from</article-title>
          millions of [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pezzelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Klimovich</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Herbe- narrated videos,
          <year>2021</year>
          . URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>lot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Nabi</surname>
            , E. Sangineto,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Bernardi</surname>
            ,
            <given-names>FOIL</given-names>
          </string-name>
          <year>2012</year>
          .
          <volume>00451</volume>
          . arXiv:
          <year>2012</year>
          .00451.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>it! find one mismatch between image</article-title>
          and lan- [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grunde-McLaughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          , M. Agrawala,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>Proceedings of the 55th Annual Meeting of the temporal reasoning</article-title>
          , in: Proceedings of the
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>ume 1: Long Papers), Association for Computa-</article-title>
          tern
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>11287</fpage>
          -
          <lpage>11297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>tional Linguistics</source>
          , Vancouver, Canada,
          <year>2017</year>
          , pp. [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          255-
          <fpage>265</fpage>
          . URL: https://aclanthology.org/P17-1024/. J.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Qiao</surname>
          </string-name>
          , Mvbench:
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>doi:10</source>
          .18653/v1/
          <fpage>P17</fpage>
          -1024.
          <article-title>A comprehensive multi-modal video understanding [15]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Suhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>A corpus of natu- benchmark</article-title>
          ,
          <source>CVPR</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>ral language for visual reasoning</article-title>
          , in: R. Barzilay, M.-
          <volume>48550</volume>
          /arXiv.2311.17005.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 55th Annual</source>
          Meet- [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          , T.-S. Chua,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>tics (Volume</source>
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <article-title>Association for Com-</article-title>
          and challenges, in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>putational Linguistics</source>
          , Vancouver, Canada,
          <year>2017</year>
          , pp.
          <source>Y</source>
          . Zhang (Eds.),
          <source>Proceedings of the 2022</source>
          Con-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          217-
          <fpage>223</fpage>
          . URL: https://aclanthology.org/P17-2034/. ference on Empirical Methods in Natural Lan-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>doi:10</source>
          .18653/v1/
          <fpage>P17</fpage>
          -2034. guage Processing, Association for Computational [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Thrush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bartolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          , Linguistics, Abu Dhabi, United Arab Emirates,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ross</surname>
          </string-name>
          , Winoground: Prob- 2022, pp.
          <fpage>6439</fpage>
          -
          <lpage>6455</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <article-title>ing vision and language models for visio-linguistic 2022</article-title>
          .
          <article-title>emnlp-main</article-title>
          .
          <volume>432</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          compositionality,
          <source>in: CVPR</source>
          <year>2022</year>
          ,
          <year>2022</year>
          . emnlp-main.
          <volume>432</volume>
          . [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pezzelle</surname>
          </string-name>
          , The BLA bench- [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grunde-McLaughlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          , M. Agrawala,
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2023 Conference IEEE/CVF Conference on Computer Vision</source>
          and Pat-
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>on Empirical Methods in Natural Language Pro- tern Recognition</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>cessing</surname>
            , Association for Computational Linguis- [27]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , K. Ren,
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>5817</fpage>
          -
          <lpage>5830</lpage>
          . URL: https: J.
          <string-name>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>ANetQA: A Large-scale Benchmark for Fine-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          //aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>356</volume>
          /. doi:10. grained Compositional Reasoning over Untrimmed
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2023</year>
          .emnlp-main.
          <volume>356</volume>
          . Videos , in: 2023 IEEE/CVF Conference on Com[18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Carrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Messina</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Gennaro, puter Vision and Pattern Recognition (CVPR), IEEE</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Falchi</surname>
          </string-name>
          ,
          <article-title>The devil is in the fine-grained details</article-title>
          :
          <source>Computer Society</source>
          , Los Alamitos, CA, USA,
          <year>2023</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Evaluating</surname>
          </string-name>
          open
          <article-title>-vocabulary object detectors for 23191-23200</article-title>
          . URL: https://doi.ieeecomputersociety.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <article-title>ifne-grained understanding</article-title>
          ,
          <source>in: Proceedings of the org/10.1109/CVPR52729</source>
          .
          <year>2023</year>
          .
          <volume>02221</volume>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>IEEE/CVF Conference on Computer Vision and Pat- CVPR52729</source>
          .
          <year>2023</year>
          .
          <volume>02221</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <source>tern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>22520</fpage>
          -
          <lpage>22529</lpage>
          . [28]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kesen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pedrotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. C.</surname>
          </string-name>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          , T.-S. Chua, Can i trust Acikgoz,
          <string-name>
            <given-names>L.</given-names>
            <surname>Parcalabescu</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Calixto</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Frank,
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          answering, in: CVPR,
          <year>2024</year>
          , pp.
          <fpage>13204</fpage>
          -
          <lpage>13214</lpage>
          . URL:
          <article-title>benchmark for linguistic and temporal ground-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          https://doi.org/10.1109/CVPR52733.
          <year>2024</year>
          .
          <volume>01254</volume>
          .
          <article-title>ing in video-language models</article-title>
          ,
          <year>2023</year>
          . URL: https: [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          , Y. LeCun, S. Xie, //arxiv.org/abs/2311.07022. arXiv:
          <volume>2311</volume>
          .
          <fpage>07022</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>Eyes wide shut? exploring the visual shortcomings</article-title>
          [29]
          <string-name>
            <given-names>V.</given-names>
            <surname>Patraucean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Smaira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . R. Con-
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <article-title>of multimodal llms</article-title>
          ,
          <source>in: CVPR</source>
          <year>2024</year>
          ,
          <year>2024</year>
          . tinente, L. Markeeva,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Banarse</surname>
          </string-name>
          , S. Koppula, [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          , T. Berg, TVQA:
          <string-name>
            <surname>Lo- J. Heyward</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Malinowski</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doersch</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 2018 Conference on Em- tar, S. Osindero</source>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Damen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , J. Car-
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>tems Datasets and Benchmarks Track</source>
          ,
          <year>2023</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          https://openreview.net/forum?id=HYEGXFnPoq. [30]
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Basili</surname>
          </string-name>
          , Gqa-
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          ,
          <year>2021</year>
          . URL: https://api.semanticscholar.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          org/CorpusID:245125448. [31]
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Shafique</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vayani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rasheed</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <source>benchmark model</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>abs/2506</source>
          .07032. arXiv:
          <volume>2506</volume>
          .
          <fpage>07032</fpage>
          . [32]
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
          </string-name>
          , I. Koy-
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <article-title>the 62nd Annual Meeting of the Association for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          (Volume
          <volume>1</volume>
          : Long Pa-
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>Bangkok</surname>
          </string-name>
          , Thailand,
          <year>2024</year>
          , pp.
          <fpage>7768</fpage>
          -
          <lpage>7791</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>420</volume>
          /. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>420</volume>
          . [33]
          <string-name>
            <given-names>O.</given-names>
            <surname>Mañas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Krojer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , Improving auto-
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>els</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2310.02567.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <source>arXiv:2310</source>
          .
          <fpage>02567</fpage>
          . [34]
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Gul</surname>
          </string-name>
          , Y. Artzi, CoGen: Learning from feed-
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <source>Proceedings of the 2024 Conference on Empiri-</source>
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <string-name>
            <surname>Florida</surname>
          </string-name>
          , USA,
          <year>2024</year>
          , pp.
          <fpage>12966</fpage>
          -
          <lpage>12982</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          //aclanthology.org/
          <year>2024</year>
          .emnlp-main.
          <volume>721</volume>
          /. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2024</year>
          .emnlp-main.
          <volume>721</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>