<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Dec</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Inherence of Telicity: Unveiling Temporal Reasoning in Video Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olga Loginova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafaella Bernardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIMeC, University of Trento</institution>
          ,
          <addr-line>Corso Bettini 31, 38068 Rovereto TN</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DISI, University of Trento</institution>
          ,
          <addr-line>Via Sommarive, 9, 38123 Povo TN</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>02</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Video question answering (VQA) requires models to understand video-related questions and generate natural language answers. In multiple-choice VQA, models must associate visual content with one of several predetermined answers. As videos often encompass intricate events and actions unfolding over time, these models must be able to reason across multiple frames and discern the relationships between them with respect to the answers. This paper focuses on the Answerer component of a multiple-choice VQA model, which predicts answers using language-infused key frames. We hypothesise that the Answerer's capacity for temporal reasoning is closely intertwined with its understanding of aspectuality. To investigate this, we augment NExT-QA, a VQA dataset for causal and temporal reasoning, with annotations for telicity. We then evaluate SeViLA, a state-of-the-art multiple-choice VQA model, on the augmented dataset. Our findings demonstrate that the model generally exhibits correct handling of aspects, albeit with a bias that is inherent in human nature.</p>
      </abstract>
      <kwd-group>
        <kwd>video question answering</kwd>
        <kwd>temporal reasoning</kwd>
        <kwd>aspect</kwd>
        <kwd>telicity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Temporal ordering of actions and events is not solely
determined by time; it is also influenced by causality. The
organisation of activities in episodic memory is
established based on contingency, where one activity triggers
another [1]. Recognising cause-effect relationships is
essential for temporal understanding, as causes typically
precede effects. A cause that has reached its culmination
induces the effect.</p>
      <p>In language, linguistic aspects play a role in how
activities unfold and whether they have culminated. The
concept of telicity marks the endpoint of an activity: a
verb phrase with a clear endpoint is considered telic
(e.g., “to pick up something"), while an atelic one is
ongoing, without a specific endpoint (e.g., “to clap"). In
descriptions of a sequence of activities with the
resultative structure, there is an evident human bias towards
telic interpretation [2].</p>
      <p>Previous research explored telicity for textual
transformer-based models [3], showing that they can
classify activities based on duration and telicity with
an accuracy surpassing 80% [4]. Such performance,
at a level comparable to humans even with limited
training data, indicates their ability to capture temporal
reasoning through aspect classification.</p>
      <p>Our work extends this line of research to
video-language models, where video content comes with text
labels assigned to key frames or the whole video. Ordering
of events corresponds to changing frames, making
correct key frame extraction critical for temporal
reasoning. Action timestamps for the frames provide
additional cues for temporal reasoning. We propose a study
that focuses on contemporary video question-answering
(VQA) models in order to explore the relevance of telicity
for answering temporal questions related to
simultaneous and consecutive activities. We consider the aspects
of both the main and dependent clauses of a question.
To achieve this, we annotate1 the test set of NExT-QA
[5], widely used for causal and temporal reasoning
benchmarks, with telicity and evaluate the SeViLA model [6]
on this annotated dataset. To the best of our knowledge,
this is the first such endeavor in the VQA field.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Literature</title>
      <p>Numerous transformer-based models tackle the challenge
of video question answering [7, 8, 9, 10, 11, 12, 6, 13, 14].
These models process both the visual and textual
modalities by incorporating video, captions or subtitles, and
fuse these streams to generate the final answer. They
have shown impressive performance in modelling
multimodal VQA. However, they have never been assessed for
telicity. SeViLA [6], selected for our experiment, consists
of two modules: the Localizer, for action recognition within
videos, and the Answerer. The modules are fine-tuned based
on BLIP-2 [15]. The model has achieved the best results in
comparison to other similar models on several datasets,
such as STAR [16], NExT-QA [5], How2QA [17], and
TVQA [18].</p>
      <p>We examined datasets that offer multiple-choice
answer options, where models must choose the correct
answer from a set of candidates. CausalQA [19], Social-IQ
[20], CLEVRER [21], STAR [16], and NExT-QA [5] are
specifically designed to explore temporal dynamics and
the role of causal relationships. NExT-QA proved to be
particularly suitable for our experiment, as it is the most
comprehensive and emphasises real-world scenarios.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Annotation</title>
      <p>The NExT-QA test set comprises 1000 videos with 8564
question-answer pairs, each supported by five answer
options. Each video has from 1 to 15 questions, with an
average of 9-10; of these, we selected solely the temporal
(T-type) questions. We further excluded closed questions
and questions that do not involve two distinct temporally
linked activities, such as “did the baby get hurt after
putting out the candle" or “what are the people in this
video doing". Thus, the refined total set (RTS) consists of
2060 question-answer pairs.</p>
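<p>The selection procedure above can be sketched as a simple filter. The field names and marker lists below are hypothetical illustrations, not the actual NExT-QA schema or our exact exclusion criteria.</p>

```python
# Illustrative sketch of the RTS filtering step (field names are hypothetical).
CLOSED_PREFIXES = ("did", "does", "is", "are", "was", "were", "do")
TEMPORAL_MARKERS = ("after", "before", "while", "during", "whenever")

def is_rts_candidate(item):
    """Keep only open temporal questions linking two distinct activities."""
    if item["qtype"] != "T":                 # temporal (T-type) questions only
        return False
    q = item["question"].lower()
    if q.startswith(CLOSED_PREFIXES):        # drop closed (yes/no) questions
        return False
    # require an explicit marker of two temporally linked activities
    return any(m in q for m in TEMPORAL_MARKERS)

questions = [
    {"qtype": "T", "question": "what did the baby do after putting out the candle"},
    {"qtype": "T", "question": "did the baby get hurt after putting out the candle"},
    {"qtype": "C", "question": "why did the man wave his hand"},
    {"qtype": "T", "question": "what are the people in this video doing"},
]
rts = [q for q in questions if is_rts_candidate(q)]
print(len(rts))  # 1
```

<p>In practice, the exclusion of closed questions and of questions without two temporally linked activities was done by inspection, not by keyword matching; the sketch only mirrors the logic.</p>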
      <p>Notably, RTS questions pertaining to the following
activities are in the absolute majority, while the ones
concerning preceding actions are very few.2
2More details on the dataset are in Section A of the Appendix.</p>
      <p>We annotated the activities in the questions and answers
with one of three labels:
• T (telic) for activities implying an endpoint (e.g.,
“what happened", “pick up camera", “after the
door opens"),
• A (atelic) for enduring processes (e.g., “how is
the person in black positioned", “smiles", “while
watching"), and
• U (undefined) for activities lacking clear telicity
and duration (e.g., “what does the dog do", “do
the same", “to man’s action to him").</p>
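<p>Purely as an illustration of the three-way scheme (the actual annotation was carried out manually by a linguist, not by keyword matching), a toy tagger might look as follows; the cue lists are invented examples.</p>

```python
# Toy three-way telicity tagger (illustrative only; the real annotation was manual).
TELIC_CUES = {"pick up", "open", "happened", "put out"}   # clear endpoint
ATELIC_CUES = {"smile", "clap", "watch", "sit"}           # enduring process

def telicity_label(phrase):
    """Return T (telic), A (atelic), or U (undefined) for a phrase."""
    p = phrase.lower()
    if any(cue in p for cue in TELIC_CUES):
        return "T"
    if any(cue in p for cue in ATELIC_CUES):
        return "A"
    return "U"  # lacking clear telicity and duration

print(telicity_label("pick up camera"))   # T
print(telicity_label("while watching"))   # A
print(telicity_label("do the same"))      # U
```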
      <p>Additionally, an I (irrelevant) marker was assigned to
answers unrelated to aspectuality, such as “astonished" or
“nothing". This marker also appears among the target answers,
in response to questions like “how did the boy react
to..." or “what does the person do while...".</p>
      <p>From Table 1 it is evident that a question’s main
clause rarely imposes a definitive telic label, leaving the
model free to explore temporal relations without
predefined constraints. The majority of DCAs are telic and,
considering that most RTS questions center around
the following activity, this affirms the cause-effect nature
of the dataset, where the cause predominantly culminates
in an endpoint.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment and Results</title>
      <p>We ran SeViLA in the zero-shot setting on the test
dataset, decreasing the batch size to 2. The obtained results
revealed an overall accuracy of 63.18% and a T-type
question accuracy of 60.18%. On RTS, 58.1% of the
predicted answers matched the target answers.</p>
      <p>We further calculated the telicity precision, recall, F1
score and accuracy on the annotated RTS.</p>
      <sec id="sec-4-1">
        <title>4.1. Results</title>
        <p>SeViLA selected 781 telic (T) and 1261 atelic (A) responses,
alongside 2 instances marked as undefined (U), and 16
responses classified as irrelevant (I).3
3Additional data regarding SeViLA’s predictions in the context of
RTS can be found in Section B of the Appendix.</p>
        <p>As demonstrated in Table 2, the results verify that the
model attains an accuracy rate exceeding 80%.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Qualitative Analysis</title>
        <p>The SeViLA Answerer employs a top-k frame extraction
strategy to evaluate each frame’s probability and
determine the optimal choice for answering a question. The
erroneous answers often come from the model’s
misjudgment of instructive key frames.</p>
        <p>As shown in Figure 3, the telicity cues may have their
origins in both the question’s MCA and DCA. Just as
in the TN-question (top) SeViLA disregards the DCA’s
telic action, it also struggles to correspond with the atelic
activities of the MCA in the answer for the TC-question
(bottom).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <p>The confusion matrix shown in Figure 2 indicates a
higher frequency of atelic answers. The majority of atelic
responses might initially prompt an inference of an atelic
predisposition of the model. Upon closer examination,
however, we observed that the incidence of erroneous
allocations from atelic to telic responses is more
pronounced than in the inverse direction. Thus, the model
exhibits a clear inclination towards selecting telic values
instead of the target atelic ones: in 26.12% of the target
atelic answers it chooses the telic ones, while there are
only 14.45% of the opposite cases.</p>
      <p>While NExT-QA is distinguished as a versatile dataset,
it has limitations in representing temporal expressions
from a linguistic perspective. Primarily, its questions use
a limited set of temporal conjunctions, including after,
before, during, as, while, and whenever. A dataset with a
broader array of temporal constructions related to both
time and telicity could introduce variations, potentially
altering the model’s outcomes.</p>
      <p>Another source of result variation can stem from the
number of annotators. The annotations were created by a
professional linguist in a pilot version, but it is important
to acknowledge a potential subjective bias. To mitigate
the bias, at least three annotators are suggested for each
question-answer pair.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Linguistic models grounded in cognitive research
highlight a tendency for individuals to remember causally
linked activities. Sequential actions and events are
associated with the idea that the culmination of one activity
sets off another. This culmination is closely tied to the
internal structure of the activity, which is expressed in
language through aspects and, in particular, telicity.</p>
      <p>Using the NExT-QA dataset, we revealed that VQA models,
such as SeViLA, generally capture the contrast between
durative and endpoint activities at a human level. Whereas
they mostly tend to predict correct telicity for causal and
temporal reasoning, their inherent erroneous implication
of culminated activity, in essence, aligns with human
intuition.</p>
      <p>This revelation prompts us to answer the follow-up
question: to what extent will improving the match between
telicity in questions and answers amplify the key frame
extraction for correct answering in multiple-choice
VQA models?</p>
    </sec>
    <sec id="sec-6-2">
      <title>A.1. Types of Questions</title>
      <p>There is a prominent imbalance among the T-type questions
in RTS, with TN questions in the majority. The detailed
overview of RTS questions shows that the dataset contains
predominantly questions with the “what did S do after..."
structure.</p>
    </sec>
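<p>The per-label scores follow the standard definitions; as a sketch, with invented (predicted, target) telicity pairs:</p>

```python
def label_scores(pairs, label):
    """Precision, recall and F1 for one telicity label over (pred, target) pairs."""
    tp = sum(1 for pred, tgt in pairs if pred == label and tgt == label)
    fp = sum(1 for pred, tgt in pairs if pred == label and tgt != label)
    fn = sum(1 for pred, tgt in pairs if pred != label and tgt == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented (predicted, target) telicity pairs, just to exercise the function.
pairs = [("T", "T"), ("T", "A"), ("A", "A"), ("A", "A"), ("A", "T"), ("T", "T")]
accuracy = sum(pred == tgt for pred, tgt in pairs) / len(pairs)
p, r, f1 = label_scores(pairs, "T")
print(round(accuracy, 3), round(p, 3), round(r, 3))  # 0.667 0.667 0.667
```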
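<p>The directional asymmetry described above can be read off a telic/atelic confusion matrix as a rate over each target class. The counts below are invented, chosen only so that the two rates come out at the reported 26.12% and 14.45%.</p>

```python
# Invented counts: keys are (target, predicted) telicity labels.
confusion = {
    ("T", "T"): 592, ("T", "A"): 100,   # target telic
    ("A", "T"): 320, ("A", "A"): 905,   # target atelic
}

def error_rate(target, predicted):
    """Share of `target`-labelled items that the model assigns to `predicted`."""
    total = sum(n for (tgt, _), n in confusion.items() if tgt == target)
    return confusion[(target, predicted)] / total

print(round(error_rate("A", "T"), 4))  # atelic targets answered telic: 0.2612
print(round(error_rate("T", "A"), 4))  # telic targets answered atelic: 0.1445
```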
    <sec id="sec-7">
      <title>B. SeViLA’s performance on RTS</title>
      <p>This section presents the details concerning the data
predicted by the model.</p>
      <sec id="sec-7-1">
        <title>B.1. Matching in Absolute Numbers and Percentage</title>
        <p>Target answers show no underdetermination, with little
data irrelevant from the aspect point of view.</p>
      </sec>
      <sec id="sec-7-2">
        <title>B.2. Predicted vs. Target Answers</title>
        <p>The examination of the most frequently predicted and
target answers reveals a significant number of matches,
predominantly characterised by atelic labels.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>