<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Dec</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Inherence of Telicity: Unveiling Temporal Reasoning in Video Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olga Loginova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafaella Bernardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIMeC, University of Trento</institution>
          ,
          <addr-line>Corso Bettini 31, 38068 Rovereto TN</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DISI, University of Trento</institution>
          ,
          <addr-line>Via Sommarive, 9, 38123 Povo TN</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>02</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Video question answering (VQA) requires models to understand video-related questions and generate natural language answers. In multiple-choice VQA, models must associate visual content with one of several predetermined answers. As videos often encompass intricate events and actions unfolding over time, these models must be able to reason across multiple frames and discern the relationships between them with respect to the answers. This paper focuses on the Answerer component of a multiple-choice VQA model, which predicts answers using language-infused key frames. We hypothesise that the Answerer's capacity for temporal reasoning is closely intertwined with its understanding of aspectuality. To investigate this, we augment NExT-QA, a VQA dataset for causal and temporal reasoning, with annotations for telicity. We then evaluate SeViLA, a state-of-the-art multiple-choice VQA model, on the augmented dataset. Our findings demonstrate that the model generally exhibits correct handling of aspects, albeit with a bias that is inherent in human nature.</p>
      </abstract>
      <kwd-group>
        <kwd>video question answering</kwd>
        <kwd>temporal reasoning</kwd>
        <kwd>aspect</kwd>
        <kwd>telicity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Temporal ordering of actions and events is not solely
determined by time; it is also influenced by causality. The
organisation of activities in episodic memory is
established based on contingency, where one activity triggers
another [1]. Recognising cause-effect relationships is
essential for temporal understanding, as causes typically
precede effects. A cause that has reached its culmination
induces the effect.</p>
      <p>In language, linguistic aspects play a role in how
activities unfold and whether they have culminated. The
concept of telicity marks the endpoint of an activity: a
verb phrase with a clear endpoint is considered telic
(e.g., “to pick up something"), while an atelic one is
ongoing, without a specific endpoint (e.g., “to clap"). In
descriptions of a sequence of activities with the
resultative structure, there is an evident human bias towards
telic interpretation [2].</p>
      <p>Previous research explored telicity for textual
transformer-based models [3], showing that they can
classify activities based on duration and telicity with
an accuracy surpassing 80% [4]. Such performance,
at a level comparable to humans even with limited
training data, indicates their ability to capture temporal
reasoning through aspect classification.</p>
      <p>Our work extends this line of research to
video-language models, where video content comes with text
labels assigned to key frames or the whole video. Ordering
of events corresponds to changing frames, making
correct key frame extraction critical for temporal
reasoning. Action timestamps for the frames provide
additional cues for temporal reasoning. We propose a study
that focuses on contemporary video question-answering
(VQA) models in order to explore the relevance of telicity
for answering temporal questions related to
simultaneous and consecutive activities. We consider the aspects
of both the main and dependent clauses of a question.
To achieve this, we annotate1 the test set of NExT-QA
[5], widely used for causal and temporal reasoning
benchmarks, with telicity and evaluate the SeViLA model [6]
on this annotated dataset. To the best of our knowledge,
this is the first such endeavor in the VQA field.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Literature</title>
      <p>Numerous transformer-based models tackle the challenge
of video question answering [7, 8, 9, 10, 11, 12, 6, 13, 14].
These models process both the visual and textual
modalities by incorporating video, captions or subtitles, and
fuse these streams to generate the final answer. They
have shown impressive performance in modelling
multimodal VQA. However, they have never been assessed for
telicity. SeViLA [6], selected for our experiment, consists
of two modules: the Localizer, for action recognition within
videos, and the Answerer. The modules are fine-tuned based
on BLIP-2 [15]. The model has achieved the best results in
comparison to other similar models on several datasets,
such as STAR [16], NExT-QA [5], How2QA [17], and
TVQA [18].</p>
      <p>We examined datasets that offer multiple-choice
answer options, where models must choose the correct
answer from a set of candidates. CausalQA [19], Social-IQ
[20], CLEVRER [21], STAR [16], and NExT-QA [5] are
specifically designed to explore temporal dynamics and
the role of causal relationships. NExT-QA proved to be
particularly suitable for our experiment, as it is the most
comprehensive and emphasises real-world scenarios.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Annotation</title>
      <p>The NExT-QA test set comprises 1000 videos with 8564
question-answer pairs, each supported by five answer
options. Each video has from 1 to 15 questions, with an
average of 9-10; of these, we selected solely the temporal
(T-type) questions. We further excluded closed questions
and questions that do not involve two distinct temporally
linked activities, such as “did the baby get hurt after
putting out the candle" or “what are the people in this
video doing". Thus, the refined total set (RTS) consists of
2060 question-answer pairs.</p>
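<p>The selection procedure above can be sketched as a simple filter. The field names and marker lists below are hypothetical illustrations, not the actual NExT-QA schema or our exact exclusion criteria.</p>

```python
# Illustrative sketch of the RTS filtering step (field names are hypothetical).
CLOSED_PREFIXES = ("did", "does", "is", "are", "was", "were", "do")
TEMPORAL_MARKERS = ("after", "before", "while", "during", "whenever")

def is_rts_candidate(item):
    """Keep only open temporal questions linking two distinct activities."""
    if item["qtype"] != "T":                 # temporal (T-type) questions only
        return False
    q = item["question"].lower()
    if q.startswith(CLOSED_PREFIXES):        # drop closed (yes/no) questions
        return False
    # require an explicit marker of two temporally linked activities
    return any(m in q for m in TEMPORAL_MARKERS)

questions = [
    {"qtype": "T", "question": "what did the baby do after putting out the candle"},
    {"qtype": "T", "question": "did the baby get hurt after putting out the candle"},
    {"qtype": "C", "question": "why did the man wave his hand"},
    {"qtype": "T", "question": "what are the people in this video doing"},
]
rts = [q for q in questions if is_rts_candidate(q)]
print(len(rts))  # 1
```

<p>In practice, the exclusion of closed questions and of questions without two temporally linked activities was done by inspection, not by keyword matching; the sketch only mirrors the logic.</p>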
      <p>Notably, RTS questions pertaining to the following
activities are in the absolute majority, while the ones
concerning preceding actions are very few.2
2More details on the dataset are in Section A of the Appendix.</p>
      <p>We annotated the activities in the questions and answers
with one of three labels:
• T (telic) for activities implying an endpoint (e.g.,
“what happened", “pick up camera", “after the
door opens"),
• A (atelic) for enduring processes (e.g., “how is
the person in black positioned", “smiles", “while
watching"), and
• U (undefined) for activities lacking clear telicity
and duration (e.g., “what does the dog do", “do
the same", “to man’s action to him").</p>
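<p>Purely as an illustration of the three-way scheme (the actual annotation was carried out manually by a linguist, not by keyword matching), a toy tagger might look as follows; the cue lists are invented examples.</p>

```python
# Toy three-way telicity tagger (illustrative only; the real annotation was manual).
TELIC_CUES = {"pick up", "open", "happened", "put out"}   # clear endpoint
ATELIC_CUES = {"smile", "clap", "watch", "sit"}           # enduring process

def telicity_label(phrase):
    """Return T (telic), A (atelic), or U (undefined) for a phrase."""
    p = phrase.lower()
    if any(cue in p for cue in TELIC_CUES):
        return "T"
    if any(cue in p for cue in ATELIC_CUES):
        return "A"
    return "U"  # lacking clear telicity and duration

print(telicity_label("pick up camera"))   # T
print(telicity_label("while watching"))   # A
print(telicity_label("do the same"))      # U
```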
      <p>Additionally, an I (irrelevant) marker was assigned to
answers unrelated to aspectuality, such as “astonished" or
“nothing". This marker also appears among the target answers,
in response to questions like “how did the boy react
to..." or “what does the person do while...".</p>
      <p>From Table 1 it is evident that a question’s main
clause rarely imposes a definitive telic label, leaving the
model free to explore temporal relations without
predefined constraints. The majority of DCAs are telic and,
considering that most RTS questions center around
the following activity, this affirms the cause-effect nature
of the dataset, where the cause predominantly culminates
in an endpoint.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment and Results</title>
      <p>We ran SeViLA in the zero-shot setting on the test
dataset, decreasing the batch size to 2. The obtained results
revealed an overall accuracy of 63.18% and a T-type
question accuracy of 60.18%. On RTS, 58.1% of the
predicted answers matched the target answers.</p>
      <p>We further calculated the telicity precision, recall, F1
score and accuracy on the annotated RTS.</p>
      <sec id="sec-4-1">
        <title>4.1. Results</title>
        <p>SeViLA selected 781 telic (T) and 1261 atelic (A) responses,
alongside 2 instances marked as undefined (U), and 16
responses classified as irrelevant (I).3
3Additional data regarding SeViLA’s predictions in the context of
RTS can be found in Section B of the Appendix.</p>
        <p>As demonstrated in Table 2, the results verify that the
model attains an accuracy rate exceeding 80%.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Qualitative Analysis</title>
        <p>The SeViLA Answerer employs a top-k frame extraction
strategy to evaluate each frame’s probability and
determine the optimal choice for answering a question. The
erroneous answers often come from the model’s
misjudgment of instructive key frames.</p>
        <p>As shown in Figure 3, the telicity cues may have their
origins in both the question’s MCA and DCA. Just as
in the TN-question (top) SeViLA disregards the DCA’s
telic action, it also struggles to correspond with the atelic
activities of the MCA in the answer for the TC-question
(bottom).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <p>The confusion matrix shown in Figure 2 indicates a
higher frequency of atelic answers. The majority of atelic
responses might initially prompt an inference of an atelic
predisposition of the model. Upon closer examination,
however, we observed that the incidence of erroneous
allocations from atelic to telic responses is more
pronounced than in the inverse direction. Thus, the model
exhibits a clear inclination towards selecting telic values
instead of the target atelic ones: in 26.12% of the target
atelic answers it chooses the telic ones, while there are
only 14.45% of the opposite cases.</p>
      <p>While NExT-QA is distinguished as a versatile dataset,
it has limitations in representing temporal expressions
from a linguistic perspective. Primarily, its questions use
a limited set of temporal conjunctions, including after,
before, during, as, while, and whenever. A dataset with a
broader array of temporal constructions related to both
time and telicity could introduce variations, potentially
altering the model’s outcomes.</p>
      <p>Another source of result variation can stem from the
number of annotators. The annotations were created by a
professional linguist in a pilot version, but it is important
to acknowledge a potential subjective bias. To mitigate
the bias, at least three annotators are suggested for each
question-answer pair.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Linguistic models grounded in cognitive research
highlight a tendency for individuals to remember causally
linked activities. Sequential actions and events are
associated with the idea that the culmination of one activity
sets off another. This culmination is closely tied to the
internal structure of the activity, which is expressed in
language through aspects and, in particular, telicity.</p>
      <p>Using the NExT-QA dataset, we revealed that VQA models,
such as SeViLA, generally capture the contrast between
durative and endpoint activities at a human level. Whereas
they mostly tend to predict correct telicity for causal and
temporal reasoning, their inherent erroneous implication
of culminated activity, in essence, aligns with human
intuition.</p>
      <p>This revelation prompts us to answer the follow-up
question: to what extent will improving the match between
telicity in questions and answers amplify the key frame
extraction for correct answering in multiple-choice
VQA models?</p>
    </sec>
    <sec id="sec-6-2">
      <title>A.1. Types of Questions</title>
      <p>There is a prominent imbalance among the T-type questions
in RTS, with TN questions in the majority. The detailed
overview of RTS questions shows that the dataset contains
predominantly questions with the “what did S do after..."
structure.</p>
    </sec>
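<p>The per-label scores follow the standard definitions; as a sketch, with invented (predicted, target) telicity pairs:</p>

```python
def label_scores(pairs, label):
    """Precision, recall and F1 for one telicity label over (pred, target) pairs."""
    tp = sum(1 for pred, tgt in pairs if pred == label and tgt == label)
    fp = sum(1 for pred, tgt in pairs if pred == label and tgt != label)
    fn = sum(1 for pred, tgt in pairs if pred != label and tgt == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented (predicted, target) telicity pairs, just to exercise the function.
pairs = [("T", "T"), ("T", "A"), ("A", "A"), ("A", "A"), ("A", "T"), ("T", "T")]
accuracy = sum(pred == tgt for pred, tgt in pairs) / len(pairs)
p, r, f1 = label_scores(pairs, "T")
print(round(accuracy, 3), round(p, 3), round(r, 3))  # 0.667 0.667 0.667
```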
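<p>The directional asymmetry described above can be read off a telic/atelic confusion matrix as a rate over each target class. The counts below are invented, chosen only so that the two rates come out at the reported 26.12% and 14.45%.</p>

```python
# Invented counts: keys are (target, predicted) telicity labels.
confusion = {
    ("T", "T"): 592, ("T", "A"): 100,   # target telic
    ("A", "T"): 320, ("A", "A"): 905,   # target atelic
}

def error_rate(target, predicted):
    """Share of `target`-labelled items that the model assigns to `predicted`."""
    total = sum(n for (tgt, _), n in confusion.items() if tgt == target)
    return confusion[(target, predicted)] / total

print(round(error_rate("A", "T"), 4))  # atelic targets answered telic: 0.2612
print(round(error_rate("T", "A"), 4))  # telic targets answered atelic: 0.1445
```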
    <sec id="sec-7">
      <title>B. SeViLA’s performance on RTS</title>
      <p>This section presents the details concerning the data
predicted by the model.</p>
      <sec id="sec-7-1">
        <title>B.1. Matching in Absolute Numbers and Percentage</title>
        <p>Target answers show no underdetermination, with little
data irrelevant from the aspect point of view.</p>
      </sec>
      <sec id="sec-7-2">
        <title>B.2. Predicted vs. Target Answers</title>
        <p>The examination of the most frequently predicted and
target answers reveals a significant number of matches,
predominantly characterised by atelic labels.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>