<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.jmsy.2020.07.011</article-id>
      <title-group>
        <article-title>German task-oriented VQA dataset annotated with human visual attention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Moritz Kronberger</string-name>
          <email>moritz.kronberger1@tha.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Ventura</string-name>
          <email>viviana.ventura@tha.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Hochschule Augsburg</institution>
          ,
          <addr-line>An der Hochschule 1, 86161 Augsburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>43</lpage>
      <abstract>
        <p>Video question answering (VQA) is a challenging task that requires models to generate answers by using both information from text and video. We present Task-oriented Human Attention Video Question Answering (THAVQA), a new VQA dataset consisting of third- and first-person videos of an instructor using a sewing machine. The sewing task is formalized step-by-step in a script: each step consists of a video annotated with German-language open-ended question and answer (QA) pairs and with human visual attention. The paper also includes a first assessment of the performance of a pre-trained Multimodal Large Language Model (MLLM) in generating answers to the questions of our dataset across different experimental settings. Results show that our task-oriented dataset is challenging for pre-trained models. Specifically, the model struggles to answer questions requiring technical knowledge or spatio-temporal reasoning.</p>
      </abstract>
      <kwd-group>
        <kwd>video question answering</kwd>
        <kwd>human visual attention</kwd>
        <kwd>multimodal large language model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>This paper presents a new VQA dataset based on demonstrations of basic sewing machine operations. To our knowledge, THAVQA, which is also annotated with human visual attention, is the first task-oriented VQA dataset in the German language.</p>
      <p>Building the dataset is a first step in a larger project aimed at developing an AI assistant for a sewing machine. This AI assistant would support students when using sewing machines for the first time. For example, this could mean answering questions about basic machine settings or explaining fundamental sewing skills. Our dataset poses unique challenges for VQA models and is almost unique among state-of-the-art VQA datasets since it is user- and task-oriented: the questions collected are those that a real user would ask for help while using the sewing machine. The process of operating the sewing machine was decomposed in a script into steps and sub-steps that were recorded and on which questions and answers were annotated. Specialized knowledge of the process and an understanding of spatial and temporal relationships are required for answering the questions collected. In addition, the limited visual variety of the video scenes and the specialized language and vocabulary challenge VQA models.</p>
      <p>Annotating human attention in the video inputs of VQA models has recently been shown to improve their performance in user- and task-oriented datasets [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. In our dataset, the workshop instructor’s eye gaze has been used as a proxy for human visual attention. The concept behind it is that human visual attention, integrated as input into VQA models, can help the model distinguish between video frames, especially in datasets in which the recorded scenes are very similar to each other as there are few participants and staged events.</p>
      <p>Our paper also provides a first assessment of the VQA performance of the pre-trained MLLM Gemini 1.5 Pro (https://deepmind.google/technologies/gemini/pro/) on THAVQA. Indeed, new releases of LLMs, such as Gemini 1.5 [<xref ref-type="bibr" rid="ref3">3</xref>] but also GPT-4 [<xref ref-type="bibr" rid="ref4">4</xref>], Llama 2 [5] or Claude 3 [6], now allow for visual inputs, making it possible to perform VQA tasks using pre-trained models directly.</p>
      <p>To sum up, this paper presents (1) a new dataset with third-person videos of an instructor operating a sewing machine and first-person videos annotated with human visual attention, QA pairs in German, and a script in German of the steps required to operate the machine; and (2) an evaluation of the performance of a pre-trained MLLM on generating open-ended answers from the questions and videos of our dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related</title>
    </sec>
    <sec id="sec-4">
      <title>Work</title>
      <p>The majority of state-of-the-art VQA datasets portray complex scenes composed of many events and participants, gathered using either synthetic simulation data or data sourced from movies, social media, video games or the web [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. VQA models are then tasked with answering questions about the videos’ content. This requires a wide variety of reasoning abilities, such as reasoning about spatial and temporal relationships, causal inference or relationships between actions and objects [16, 18].</p>
      <p>In contrast, research on task-oriented VQA, where question answering supports users with tasks such as industrial assembly and disassembly [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>] or collaborative machine operation [19], is relatively limited. Similarly, the setting of our dataset, a tutorial on sewing machine operation, is task-oriented and requires specialized knowledge, which makes it difficult for pre-trained MLLMs to generate satisfactory answers from only their inherent knowledge. In line with the task-oriented approaches of Ilaslan et al. [<xref ref-type="bibr" rid="ref1">1</xref>] and Gao et al. [10], we adopt both a fixed third-person view (TPV) and the first-person view (FPV) of the workshop instructor during the video recordings. To our knowledge, no other German datasets exist specifically for task-oriented VQA.</p>
      <p>Human and model attention in VQA seem to be related, as human visual attention has been shown to be correlated with model attention for VQA [20] and differences in their attention can be used to explain disagreement in VQA [21]. Human attention has been modeled explicitly by eye [<xref ref-type="bibr" rid="ref1">1</xref>] and hand tracking [<xref ref-type="bibr" rid="ref2">2</xref>] and included in the input of VQA models in order to highlight important parts of the videos that correspond to the user intentions. These annotations of human visual attention have been shown to improve VQA performance, even when using pre-trained encoders without specific fine-tuning to extract features from the visual data [<xref ref-type="bibr" rid="ref1">1</xref>]. With these intuitions, we annotated the FPV videos in our dataset with human visual attention.</p>
    </sec>
    <sec id="sec-4">
      <title>3. The Dataset</title>
      <sec id="sec-4-1">
        <title>3.1. Dataset Structure</title>
        <p>The setting of our custom VQA dataset is the introduction to sewing machine operation presented in tutorial form. We based the contents on a sewing machine workshop held at the Technische Hochschule Augsburg as part of an elective module on Smart Textiles at the Faculty of Design. We first structured the contents and detailed instructions of the workshop in a script, which primarily served as a template for video data collection. The script contains seven larger tasks, such as setting up the machine and performing different kinds of sewing operations on different kinds of fabrics, each with three to eight smaller sub-steps (35 in total), which in turn require multiple actions to be performed. The script’s contents are available as part of the publicly accessible dataset (see Online Resources).</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Video Data Collection</title>
        <p>We recorded video data of the workshop being performed by the instructor. All videos depict a regular consumer-grade sewing machine being operated by the instructor at a table (see Figure 1). The video background is visually complex and reflects the real workshop environment. We also extended the video dataset to two student participants using exactly the same recording procedure (same environment, perspectives and script steps). The extended dataset, containing a total of 48 minutes of footage, is available on request. To reduce the chance of errors in the video demonstrations negatively impacting VQA performance, we rely exclusively on the expert demonstrations for the scope of this paper.</p>
        <p>[Figure 1: (a) third-person view of the recording setup; (b) first-person view with the annotated gaze circle (FPVC); (c) first-person view with the attention map (FPVA).]</p>
        <p>Two different camera perspectives were recorded simultaneously: a static TPV looking over the instructor’s left shoulder towards the machine (see Figure 1a) as well as a dynamic FPV of the instructor (see Figure 1b). For recording the FPV we used the Tobii Pro Glasses 3 eye tracking glasses (https://www.tobii.com/products/eye-trackers/wearables/tobii-pro-glasses-3) and collected the instructor’s eye gaze fixations for the entire duration of the recordings. We split the recordings (TPV and FPV) into the 35 sub-steps and manually synchronized them across both perspectives.</p>
        <p>We chose two different types of annotations to represent the human attention in the FPV. First, we annotated the 2D location of the instructor’s eye gaze via a red circular outline (FPVC) (see Figure 1b), representing a bounding box for the current area of human attention, similar to the annotation style of Ilaslan et al. [<xref ref-type="bibr" rid="ref1">1</xref>]. We also created a second annotation layer, attention maps (FPVA), where each pixel is masked with increasing intensity with increasing distance to the gaze fixation point (see Figure 1c). Although this masking may obscure important information in the video, it clearly restricts the model’s visual input to the human focal point.</p>
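        <p>For illustration, both annotation styles can be sketched from an exported gaze fixation as follows (a minimal sketch assuming OpenCV and NumPy; the circle radius and the Gaussian falloff of the mask are illustrative choices rather than the exact parameters used for THAVQA):</p>
        <preformat>
import cv2
import numpy as np

def annotate_gaze_circle(frame: np.ndarray, gaze_xy: tuple[int, int], radius: int = 60) -> np.ndarray:
    """FPVC-style annotation: draw a red circular outline around the gaze point."""
    out = frame.copy()
    cv2.circle(out, gaze_xy, radius, color=(0, 0, 255), thickness=4)  # BGR red outline
    return out

def annotate_attention_map(frame: np.ndarray, gaze_xy: tuple[int, int], sigma: float = 120.0) -> np.ndarray:
    """FPVA-style annotation: darken pixels more strongly the further they are from the gaze point."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2)
    keep = np.exp(-(dist ** 2) / (2 * sigma ** 2))  # 1.0 at the fixation point, close to 0 far away
    return (frame * keep[..., None]).astype(np.uint8)
        </preformat>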
      </sec>
      <sec id="sec-4-3">
        <title>3.3. QA Pair Collection</title>
        <p>We recruited 10 German-speaking crowdworkers on the Prolific platform (https://www.prolific.com) to formulate open-ended question-answer pairs on the recorded videos. Crowdworkers were offered an approximate hourly reward of 11.80€ including bonuses. Crowdworkers were shown a random video in the TPV that represents a sub-step, together with the corresponding sub-step in the script. Giving annotators access to the script’s contents, a description of the actions performed on the sewing machine by the instructor (see Section 3.1), did cause the resulting QA pairs to be less focused on the contents of the video and more focused on the contents of the textual descriptions. However, we still opted to include the textual context in order to encourage the use of correct technical language by the non-expert annotators and to ensure a better understanding of the videos’ contents. Additionally, we discarded QA pairs that were either factually incorrect, not intelligible or ungrammatical.</p>
        <p>The resulting QA pairs were then manually annotated by reasoning type (see Figures 2-3 in the Appendix):
• knowledge-based reasoning when questions need technical knowledge to be answered;
• spatial reasoning when locations or directions are to be described;
• temporal reasoning when questions are related to the sequential order of actions;
• perception-based reasoning when the answer can only be retrieved by visually inspecting the video.</p>
        <p>The categorization of QA pairs into these reasoning types is often ambiguous, especially when differentiating whether a question pertains to knowledge-based reasoning as opposed to spatial or temporal reasoning. In fact, most knowledge about how to sew is based on spatial and temporal information. For example, the question “What happens after winding the bobbin?” is temporal in nature but could also be answered from the model’s inherent pre-training knowledge instead of extracting temporal information from the video input. We therefore approached the labeling process of QA pairs as follows:
• If a question can be answered by locating objects in the visual input it is categorized as requiring spatial reasoning.
• If a question can be answered by observing and relating the video input over multiple frames it is categorized as requiring temporal reasoning.
• If a question cannot be reasonably answered from the video input but rather requires using pre-training knowledge it is categorized as requiring knowledge-based reasoning.</p>
        <p>This approach still leaves some amount of ambiguity; for example, specialized knowledge about sewing-machine-specific terms may be required in order to identify the object, for example “the bobbin”, to be located in a QA pair about temporal reasoning. For the QA pair annotation it was therefore decided whether a question corresponds to a single reasoning type or whether it should be assigned to multiple reasoning types.</p>
        <p>The different reasoning types also give an indication of which dataset modality is required for the model to answer the dataset’s questions. Strictly knowledge-based questions, for instance, primarily test the model’s pre-training knowledge and are therefore not expected to profit from a visual input modality. Spatial and temporal questions both require the model to extract additional information from visual inputs. For spatial reasoning, a sequence of video frames might help with occlusion or depth perception; however, in most cases a static image will offer the required context for a spatial question to be answered. Temporal reasoning requires the model to relate visual information over a span of multiple frames, making video context a requirement to answer temporal questions.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Descriptive Statistics</title>
        <p>In total the video recordings span 16 minutes and 24 seconds across the TPV and FPV, with a mean duration of 14 seconds for single sub-step-related video clips.</p>
        <p>Since the dataset’s scenario only involves sewing machine operation, we expect limited variability within the contents of the videos. This might mean that the video data offers little usable information to a pre-trained MLLM. We quantified this lack of visual variation as the semantic similarity of video frames within a single video clip related to one of the 35 sub-steps. We obtained the semantic similarity scores by randomly sampling 20 frames for each clip and transforming them into embeddings using the CLIP model [22]. We used cosine similarity [23] as the distance metric and calculated the mean of the similarity matrix between all 20 embeddings. We compared this semantic similarity for the TPV and FPV, including both types of annotations for human visual attention (see Table 1). As expected, the frames within video clips are very similar, with the static TPV exhibiting the largest semantic similarity between video frames. The FPV annotated with attention maps displays the second highest similarity score, possibly due to the fact that large portions of the frames are masked and the position of the focal point does not alter the embedding vector significantly. We do not find a difference between the similarity scores of the regular FPV and the FPV including the circle annotation of the eye gaze. Overall, this indicates that a pre-trained MLLM may struggle to extract and meaningfully interpret human attention information.</p>
        <p>[Table 1: mean semantic similarity of video frames within single clips for the TPV, FPV, FPVC and FPVA recordings.]</p>
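        <p>A minimal sketch of this frame-similarity computation is given below (it assumes the Hugging Face transformers implementation of CLIP; the checkpoint name is illustrative, as the paper does not specify the CLIP variant):</p>
        <preformat>
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # illustrative checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mean_frame_similarity(frame_paths: list[str]) -> float:
    """Mean pairwise cosine similarity between CLIP embeddings of the sampled frames of one clip."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)       # shape: (n_frames, dim)
    emb = emb / emb.norm(dim=-1, keepdim=True)         # normalize so dot products are cosine similarities
    sim = (emb @ emb.T).cpu().numpy()                  # full similarity matrix between all frames
    return float(np.mean(sim))                         # mean over the matrix, as described in Section 3.4
        </preformat>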
        <p>After manually filtering incorrect or unintelligible QA pairs and annotating the reasoning types, we obtained a total of 122 QA pairs, with 1 to 9 QA pairs per sub-step of the script. Additionally, we prompted Gemini 1.5 Pro to answer the 122 questions, obtaining a total of 2562 answers; further details are described in Section 4. We found 96 QA pairs to pertain to knowledge-based reasoning, with 33 QA pairs requiring spatial, 15 temporal and 4 perception-based reasoning (see Figure 3 in the Appendix). A total of 24 QA pairs were annotated with more than one reasoning type due to ambiguity. All but one of these pairs were assigned the label “knowledge-based reasoning” in combination with at least one more reasoning type.</p>
        <p>Additionally, we analyzed the diversity of QA pairs in terms of token and lemma counts as well as the Root Type-Token Ratio (RTTR), calculated using the default parameters of Shen [24] (see Table 2). We calculated the descriptive statistics as a mean over single questions and answers as well as across all questions, answers and the entire dataset. The questions and answers provided by the human annotators are largely brief and concise, resulting in low token and lemma counts alongside a low RTTR. When extending the calculations to all questions and answers or the entire dataset, repetitions become more frequent, evidenced by a higher RTTR.</p>
        <p>Table 2: Mean statistics over single questions and answers as well as across all questions, answers and the entire dataset.
Tokens | Lemmas | RTTR
Single questions: 9.79 ±3.0 | 9.12 ±2.43 | 2.88 ±0.45
Single answers: 12.58 ±8.74 | 10.45 ±5.83 | 2.99 ±0.85
Questions: 1519 | 286 | 9.34
Answers: 1950 | 371 | 9.94
Total: 3469 | 502 | 10.31</p>
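        <p>The RTTR divides the number of distinct types by the square root of the token count. A minimal sketch of the computation is shown below (the whitespace tokenization is illustrative; the paper relies on the default parameters of Shen [24]):</p>
        <preformat>
import math

def rttr(tokens: list[str]) -> float:
    """Root Type-Token Ratio: number of unique types divided by the square root of the token count."""
    return len(set(tokens)) / math.sqrt(len(tokens))

# Illustrative usage on a whitespace-tokenized question; the reported lemma counts additionally
# require a German lemmatizer (for example spaCy's de_core_news_sm pipeline).
question = "Wo befindet sich der integrierte Fadenschneider der Maschine ?"
tokens = question.split()
print(len(tokens), rttr(tokens))
        </preformat>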
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Methodology</title>
      <p>For the evaluation we selected Gemini 1.5 Pro as an example of a pre-trained MLLM. Gemini 1.5 Pro is part of a new family of highly capable multi-modal models, Gemini 1.5, and it is a sparse mixture-of-experts Transformer-based model. Due to its long input context of up to 10 million tokens it is capable of processing video inputs at a high resolution and sampling rate [<xref ref-type="bibr" rid="ref3">3</xref>], giving it a good chance at extracting detailed visual information. We accessed Gemini through the Vertex AI inference API (https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference). We prompted Gemini to answer the questions formulated by the human annotators. To evaluate the model’s performance, the answers generated by Gemini are manually compared against the human gold-standard answers. Two human annotators gave binary labels of whether or not the model answer could serve as an acceptable replacement for the human answer. The two annotators were trained by tagging part of the dataset together. Given the clarity of the binary annotation task, they then annotated the remaining part of the dataset by themselves. Instances where the model refused to answer due to a lack of information were labeled as not acceptable. For the final evaluation score we expressed the ratio of acceptable answers to the total number of answers as binary accuracy (see Table 3).</p>
      <p>To evaluate the impact of different inputs (FPV, TPV, human visual attention, script) on the VQA performance of Gemini we constructed seven ablation settings.</p>
      <p>First, we prompted the model with the questions and did not include any other context in the form of textual information or videos. We refer to this ablation setting as the naive baseline. We expected this configuration to serve as the bottom limit of model performance, relying exclusively on the model’s inherent knowledge gathered from pre-training.</p>
      <p>For the second ablation scenario, we included the instructions for the sub-step of the script any given question was formulated for. These instructions do not only aid with knowledge-based questions but also contain important descriptions about the temporal order and spatial location of actions. Excluding perception-based reasoning, we therefore expected this ablation setting to represent the upper limit of model performance. As such, this ablation setting is referred to as the text-only reference model.</p>
      <p>Third, we included an FPV video clip corresponding to the given question along with the sub-step instructions. We refer to this model as the multimodal reference model and expect it to perform similarly to the text-only reference model, with the additional ability to reason about perception-based questions. If satisfactory answers cannot be generated from the model’s pre-training knowledge, we would expect both reference models to outperform the naive baseline significantly.</p>
      <p>In the remaining four ablation settings, we included a single video clip related to the given question with every prompt. Each ablation setting used video clips either from a specific perspective (TPV or FPV) or with a specific type of visual attention information, either the red circle (FPVC) or the attention map (FPVA). For these settings we did not include any other textual information, meaning all information present in the answers must have been inherent to the model or extracted from the video.</p>
      <p>We repeated the same prompt for every question in every ablation setting three times to account for variations in the model’s output. This resulted in 366 model responses per ablation setting, a total of 2562 answers. Additional information about the model prompts is provided in Section E of the Appendix. Since THAVQA is imbalanced towards knowledge-based questions, we reduced their amount by randomly sampling knowledge-based questions. We chose the sample size with a margin of error of 5%, a confidence of 95% and the estimated proportion set maximally at 0.5. With finite population correction we thereby reduced the amount of knowledge-based model answers from 210 to 143 per ablation setting. Model answers including spatial reasoning accounted for 99, temporal reasoning for 45 and perception-based reasoning for 12 model answers per ablation setting. This means that the evaluated model answers were still imbalanced towards knowledge-based reasoning.</p>
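      <p>For illustration, the following minimal sketch shows how such ablation prompts can be issued through the Vertex AI Python SDK (the project ID, region, video URI and the shortened prompt text are placeholders, not the exact configuration used for the experiments):</p>
      <preformat>
from typing import Optional
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-project", location="europe-west3")  # placeholder project and region
model = GenerativeModel("gemini-1.5-pro")

def ask(question: str, instructions: Optional[str] = None,
        video_uri: Optional[str] = None, n_repeats: int = 3) -> list[str]:
    """Query Gemini once per repetition with the parts required by one ablation setting."""
    parts = []
    if video_uri:  # video-based settings (TPV, FPV, FPVC, FPVA) attach a sub-step clip
        parts.append(Part.from_uri(video_uri, mime_type="video/mp4"))
    prompt = ("Du bist ein Nähmaschinenassistent. Beantworte Fragen zu Benutzung "
              "einer Nähmaschine so korrekt und präzise wie möglich.\n")
    if instructions:  # text-only and multimodal reference settings add the script excerpt
        prompt += ("\nFolgender Auszug aus einer Anleitung hat möglicherweise "
                   f"Bezug zur Frage:\n{instructions}\n")
    prompt += f"\nDie Frage lautet:\n{question}"
    parts.append(prompt)
    return [model.generate_content(parts).text for _ in range(n_repeats)]
      </preformat>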
    </sec>
    <sec id="sec-5b">
      <title>5. Evaluation</title>
      <p>We calculated the binary answer accuracy (see Section 4) for every ablation setting and reasoning type as shown in Table 3. To test for statistical significance we calculated χ² on a contingency table of the binary “acceptable” labels between every pair of ablation settings for every reasoning type. We accepted p-values &lt; 0.05 as statistically significant.</p>
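      <p>A minimal sketch of the binary accuracy and of this pairwise significance test is given below (assuming SciPy; the label arrays are illustrative stand-ins for the manual acceptability annotations):</p>
      <preformat>
import numpy as np
from scipy.stats import chi2_contingency

def binary_accuracy(labels: np.ndarray) -> float:
    """Ratio of answers judged acceptable (1) to all answers in one ablation setting."""
    return float(np.mean(labels))

def compare_settings(labels_a: np.ndarray, labels_b: np.ndarray, alpha: float = 0.05):
    """Chi-square test on the 2x2 contingency table of 'acceptable' labels for two settings."""
    table = np.array([
        [np.sum(labels_a == 1), np.sum(labels_a == 0)],
        [np.sum(labels_b == 1), np.sum(labels_b == 0)],
    ])
    chi2, p, _, _ = chi2_contingency(table)
    return p, bool(p &lt; alpha)  # significant if p falls below the 0.05 threshold

# Illustrative usage with random stand-in labels for two ablation settings
# (299 evaluated answers per setting: 143 + 99 + 45 + 12).
rng = np.random.default_rng(0)
baseline = rng.integers(0, 2, size=299)
reference = rng.integers(0, 2, size=299)
print(binary_accuracy(reference), compare_settings(baseline, reference))
      </preformat>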
      <p>Both reference models outperformed the naive baseline significantly in terms of total accuracy over all reasoning types (4.28e-25 ≤ p ≤ 4.57e-19). This confirms that the chosen task-oriented VQA scenario of sewing machine operation was specialized enough that Gemini was not able to provide satisfactory answers using only its pre-training knowledge. For perception-based reasoning questions, no significant difference in accuracy between the naive baseline and the text-only reference model was found. However, both were outperformed significantly by the multimodal reference model (0.004 ≤ p ≤ 0.04). We can therefore conclude that the model was generally able to extract meaningful information from the video inputs. Across all individual reasoning types other than perception-based questions, no statistically significant differences between the performances of the text-only and multimodal reference models could be observed, indicating that the textual instructions included enough spatial and temporal information to make the additional video input redundant.</p>
      <p>All video-only ablation scenarios (TPV, FPV, FPVC, FPVA) across all individual reasoning types except for perception-based reasoning were outperformed by both reference models and did not show significant advantages over the naive baseline. Given that even the multimodal reference model was not able to significantly improve upon the text-only reference model, these results were to be expected. Similarly, the video-only ablation scenarios were able to improve over the accuracy of the naive baseline and the text-only reference model with respect to perception-based reasoning, although these results were above or close to the cutoff for statistical significance (0.004 ≤ p ≤ 0.4).</p>
      <p>More importantly, however, for any individual reasoning type, annotating human attention via either annotation type (FPVC and FPVA) did not significantly improve accuracy in comparison to the regular FPV or TPV videos. This confirms that the pre-trained MLLM was in fact not able to meaningfully interpret the human attention annotations without fine-tuning.</p>
      <p>Overall, the experimental setup was suitable to reveal differences in VQA performance for the different forms of video inputs and reasoning types. In fact, the task-oriented nature of THAVQA was challenging for a pre-trained MLLM such as Gemini: while the model was often able to extract enough information for questions requiring basic perception, this was not the case for questions involving complex reasoning about temporal or spatial dimensions that are peculiar to a procedural task such as sewing. For these types of reasoning the model achieved its best performances when detailed textual information related to the corresponding sub-steps was included in the ablation scenarios. Besides the nature of the questions formulated, the videos themselves may also be challenging for the model: we can hypothesize that this is due to the high semantic similarity between the video frames, as shown in Section 3.4.</p>
      <sec id="sec-5b-1">
        <title>5.1. Qualitative Analysis</title>
        <p>If no video inputs were included for perception-based questions, such as retrieving the fabric’s color, Gemini mostly pointed out that it was lacking the information required to provide an answer. Additionally, including video inputs seemed to help the model disambiguate questions. For example, the naive baseline misunderstood a question about removing excess threads from the work piece, interpreting it as referring to undoing entire unwanted seams. With video inputs, the model was able to infer that the question was simply related to trimming long threads hanging off the fabric. Finally, we found that video context seemed to encourage the model to provide descriptions of spatial relationships, even when this is not strictly required to answer the question.</p>
        <p>Overall, we observed a positive effect of video inputs on the model’s answers when compared to the naive baseline. Examples are provided in the Appendix (Figures 5-7).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We provide THAVQA, a new task-oriented, German-language VQA dataset on demonstrations of sewing machine operation with open-ended human QA pairs and human visual attention. We then compared the VQA performance of Gemini 1.5 Pro on THAVQA while varying the model inputs. We found that the task-oriented scenario of THAVQA was specific enough that the model could not rely on only its inherent knowledge to generate satisfactory responses. The questions contained in our dataset were beyond the capacity of the model to reason about the video data. Combining textual instructions with a first-person video resulted in the best-performing model across all reasoning types of questions.</p>
      <p>When looking towards the design of a VQA model for a future, practical sewing machine assistant, video inputs could therefore be used mainly to improve the model’s perception abilities, while a retrieval system for textual information could provide the necessary specialized knowledge.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This research was funded by the Bavarian State Ministry for Science and the Arts (StMWK: Bayerisches Staatsministerium für Wissenschaft und Kunst) as part of the project ”CHIASM” (Chancenreiche industrielle Anwendungen für vortrainierte Sprachmodelle) and as part of the High Tech Agenda of the Free State of Bavaria. We thank Rebecca Bilger of the Education and Learning Lab for Sustainability Innovations (ELLSI) for her support with the topic of sewing machine operation, the scheduling and organization of data collection and her participation in the video dataset. We also thank the research group for Applied Technologies of Language and Assistance Systems (THA_atlas) at the Technische Hochschule Augsburg for supporting the project with advice and equipment.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
      <p>The dataset, including synchronized video data with annotated eye gaze as well as human-formulated and model-generated question-answer pairs with reasoning type annotations, is available via https://github.com/tha-atlas/HowDoesSewingMachineWork.git.</p>
    </sec>
    <sec id="sec-8">
      <title>B. Crowdsourced</title>
    </sec>
    <sec id="sec-9">
      <title>Question-Answer Formulation</title>
      <p>Where is the sewing machine’s built-in thread
cutter located?
Wo befindet sich der integrierte Fadenschneider
der Maschine?
What does the seamstress check at the end of
the sewing?
Was kontrolliert die Näherin am Ende des
Nähens?
What color is the fabric in the video?
Welche Farbe hat der Stof in dem Video?
(b)
(c)
(d)</p>
    </sec>
    <sec id="sec-10">
      <title>D. Semantic Similarity of Human and Model Answers</title>
      <p>We also evaluated the similarity between human and
model answers for every ablation scenario as a sentence
BLEU-score [25] and BERT-scores [26] with precision,
recall and F1-score (see Table 4). However, we excluded
these metrics from the main evaluation, since they do
not provide a direct measure for the factual correctness
of the model’s responses. As expected, the reference
model with access to the same textual information that
annotators were using to formulate QA pairs achieves
the highest semantic similarity to human answers.</p>
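      <p>A minimal sketch of these similarity metrics is shown below (assuming NLTK for the sentence-level BLEU score and the bert-score package; the model answer shown is an illustrative paraphrase of the human gold answer from Section E):</p>
      <preformat>
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

human = "Die Farbe des Stoffes ist blau."
model = "Der Stoff im Video ist blau."

# Sentence-level BLEU between one model answer and the human gold-standard answer.
bleu = sentence_bleu([human.split()], model.split(),
                     smoothing_function=SmoothingFunction().method1)

# BERTScore precision, recall and F1 for a batch of (model, human) answer pairs in German.
P, R, F1 = bert_score([model], [human], lang="de")

print(bleu, float(P[0]), float(R[0]), float(F1[0]))
      </preformat>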
    </sec>
    <sec id="sec-11">
      <title>E. Model Prompts</title>
      <p>Videos were included in the prompts as media attachments (see https://ai.google.dev/gemini-api/docs/prompting_with_media#prompting-with-videos). The base prompt and its context additions are shown below in English translation and in the German original.</p>
      <p>Base prompt (English translation):
You are a sewing machine assistant. Answer questions about using a sewing machine as accurately and precisely as possible.
It may be difficult to answer the questions based on the given context. However, there is no way to ask follow-up questions. Therefore, always try to answer the question as well as possible.
The answer should be concise and directly related to the question, such as:
Question: How do I ...?
Answer: In order to ..., you ...
The question is:
&lt;question&gt;</p>
      <p>Base prompt (German original):
Du bist ein Nähmaschinenassistent. Beantworte Fragen zu Benutzung einer Nähmaschine so korrekt und präzise wie möglich.
Möglicherweise sind die Fragen anhand des gegeben Kontexts schwer zu beantworten. Es gibt jedoch keine Möglichkeit Rückfragen zu stellen. Versuche also die Frage stets so gut wie möglich zu beantworten.
Die Antwort sollte bündig formuliert sein und lediglich Bezug zur Frage aufweisen, also beispielsweise:
Frage: Wie mache ich ...?
Antwort: Um ... durchzuführen wird ...
Die Frage lautet:
&lt;question&gt;</p>
      <p>Script excerpt (text-only and multimodal reference settings):
The following excerpt from a manual may be related to the question: &lt;instructions&gt;
Folgender Auszug aus einer Anleitung hat möglicherweise Bezug zur Frage: &lt;instructions&gt;</p>
      <p>Video context (video-based settings):
The answer to the question is demonstrated in the attached video. &lt;video&gt;
Die Antwort auf die Frage wird im beigefügten Video demonstriert. &lt;video&gt;</p>
      <p>Gaze circle note (FPVC):
The gaze point of a person performing the task is annotated with a red circle and highlights important areas of the image.
Der Blickpunkt eines Menschen beim Durchführen der Aufgabe ist durch einen roten Kreis annotiert und weist auf wichtige Bildbereiche hin.</p>
      <p>Attention map note (FPVA):
In the video, the viewpoint of a person performing the task is masked, thus restricting the video to the most important area of the image.
Im Video ist der Blickpunkt eines Menschen beim Durchführen der Aufgabe maskiert, das Video also auf den wichtigsten Bildbereich beschränkt.</p>
      <p>Example answers across ablation settings (Figures 5-7):</p>
      <p>Question: What color is the fabric in the video? / Welche Farbe hat der Stoff in dem Video?
Answer (naive baseline): It is impossible to determine the color of the fabric in a video without seeing the video. / Es ist unmöglich die Farbe des Stoffes in einem Video zu bestimmen, ohne das Video zu sehen.
Answer (FPV): The color of the fabric is blue. / Die Farbe des Stoffes ist blau.</p>
      <p>Question: How should excess threads be removed from the work piece? / Wie sollten die überschüssigen Fäden des Werkstückes entfernt werden?
Answer (naive baseline): To remove excess threads, use a seam ripper to carefully open the seams and remove the unwanted threads. / Um überschüssige Fäden zu entfernen, verwende einen Nahtauftrenner, um die Nähte vorsichtig zu öffnen und die unerwünschten Fäden zu entfernen.
Answer (FPV): To remove excess threads, cut them off with fabric scissors. / Um überschüssige Fäden zu entfernen, schneidet man diese mit einer Stoffschere ab.</p>
      <p>Question: How do I manually insert the needle into the fabric? / Wie führe ich die Nadel manuell in den Stoff ein?
Answer (naive baseline): The needle can be manually inserted using the handwheel. / Mithilfe des Handrads kann die Nadel manuell eingeführt werden.
Answer (TPV): To manually insert the needle into the fabric, turn the handwheel on the right side of the sewing machine towards yourself. / Um die Nadel manuell in den Stoff einzuführen, dreht man das Handrad an der rechten Seite der Nähmaschine zu sich heran.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ilaslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          , M. Shou,
          <article-title>GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze TaskOriented Collaborations</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>10462</fpage>
          -
          <lpage>10479</lpage>
          . URL: https: //aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>648</volume>
          . doi:
          <volume>10</volume>
          . 18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>648</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fang</surname>
          </string-name>
          , Y. Cheng, N. Gauthier,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>TaskOriented Multi-Modal Question Answering For Collaborative Applications</article-title>
          , in: 2020
          <source>IEEE International Conference on Image Processing (ICIP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1426</fpage>
          -
          <lpage>1430</lpage>
          . URL: https://ieeexplore.ieee. org/document/9190659. doi:
          <volume>10</volume>
          .1109/ICIP40778.
          <year>2020</year>
          .
          <volume>9190659</volume>
          , iSSN:
          <fpage>2381</fpage>
          -
          <lpage>8549</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Gemini</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024</article-title>
          . URL: http://arxiv.org/abs/2403.05530. doi:
          <volume>10</volume>
          . 48550/arXiv.2403.05530.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] OpenAI, GPT-4
          <source>Technical Report</source>
          ,
          <year>2024</year>
          . URL: http: //arxiv.org/abs/2303.08774. doi:
          <volume>10</volume>
          .48550/arXiv. 2303.08774.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>