<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.jmsy.2020.07.011</article-id>
      <title-group>
        <article-title>German task-oriented VQA dataset annotated with human visual attention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Moritz Kronberger</string-name>
          <email>moritz.kronberger1@tha.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Ventura</string-name>
          <email>viviana.ventura@tha.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Hochschule Augsburg</institution>
          ,
          <addr-line>An der Hochschule 1, 86161 Augsburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>43</lpage>
      <abstract>
        <p>Video question answering (VQA) is a challenging task that requires models to generate answers by using both information from text and video. We present Task-oriented Human Attention Video Question Answering (THAVQA), a new VQA dataset consisting of third- and first-person videos of an instructor using a sewing machine. The sewing task is formalized step-by-step in a script: each step consists of a video annotated with German-language open-ended question and answer (QA) pairs and with human visual attention. The paper also includes a first assessment of the performance of a pre-trained Multimodal Large Language Model (MLLM) in generating answers to the questions of our dataset across different experimental settings. Results show that our task-oriented dataset is challenging for pre-trained models. Specifically, the model struggles to answer questions requiring technical knowledge or spatio-temporal reasoning.</p>
      </abstract>
      <kwd-group>
        <kwd>video question answering</kwd>
        <kwd>human visual attention</kwd>
        <kwd>multimodal large language model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>This paper presents a new VQA dataset based on demonstrations of basic sewing machine operations. To our knowledge, THAVQA, which is also annotated with human visual attention, is the first task-oriented VQA dataset in the German language.</p>
      <p>Building the dataset is a first step in a larger project aimed at developing an AI assistant for a sewing machine. This AI assistant would support students when using sewing machines for the first time. For example, this could mean answering questions about basic machine settings or explaining fundamental sewing skills. Our dataset poses unique challenges for VQA models and is almost unique among state-of-the-art VQA datasets since it is user- and task-oriented: the questions collected are those that a real user would ask for help while using the sewing machine. The process of operating the sewing machine was decomposed in a script into steps and sub-steps that were recorded and on which questions and answers were annotated. Specialized knowledge of the process and an understanding of spatial and temporal relationships are required for answering the questions collected. In addition, the limited visual variety of the video scenes and the specialized language and vocabulary challenge VQA models.</p>
      <p>Annotating human attention in the video inputs of VQA models has recently been shown to improve their performance in user- and task-oriented datasets [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>]. In our dataset, the workshop instructor’s eye gaze has been used as a proxy for human visual attention. The concept behind it is that human visual attention, integrated as input into VQA models, can help the model distinguish between video frames, especially in datasets in which the recorded scenes are very similar to each other as there are few participants and staged events.</p>
      <p>Our paper also provides a first assessment of the VQA performance of the pre-trained MLLM Gemini 1.5 Pro (https://deepmind.google/technologies/gemini/pro/) on THAVQA. Indeed, new releases of LLMs, such as Gemini 1.5 [<xref ref-type="bibr" rid="ref3">3</xref>] but also GPT-4 [<xref ref-type="bibr" rid="ref4">4</xref>], Llama 2 [5] or Claude 3 [6], now allow for visual inputs, making it possible to perform VQA tasks using pre-trained models directly.</p>
      <p>To sum up, this paper presents (1) a new dataset with third-person videos of an instructor operating a sewing machine and first-person videos annotated with human visual attention, QA pairs in German, and a script in German of the steps required to operate the machine; and (2) an evaluation of the performance of a pre-trained MLLM on generating open-ended answers from the questions and videos of our dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related</title>
    </sec>
    <sec id="sec-4">
      <title>Work</title>
      <p>The majority of state-of-the-art VQA datasets portray complex scenes composed of many events and participants, gathered using either synthetic simulation data or data sourced from movies, social media, video games or the web [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. VQA models are then tasked with answering questions about the videos’ content. This requires a wide variety of reasoning abilities, such as reasoning about spatial and temporal relationships, causal inference or relationships between actions and objects [16, 18].</p>
      <p>In contrast, research on task-oriented VQA, where question answering supports users with tasks such as industrial assembly and disassembly [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>] or collaborative machine operation [19], is relatively limited. Similarly, the setting of our dataset, a tutorial on sewing machine operation, is task-oriented and requires specialized knowledge, which makes it difficult for pre-trained MLLMs to generate satisfactory answers from only their inherent knowledge. In line with the task-oriented approaches of Ilaslan et al. [<xref ref-type="bibr" rid="ref1">1</xref>] and Gao et al. [10], we adopt both a fixed third-person view (TPV) and the first-person view (FPV) of the workshop instructor during the video recordings. To our knowledge, no other German datasets exist specifically for task-oriented VQA.</p>
      <p>Human and model attention in VQA seem to be related, as human visual attention has been shown to be correlated with model attention for VQA [20] and differences in their attention can be used to explain disagreement in VQA [21]. Human attention has been modeled explicitly by eye [<xref ref-type="bibr" rid="ref1">1</xref>] and hand tracking [<xref ref-type="bibr" rid="ref2">2</xref>] and included in the input of VQA models in order to highlight important parts of the videos that correspond to the user intentions. These annotations of human visual attention have been shown to improve VQA performance, even when using pre-trained encoders without specific fine-tuning to extract features from the visual data [<xref ref-type="bibr" rid="ref1">1</xref>]. With these intuitions, we annotated the FPV videos in our dataset with human visual attention.</p>
    </sec>
    <sec id="sec-4">
      <title>3. The Dataset</title>
      <sec id="sec-4-1">
        <title>3.1. Dataset Structure</title>
        <p>The setting of our custom VQA dataset is the introduction to sewing machine operation presented in tutorial form. We based the contents on a sewing machine workshop held at the Technische Hochschule Augsburg as part of an elective module on Smart Textiles at the Faculty of Design. We first structured the contents and detailed instructions of the workshop in a script, which primarily served as a template for video data collection. The script contains seven larger tasks, such as setting up the machine and performing different kinds of sewing operations on different kinds of fabrics, each with three to eight smaller sub-steps (35 in total), which in turn require multiple actions to be performed. The script’s contents are available as part of the publicly accessible dataset (see Online Resources).</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Video Data Collection</title>
        <p>We recorded video data of the workshop being performed by the instructor. All videos depict a regular consumer-grade sewing machine being operated by the instructor at a table (see Figure 1). The video background is visually complex and reflects the real workshop environment. We also extended the video dataset to two student participants using exactly the same recording procedure (same environment, perspectives and script steps). The extended dataset, containing a total of 48 minutes of footage, is available on request. To reduce the chance of errors in the video demonstrations negatively impacting VQA performance, we rely exclusively on the expert demonstrations for the scope of this paper.</p>
        <p>[Figure 1: (a) third-person view of the recording setup; (b) first-person view with the annotated gaze circle (FPVC); (c) first-person view with the attention map (FPVA).]</p>
        <p>Two different camera perspectives were recorded simultaneously: a static TPV looking over the instructor’s left shoulder towards the machine (see Figure 1a) as well as a dynamic FPV of the instructor (see Figure 1b). For recording the FPV we used the Tobii Pro Glasses 3 eye tracking glasses (https://www.tobii.com/products/eye-trackers/wearables/tobii-pro-glasses-3) and collected the instructor’s eye gaze fixations for the entire duration of the recordings. We split the recordings (TPV and FPV) into the 35 sub-steps and manually synchronized them across both perspectives.</p>
        <p>We chose two different types of annotations to represent the human attention in the FPV. First, we annotated the 2D location of the instructor’s eye gaze via a red circular outline (FPVC) (see Figure 1b), representing a bounding box for the current area of human attention, similar to the annotation style of Ilaslan et al. [<xref ref-type="bibr" rid="ref1">1</xref>]. We also created a second annotation layer, attention maps (FPVA), where each pixel is masked with increasing intensity with increasing distance to the gaze fixation point (see Figure 1c). Although this masking may obscure important information in the video, it clearly restricts the model’s visual input to the human focal point.</p>
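        <p>For illustration, both annotation styles can be sketched from an exported gaze fixation as follows (a minimal sketch assuming OpenCV and NumPy; the circle radius and the Gaussian falloff of the mask are illustrative choices rather than the exact parameters used for THAVQA):</p>
        <preformat>
import cv2
import numpy as np

def annotate_gaze_circle(frame: np.ndarray, gaze_xy: tuple[int, int], radius: int = 60) -> np.ndarray:
    """FPVC-style annotation: draw a red circular outline around the gaze point."""
    out = frame.copy()
    cv2.circle(out, gaze_xy, radius, color=(0, 0, 255), thickness=4)  # BGR red outline
    return out

def annotate_attention_map(frame: np.ndarray, gaze_xy: tuple[int, int], sigma: float = 120.0) -> np.ndarray:
    """FPVA-style annotation: darken pixels more strongly the further they are from the gaze point."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2)
    keep = np.exp(-(dist ** 2) / (2 * sigma ** 2))  # 1.0 at the fixation point, close to 0 far away
    return (frame * keep[..., None]).astype(np.uint8)
        </preformat>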
      </sec>
      <sec id="sec-4-3">
        <title>3.3. QA Pair Collection</title>
        <p>We recruited 10 German-speaking crowdworkers on the Prolific platform (https://www.prolific.com) to formulate open-ended question-answer pairs on the recorded videos. Crowdworkers were offered an approximate hourly reward of 11.80€ including bonuses. Crowdworkers were shown a random video in the TPV that represents a sub-step, together with the corresponding sub-step in the script. Giving annotators access to the script’s contents, a description of the actions performed on the sewing machine by the instructor (see Section 3.1), did cause the resulting QA pairs to be less focused on the contents of the video and more focused on the contents of the textual descriptions. However, we still opted to include the textual context in order to encourage the use of correct technical language by the non-expert annotators and to ensure a better understanding of the videos’ contents. Additionally, we discarded QA pairs that were either factually incorrect, not intelligible or ungrammatical.</p>
        <p>The resulting QA pairs were then manually annotated by reasoning type (see Figures 2-3 in the Appendix):
• knowledge-based reasoning when questions need technical knowledge to be answered;
• spatial reasoning when locations or directions are to be described;
• temporal reasoning when questions are related to the sequential order of actions;
• perception-based reasoning when the answer can only be retrieved by visually inspecting the video.</p>
        <p>The categorization of QA pairs into these reasoning types is often ambiguous, especially when differentiating whether a question pertains to knowledge-based reasoning as opposed to spatial or temporal reasoning. In fact, most knowledge about how to sew is based on spatial and temporal information. For example, the question “What happens after winding the bobbin?” is temporal in nature but could also be answered from the model’s inherent pre-training knowledge instead of extracting temporal information from the video input. We therefore approached the labeling process of QA pairs as follows:
• If a question can be answered by locating objects in the visual input it is categorized as requiring spatial reasoning.
• If a question can be answered by observing and relating the video input over multiple frames it is categorized as requiring temporal reasoning.
• If a question cannot be reasonably answered from the video input but rather requires using pre-training knowledge it is categorized as requiring knowledge-based reasoning.</p>
        <p>This approach still leaves some amount of ambiguity; for example, specialized knowledge about sewing-machine-specific terms may be required in order to identify the object, for example “the bobbin”, to be located in a QA pair about temporal reasoning. For the QA pair annotation it was therefore decided whether a question corresponds to a single reasoning type or whether it should be assigned to multiple reasoning types.</p>
        <p>The different reasoning types also give an indication of which dataset modality is required for the model to answer the dataset’s questions. Strictly knowledge-based questions, for instance, primarily test the model’s pre-training knowledge and are therefore not expected to profit from a visual input modality. Spatial and temporal questions both require the model to extract additional information from visual inputs. For spatial reasoning, a sequence of video frames might help with occlusion or depth perception; however, in most cases a static image will offer the required context for a spatial question to be answered. Temporal reasoning requires the model to relate visual information over a span of multiple frames, making video context a requirement to answer temporal questions.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Descriptive Statistics</title>
        <p>In total the video recordings span 16 minutes and 24 seconds across the TPV and FPV, with a mean duration of 14 seconds for single sub-step-related video clips.</p>
        <p>Since the dataset’s scenario only involves sewing machine operation, we expect limited variability within the contents of the videos. This might mean that the video data offers little usable information to a pre-trained MLLM. We quantified this lack of visual variation as the semantic similarity of video frames within a single video clip related to one of the 35 sub-steps. We obtained the semantic similarity scores by randomly sampling 20 frames for each clip and transforming them into embeddings using the CLIP model [22]. We used cosine similarity [23] as the distance metric and calculated the mean of the similarity matrix between all 20 embeddings. We compared this semantic similarity for the TPV and FPV, including both types of annotations for human visual attention (see Table 1). As expected, the frames within video clips are very similar, with the static TPV exhibiting the largest semantic similarity between video frames. The FPV annotated with attention maps displays the second highest similarity score, possibly due to the fact that large portions of the frames are masked and the position of the focal point does not alter the embedding vector significantly. We do not find a difference between the similarity scores of the regular FPV and the FPV including the circle annotation of the eye gaze. Overall, this indicates that a pre-trained MLLM may struggle to extract and meaningfully interpret human attention information.</p>
        <p>[Table 1: mean semantic similarity of video frames within single clips for the TPV, FPV, FPVC and FPVA recordings.]</p>
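        <p>A minimal sketch of this frame-similarity computation is given below (it assumes the Hugging Face transformers implementation of CLIP; the checkpoint name is illustrative, as the paper does not specify the CLIP variant):</p>
        <preformat>
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")        # illustrative checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mean_frame_similarity(frame_paths: list[str]) -> float:
    """Mean pairwise cosine similarity between CLIP embeddings of the sampled frames of one clip."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)       # shape: (n_frames, dim)
    emb = emb / emb.norm(dim=-1, keepdim=True)         # normalize so dot products are cosine similarities
    sim = (emb @ emb.T).cpu().numpy()                  # full similarity matrix between all frames
    return float(np.mean(sim))                         # mean over the matrix, as described in Section 3.4
        </preformat>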
        <p>After manually filtering incorrect or unintelligible QA pairs and annotating the reasoning types, we obtained a total of 122 QA pairs, with 1 to 9 QA pairs per sub-step of the script. Additionally, we prompted Gemini 1.5 Pro to answer the 122 questions, obtaining a total of 2562 answers; further details are described in Section 4. We found 96 QA pairs to pertain to knowledge-based reasoning, with 33 QA pairs requiring spatial, 15 temporal and 4 perception-based reasoning (see Figure 3 in the Appendix). A total of 24 QA pairs were annotated with more than one reasoning type due to ambiguity. All but one of these pairs were assigned the label “knowledge-based reasoning” in combination with at least one more reasoning type.</p>
        <p>Additionally, we analyzed the diversity of QA pairs in terms of token and lemma counts as well as the Root Type-Token Ratio (RTTR), calculated using the default parameters of Shen [24] (see Table 2). We calculated the descriptive statistics as a mean over single questions and answers as well as across all questions, answers and the entire dataset. The questions and answers provided by the human annotators are largely brief and concise, resulting in low token and lemma counts alongside a low RTTR. When extending the calculations to all questions and answers or the entire dataset, repetitions become more frequent, evidenced by a higher RTTR.</p>
        <p>Table 2: Mean statistics over single questions and answers as well as across all questions, answers and the entire dataset.
Tokens | Lemmas | RTTR
Single questions: 9.79 ±3.0 | 9.12 ±2.43 | 2.88 ±0.45
Single answers: 12.58 ±8.74 | 10.45 ±5.83 | 2.99 ±0.85
Questions: 1519 | 286 | 9.34
Answers: 1950 | 371 | 9.94
Total: 3469 | 502 | 10.31</p>
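        <p>The RTTR divides the number of distinct types by the square root of the token count. A minimal sketch of the computation is shown below (the whitespace tokenization is illustrative; the paper relies on the default parameters of Shen [24]):</p>
        <preformat>
import math

def rttr(tokens: list[str]) -> float:
    """Root Type-Token Ratio: number of unique types divided by the square root of the token count."""
    return len(set(tokens)) / math.sqrt(len(tokens))

# Illustrative usage on a whitespace-tokenized question; the reported lemma counts additionally
# require a German lemmatizer (for example spaCy's de_core_news_sm pipeline).
question = "Wo befindet sich der integrierte Fadenschneider der Maschine ?"
tokens = question.split()
print(len(tokens), rttr(tokens))
        </preformat>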
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Methodology</title>
      <p>For the evaluation we selected Gemini 1.5 Pro as an example of a pre-trained MLLM. Gemini 1.5 Pro is part of a new family of highly capable multi-modal models, Gemini 1.5, and it is a sparse mixture-of-experts Transformer-based model. Due to its long input context of up to 10 million tokens it is capable of processing video inputs at a high resolution and sampling rate [<xref ref-type="bibr" rid="ref3">3</xref>], giving it a good chance at extracting detailed visual information. We accessed Gemini through the Vertex AI inference API (https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference). We prompted Gemini to answer the questions formulated by the human annotators. To evaluate the model’s performance, the answers generated by Gemini are manually compared against the human gold-standard answers. Two human annotators gave binary labels of whether or not the model answer could serve as an acceptable replacement for the human answer. The two annotators were trained by tagging part of the dataset together. Given the clarity of the binary annotation task, they then annotated the remaining part of the dataset by themselves. Instances where the model refused to answer due to a lack of information were labeled as not acceptable. For the final evaluation score we expressed the ratio of acceptable answers to the total number of answers as binary accuracy (see Table 3).</p>
      <p>To evaluate the impact of different inputs (FPV, TPV, human visual attention, script) on the VQA performance of Gemini we constructed seven ablation settings.</p>
      <p>First, we prompted the model with the questions and did not include any other context in the form of textual information or videos. We refer to this ablation setting as the naive baseline. We expected this configuration to serve as the bottom limit of model performance, relying exclusively on the model’s inherent knowledge gathered from pre-training.</p>
      <p>For the second ablation scenario, we included the instructions for the sub-step of the script any given question was formulated for. These instructions do not only aid with knowledge-based questions but also contain important descriptions about the temporal order and spatial location of actions. Excluding perception-based reasoning, we therefore expected this ablation setting to represent the upper limit of model performance. As such, this ablation setting is referred to as the text-only reference model.</p>
      <p>Third, we included an FPV video clip corresponding to the given question along with the sub-step instructions. We refer to this model as the multimodal reference model and expect it to perform similarly to the text-only reference model, with the additional ability to reason about perception-based questions. If satisfactory answers cannot be generated from the model’s pre-training knowledge, we would expect both reference models to outperform the naive baseline significantly.</p>
      <p>In the remaining four ablation settings, we included a single video clip related to the given question with every prompt. Each ablation setting used video clips either from a specific perspective (TPV or FPV) or with a specific type of visual attention information, either the red circle (FPVC) or the attention map (FPVA). For these settings we did not include any other textual information, meaning all information present in the answers must have been inherent to the model or extracted from the video.</p>
      <p>We repeated the same prompt for every question in every ablation setting three times to account for variations in the model’s output. This resulted in 366 model responses per ablation setting, a total of 2562 answers. Additional information about the model prompts is provided in Section E of the Appendix. Since THAVQA is imbalanced towards knowledge-based questions, we reduced their amount by randomly sampling knowledge-based questions. We chose the sample size with a margin of error of 5%, a confidence of 95% and the estimated proportion set maximally at 0.5. With finite population correction we thereby reduced the amount of knowledge-based model answers from 210 to 143 per ablation setting. Model answers including spatial reasoning accounted for 99, temporal reasoning for 45 and perception-based reasoning for 12 model answers per ablation setting. This means that the evaluated model answers were still imbalanced towards knowledge-based reasoning.</p>
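      <p>For illustration, the following minimal sketch shows how such ablation prompts can be issued through the Vertex AI Python SDK (the project ID, region, video URI and the shortened prompt text are placeholders, not the exact configuration used for the experiments):</p>
      <preformat>
from typing import Optional
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-project", location="europe-west3")  # placeholder project and region
model = GenerativeModel("gemini-1.5-pro")

def ask(question: str, instructions: Optional[str] = None,
        video_uri: Optional[str] = None, n_repeats: int = 3) -> list[str]:
    """Query Gemini once per repetition with the parts required by one ablation setting."""
    parts = []
    if video_uri:  # video-based settings (TPV, FPV, FPVC, FPVA) attach a sub-step clip
        parts.append(Part.from_uri(video_uri, mime_type="video/mp4"))
    prompt = ("Du bist ein Nähmaschinenassistent. Beantworte Fragen zu Benutzung "
              "einer Nähmaschine so korrekt und präzise wie möglich.\n")
    if instructions:  # text-only and multimodal reference settings add the script excerpt
        prompt += ("\nFolgender Auszug aus einer Anleitung hat möglicherweise "
                   f"Bezug zur Frage:\n{instructions}\n")
    prompt += f"\nDie Frage lautet:\n{question}"
    parts.append(prompt)
    return [model.generate_content(parts).text for _ in range(n_repeats)]
      </preformat>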
    </sec>
    <sec id="sec-5b">
      <title>5. Evaluation</title>
      <p>We calculated the binary answer accuracy (see Section 4) for every ablation setting and reasoning type as shown in Table 3. To test for statistical significance we calculated χ² on a contingency table of the binary “acceptable” labels between every pair of ablation settings for every reasoning type. We accepted p-values &lt; 0.05 as statistically significant.</p>
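      <p>A minimal sketch of the binary accuracy and of this pairwise significance test is given below (assuming SciPy; the label arrays are illustrative stand-ins for the manual acceptability annotations):</p>
      <preformat>
import numpy as np
from scipy.stats import chi2_contingency

def binary_accuracy(labels: np.ndarray) -> float:
    """Ratio of answers judged acceptable (1) to all answers in one ablation setting."""
    return float(np.mean(labels))

def compare_settings(labels_a: np.ndarray, labels_b: np.ndarray, alpha: float = 0.05):
    """Chi-square test on the 2x2 contingency table of 'acceptable' labels for two settings."""
    table = np.array([
        [np.sum(labels_a == 1), np.sum(labels_a == 0)],
        [np.sum(labels_b == 1), np.sum(labels_b == 0)],
    ])
    chi2, p, _, _ = chi2_contingency(table)
    return p, bool(p &lt; alpha)  # significant if p falls below the 0.05 threshold

# Illustrative usage with random stand-in labels for two ablation settings
# (299 evaluated answers per setting: 143 + 99 + 45 + 12).
rng = np.random.default_rng(0)
baseline = rng.integers(0, 2, size=299)
reference = rng.integers(0, 2, size=299)
print(binary_accuracy(reference), compare_settings(baseline, reference))
      </preformat>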
      <p>Both reference models outperformed the naive baseline significantly in terms of total accuracy over all reasoning types (4.28e-25 ≤ p ≤ 4.57e-19). This confirms that the chosen task-oriented VQA scenario of sewing machine operation was specialized enough that Gemini was not able to provide satisfactory answers using only its pre-training knowledge. For perception-based reasoning questions, no significant difference in accuracy between the naive baseline and the text-only reference model was found. However, both were outperformed significantly by the multimodal reference model (0.004 ≤ p ≤ 0.04). We can therefore conclude that the model was generally able to extract meaningful information from the video inputs. Across all individual reasoning types other than perception-based questions, no statistically significant differences between the performances of the text-only and multimodal reference models could be observed, indicating that the textual instructions included enough spatial and temporal information to make the additional video input redundant.</p>
      <p>All video-only ablation scenarios (TPV, FPV, FPVC, FPVA) across all individual reasoning types except for perception-based reasoning were outperformed by both reference models and did not show significant advantages over the naive baseline. Given that even the multimodal reference model was not able to significantly improve upon the text-only reference model, these results were to be expected. Similarly, the video-only ablation scenarios were able to improve over the accuracy of the naive baseline and the text-only reference model with respect to perception-based reasoning, although these results were above or close to the cutoff for statistical significance (0.004 ≤ p ≤ 0.4).</p>
      <p>More importantly, however, for any individual reasoning type, annotating human attention via either annotation type (FPVC and FPVA) did not significantly improve accuracy in comparison to the regular FPV or TPV videos. This confirms that the pre-trained MLLM was in fact not able to meaningfully interpret the human attention annotations without fine-tuning.</p>
      <p>Overall, the experimental setup was suitable to reveal differences in VQA performance for the different forms of video inputs and reasoning types. In fact, the task-oriented nature of THAVQA was challenging for a pre-trained MLLM such as Gemini: while the model was often able to extract enough information for questions requiring basic perception, this was not the case for questions involving complex reasoning about temporal or spatial dimensions that are peculiar to a procedural task such as sewing. For these types of reasoning the model achieved its best performances when detailed textual information related to the corresponding sub-steps was included in the ablation scenarios. Besides the nature of the questions formulated, the videos themselves may also be challenging for the model: we can hypothesize that this is due to the high semantic similarity between the video frames, as shown in Section 3.4.</p>
      <sec id="sec-5b-1">
        <title>5.1. Qualitative Analysis</title>
        <p>If no video inputs were included for perception-based questions, such as retrieving the fabric’s color, Gemini mostly pointed out that it was lacking the information required to provide an answer. Additionally, including video inputs seemed to help the model disambiguate questions. For example, the naive baseline misunderstood a question about removing excess threads from the work piece, interpreting it as referring to undoing entire unwanted seams. With video inputs, the model was able to infer that the question was simply related to trimming long threads hanging off the fabric. Finally, we found that video context seemed to encourage the model to provide descriptions of spatial relationships, even when this is not strictly required to answer the question.</p>
        <p>Overall, we observed a positive effect of video inputs on the model’s answers when compared to the naive baseline. Examples are provided in the Appendix (Figures 5-7).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We provide THAVQA, a new task-oriented, German-language VQA dataset on demonstrations of sewing machine operation with open-ended human QA pairs and human visual attention. We then compared the VQA performance of Gemini 1.5 Pro on THAVQA while varying the model inputs. We found that the task-oriented scenario of THAVQA was specific enough that the model could not rely on only its inherent knowledge to generate satisfactory responses. The questions contained in our dataset were beyond the capacity of the model to reason about the video data. Combining textual instructions with a first-person video resulted in the best-performing model across all reasoning types of questions.</p>
      <p>When looking towards the design of a VQA model for a future, practical sewing machine assistant, video inputs could therefore be used mainly to improve the model’s perception abilities, while a retrieval system for textual information could provide the necessary specialized knowledge.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This research was funded by the Bavarian State Ministry for Science and the Arts (StMWK: Bayerisches Staatsministerium für Wissenschaft und Kunst) as part of the project ”CHIASM” (Chancenreiche industrielle Anwendungen für vortrainierte Sprachmodelle) and as part of the High Tech Agenda of the Free State of Bavaria. We thank Rebecca Bilger of the Education and Learning Lab for Sustainability Innovations (ELLSI) for her support with the topic of sewing machine operation, the scheduling and organization of data collection and her participation in the video dataset. We also thank the research group for Applied Technologies of Language and Assistance Systems (THA_atlas) at the Technische Hochschule Augsburg for supporting the project with advice and equipment.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
      <p>The dataset, including synchronized video data with annotated eye gaze as well as human-formulated and model-generated question-answer pairs with reasoning type annotations, is available via https://github.com/tha-atlas/HowDoesSewingMachineWork.git.</p>
    </sec>
    <sec id="sec-8">
      <title>B. Crowdsourced</title>
    </sec>
    <sec id="sec-9">
      <title>Question-Answer Formulation</title>
      <p>Where is the sewing machine’s built-in thread
cutter located?
Wo befindet sich der integrierte Fadenschneider
der Maschine?
What does the seamstress check at the end of
the sewing?
Was kontrolliert die Näherin am Ende des
Nähens?
What color is the fabric in the video?
Welche Farbe hat der Stof in dem Video?
(b)
(c)
(d)</p>
    </sec>
    <sec id="sec-10">
      <title>D. Semantic Similarity of Human and Model Answers</title>
      <p>We also evaluated the similarity between human and
model answers for every ablation scenario as a sentence
BLEU-score [25] and BERT-scores [26] with precision,
recall and F1-score (see Table 4). However, we excluded
these metrics from the main evaluation, since they do
not provide a direct measure for the factual correctness
of the model’s responses. As expected, the reference
model with access to the same textual information that
annotators were using to formulate QA pairs achieves
the highest semantic similarity to human answers.</p>
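      <p>A minimal sketch of these similarity metrics is shown below (assuming NLTK for the sentence-level BLEU score and the bert-score package; the model answer shown is an illustrative paraphrase of the human gold answer from Section E):</p>
      <preformat>
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

human = "Die Farbe des Stoffes ist blau."
model = "Der Stoff im Video ist blau."

# Sentence-level BLEU between one model answer and the human gold-standard answer.
bleu = sentence_bleu([human.split()], model.split(),
                     smoothing_function=SmoothingFunction().method1)

# BERTScore precision, recall and F1 for a batch of (model, human) answer pairs in German.
P, R, F1 = bert_score([model], [human], lang="de")

print(bleu, float(P[0]), float(R[0]), float(F1[0]))
      </preformat>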
    </sec>
    <sec id="sec-11">
      <title>E. Model Prompts</title>
      <p>Videos were included in the prompts as media attachments (see https://ai.google.dev/gemini-api/docs/prompting_with_media#prompting-with-videos). The base prompt and its context additions are shown below in English translation and in the German original.</p>
      <p>Base prompt (English translation):
You are a sewing machine assistant. Answer questions about using a sewing machine as accurately and precisely as possible.
It may be difficult to answer the questions based on the given context. However, there is no way to ask follow-up questions. Therefore, always try to answer the question as well as possible.
The answer should be concise and directly related to the question, such as:
Question: How do I ...?
Answer: In order to ..., you ...
The question is:
&lt;question&gt;</p>
      <p>Base prompt (German original):
Du bist ein Nähmaschinenassistent. Beantworte Fragen zu Benutzung einer Nähmaschine so korrekt und präzise wie möglich.
Möglicherweise sind die Fragen anhand des gegeben Kontexts schwer zu beantworten. Es gibt jedoch keine Möglichkeit Rückfragen zu stellen. Versuche also die Frage stets so gut wie möglich zu beantworten.
Die Antwort sollte bündig formuliert sein und lediglich Bezug zur Frage aufweisen, also beispielsweise:
Frage: Wie mache ich ...?
Antwort: Um ... durchzuführen wird ...
Die Frage lautet:
&lt;question&gt;</p>
      <p>Script excerpt (text-only and multimodal reference settings):
The following excerpt from a manual may be related to the question: &lt;instructions&gt;
Folgender Auszug aus einer Anleitung hat möglicherweise Bezug zur Frage: &lt;instructions&gt;</p>
      <p>Video context (video-based settings):
The answer to the question is demonstrated in the attached video. &lt;video&gt;
Die Antwort auf die Frage wird im beigefügten Video demonstriert. &lt;video&gt;</p>
      <p>Gaze circle note (FPVC):
The gaze point of a person performing the task is annotated with a red circle and highlights important areas of the image.
Der Blickpunkt eines Menschen beim Durchführen der Aufgabe ist durch einen roten Kreis annotiert und weist auf wichtige Bildbereiche hin.</p>
      <p>Attention map note (FPVA):
In the video, the viewpoint of a person performing the task is masked, thus restricting the video to the most important area of the image.
Im Video ist der Blickpunkt eines Menschen beim Durchführen der Aufgabe maskiert, das Video also auf den wichtigsten Bildbereich beschränkt.</p>
      <p>Example answers across ablation settings (Figures 5-7):</p>
      <p>Question: What color is the fabric in the video? / Welche Farbe hat der Stoff in dem Video?
Answer (naive baseline): It is impossible to determine the color of the fabric in a video without seeing the video. / Es ist unmöglich die Farbe des Stoffes in einem Video zu bestimmen, ohne das Video zu sehen.
Answer (FPV): The color of the fabric is blue. / Die Farbe des Stoffes ist blau.</p>
      <p>Question: How should excess threads be removed from the work piece? / Wie sollten die überschüssigen Fäden des Werkstückes entfernt werden?
Answer (naive baseline): To remove excess threads, use a seam ripper to carefully open the seams and remove the unwanted threads. / Um überschüssige Fäden zu entfernen, verwende einen Nahtauftrenner, um die Nähte vorsichtig zu öffnen und die unerwünschten Fäden zu entfernen.
Answer (FPV): To remove excess threads, cut them off with fabric scissors. / Um überschüssige Fäden zu entfernen, schneidet man diese mit einer Stoffschere ab.</p>
      <p>Question: How do I manually insert the needle into the fabric? / Wie führe ich die Nadel manuell in den Stoff ein?
Answer (naive baseline): The needle can be manually inserted using the handwheel. / Mithilfe des Handrads kann die Nadel manuell eingeführt werden.
Answer (TPV): To manually insert the needle into the fabric, turn the handwheel on the right side of the sewing machine towards yourself. / Um die Nadel manuell in den Stoff einzuführen, dreht man das Handrad an der rechten Seite der Nähmaschine zu sich heran.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ilaslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lim</surname>
          </string-name>
          , M. Shou,
          <article-title>GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze TaskOriented Collaborations</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>10462</fpage>
          -
          <lpage>10479</lpage>
          . URL: https: //aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>648</volume>
          . doi:
          <volume>10</volume>
          . 18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>648</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fang</surname>
          </string-name>
          , Y. Cheng, N. Gauthier,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>TaskOriented Multi-Modal Question Answering For Collaborative Applications</article-title>
          , in: 2020
          <source>IEEE International Conference on Image Processing (ICIP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1426</fpage>
          -
          <lpage>1430</lpage>
          . URL: https://ieeexplore.ieee. org/document/9190659. doi:
          <volume>10</volume>
          .1109/ICIP40778.
          <year>2020</year>
          .
          <volume>9190659</volume>
          , iSSN:
          <fpage>2381</fpage>
          -
          <lpage>8549</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Gemini</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024</article-title>
          . URL: http://arxiv.org/abs/2403.05530. doi:
          <volume>10</volume>
          . 48550/arXiv.2403.05530.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] OpenAI, GPT-4
          <source>Technical Report</source>
          ,
          <year>2024</year>
          . URL: http: //arxiv.org/abs/2303.08774. doi:
          <volume>10</volume>
          .48550/arXiv. 2303.08774.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>