<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Computer and System Sciences 55 (1997) 119-139. URL: https://doi.org/10.
1006/jcss.1997.1504.
[34] L. Breiman</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.compedu.2021</article-id>
      <title-group>
        <article-title>Video Features for Predicting Knowledge Gain in Search as Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wolfgang Bitter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anett Hoppe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Ewerth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TIB - Leibniz Information Centre for Science and Technology</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Marburg and hessian.AI - Hessian Center for Artificial Intelligence</institution>
          ,
          <addr-line>Marburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>12748</volume>
      <fpage>318</fpage>
      <lpage>330</lpage>
      <abstract>
        <p>While video platforms increasingly serve as primary learning resources during exploratory web searches, current approaches to predicting knowledge gain largely ignore video-specific features. This paper bridges this gap by examining how video interaction features (e.g., pausing, rewinding, forward navigation, viewing coverage) and video resource features (e.g., words per minute in speech transcripts, complex word ratios, and video file size density) correlate with learning outcomes. Using a publicly available dataset of 94 participants who engaged with educational videos during their search sessions, our analysis reveals that video interaction features, particularly those related to interaction frequency, are the strongest predictors of learning outcomes. Moreover, we analyze the influence of individual features on classification performance, revealing distinct relationships between different types of video interactions and knowledge gain. While our study is exploratory and based on a limited dataset, it provides valuable first insights and a foundation for future research on video-based learning behavior in search as learning settings. These insights can inform the design of adaptive learning systems that recognize and promote productive video engagement behaviors. To support future research, we release our feature extraction pipeline and analysis code.</p>
      </abstract>
      <kwd-group>
        <kwd>Search as Learning</kwd>
        <kwd>Knowledge Gain Prediction</kwd>
        <kwd>Video Learning</kwd>
        <kwd>Video Interactions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The internet has transformed how people acquire knowledge, with web searches playing a pivotal role
in informal learning.</p>
      <p>Educational videos have become increasingly important in addition to traditional textual resources.
They offer learners an engaging way to process complex topics through rich visual and auditory
elements. Platforms like YouTube are crucial tools in these learning journeys. They enable users to
control the pace of their exploration and revisit challenging content sections.</p>
      <p>The research area Search as Learning (SaL) investigates web search sessions with a learning
intent [1]. Recently, research on SaL has made significant strides in understanding how web searches
facilitate knowledge acquisition. A considerable body of work has explored the relationship between
learning outcomes and both user behavior (e.g., query patterns, clickstreams) [2, 3, 4] and the
properties of consumed resources (e.g., textual complexity, readability) [5, 6, 7, 8]. However, while these
studies offer valuable insights into textual resources, videos, which are inherently multimodal and
interactive, remain understudied in the context of SaL.</p>
      <p>Educational videos are uniquely suited for learning because they combine multiple forms of
information, as emphasized in the Cognitive Theory of Multimedia Learning (CTML) [9]. Interactions such as
pausing, rewinding, and forward jumping allow learners to adapt content delivery to their needs,
potentially enhancing comprehension and retention. Prior studies have examined how these interactions
correlate with learning in controlled settings [10, 11], but there is limited research investigating their
role in authentic web search contexts.</p>
      <p>This paper addresses this gap by analyzing user interactions with videos during learning-oriented
web searches. Specifically, we explore the extent to which interaction features can predict knowledge
gain (KG), providing a deeper understanding of how video consumption contributes to learning in
informal settings. The study focuses on two key research questions:</p>
      <p>RQ1: How do user interactions (e.g., pausing, rewinding) correlate with learning outcomes
during video consumption in web search sessions?</p>
      <p>RQ2: To what extent can these interactions predict knowledge gain in such contexts?</p>
      <p>To answer these questions, we use the publicly available SaL-Lightning dataset [12]. We developed a
semi-automated approach for extracting detailed interaction logs from screen recordings. These logs
capture user behaviors and are analyzed to uncover their relationships with knowledge outcomes. By
focusing on videos as part of the broader SaL framework, our study expands the current understanding
of multimodal learning resources and their role in web-based knowledge acquisition.</p>
      <p>The remainder of the paper is structured as follows: Section 2 summarizes the recent research on
SaL and video learning. Next, in Section 3, we explain how we extracted interaction logs from screen
recordings and define the extracted video features. In Section 4, we describe the evaluation process and
give insights into the relationship between video interaction and KG. Finally, in Section 5, we summarize
the results and discuss the implications for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Search engines are increasingly used as learning tools. Therefore, it becomes imperative to design
systems prioritizing knowledge acquisition [13]. The field of Search as Learning investigates user
behavior and system design to enhance learning outcomes during web-based searches [14, 15].</p>
      <p>Research in SaL has explored diverse aspects of learning-related interactions. For instance,
Vakkari [16] identified features that reflect users’ learning needs and their influence on knowledge
acquisition during search activities. Similarly, Roy et al. [17] highlighted that learning is affected by
prior knowledge, emphasizing the importance of modeling users’ knowledge states and their evolution
during the search process [18, 19, 20, 21].</p>
      <p>Efforts to study factors influencing KG during search can be broadly categorized into two research
streams: a focus on (a) characteristics of web resources and (b) user behavior. For instance, Syed and
Collins-Thompson [22] studied document-level features to improve learning outcomes, particularly
for vocabulary acquisition. Ghafourian et al. [6] and Gritz et al. [5] explored readability metrics and
textual complexity, demonstrating their impact on user behavior and KG prediction. Yu et al. [23, 7]
utilized a wide range of features, including text and HTML statistics, to predict KG, while Otto et al. [24]
investigated how multimedia features complement readability and linguistic factors in predicting
learning outcomes. Recently, Gritz et al. [8] found a moderating influence of the visual complexity of
web pages on learning outcomes.</p>
      <p>Furthermore, research has shown that user behavior differs across more and less successful web
searches. For example, input queries [2], navigation logs [3], and behavioral features such as time spent
on pages or click patterns [4] have been linked to learning outcomes. These studies provide insights
into how user interactions reflect and impact knowledge acquisition.</p>
      <p>In the context of SaL, videos offer a multimodal learning experience that extends beyond traditional
textual resources. Videos enhance comprehension and retention by combining visual and auditory
elements, a principle supported by the Cognitive Theory of Multimedia Learning (CTML) [9]. User
interactions, such as pausing, rewinding, and forwarding, are critical in learners’ engagement with
video content. For example, the segmenting effect described in CTML suggests that breaking multimedia
content into smaller segments can reduce cognitive load and improve learning outcomes [25, 11]. Pausing
behavior, in particular, can reflect moments when learners process or integrate new information, aligning
with points of high complexity or meaningful content structure [10, 26].</p>
      <p>Despite the advancements in SaL and video learning research, the role of user interactions with
videos within exploratory web searches remains relatively unexplored. Previous research has primarily
focused on textual resources, leaving a gap in understanding how video-based interactions contribute
to knowledge acquisition. Our work addresses this gap by examining video-specific user interaction
features and their influence on KG prediction. This contributes to a more holistic understanding of
learning in the context of SaL.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>We developed systematic methods with three main components to investigate how video interactions
and resource characteristics influence learning outcomes during web searches. First, we selected a
dataset that captures both user interactions with educational videos and the measurement of knowledge
gain (Section 3.1). Next, we implemented a semi-automatic approach to extract video interaction data
from screen recordings (Section 3.2). Finally, we derived two sets of features: interaction features that
quantify user engagement patterns and resource features that characterize video content properties
(Section 3.3). This set of methods enables us to analyze how different aspects of video engagement
correlate with knowledge acquisition during exploratory searches.</p>
      <sec id="sec-3-1">
        <title>3.1. Rationale for Selection of the Dataset</title>
        <p>After reviewing datasets for exploratory web searches from the literature [4, 27, 12], we decided to use
the SaL-Lightning dataset [12]. This dataset proved optimal for our purposes for several reasons. First,
it includes pre- and post-test data necessary for measuring learning gains, unlike alternatives such as
CoST [27]. Second, it captures diverse web navigation patterns with substantial video engagement, with
82 % of participants (94 of 114) accessing YouTube videos. This contrasts with other datasets such as
Gadiraju et al. [4], where participants primarily accessed textual content like Wikipedia articles. While
the published dataset includes standard user actions (clicks, scrolling), it lacks interactions with videos.
We obtained access to the original screen recordings, enabling us to extract detailed video interaction
data through semi-automated processing.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Extraction of Video Interactions from Screen Recordings</title>
        <p>Manual annotation of screen recordings is time-consuming and error-prone, requiring continuous
attention and precise temporal documentation. To address this, we developed a semi-automated
approach, as depicted in Figure 1. The core idea involves fine-tuning and overfitting the object detection
algorithm YOLO [28] and the OCR model TrOCR [29] on the study data to generate accurate interaction
logs.</p>
        <p>Using the provided timelines, we automatically extracted individual clips from the screen recordings,
each representing a continuous sequence of a participant watching a YouTube video. Each clip was
sampled at 10 frames per second. YOLO was used to detect the play/pause icon and the video playback
position displayed on the interface. We fine-tuned YOLO using an initial training set of 25 manually
annotated frames. Through iterative quality reviews, we addressed detected errors by expanding the
training dataset, ultimately achieving reliable performance with 79 annotated frames.</p>
        <p>The OCR algorithm TrOCR was applied to extract video timestamps. Similarly, misrecognized
timestamps identified during quality checks were iteratively added to the training data, resulting in
a total of 344 annotations. This process yielded interaction logs at a resolution of 10 timestamps per
second, capturing whether the video was playing or paused.</p>
        <p>[Figure 1: Semi-automated extraction pipeline. Steps shown: extract YouTube controls; OCR to extract video timestamps; parse and refine extracted logs (output: action logs); manual quality control by an actor; add training examples and re-train models.]</p>
        <p>The data were smoothed using a rolling maximum approach to address frame-to-frame inconsistencies
in the detection results. We applied a seven-frame sliding window (three frames before and after each
target frame) to determine the video timestamp and the playing status (paused/playing) through a
majority vote. To validate the resulting interaction logs, we randomly sampled clips from different
participants and manually cross-checked them against the original screen recordings until no further
errors were identified.</p>
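        <p>The seven-frame majority vote on the playing status can be illustrated with a minimal sketch. It is not the authors' implementation: the function name <monospace>smooth_states</monospace>, the string labels, and the truncation of the window at clip boundaries are assumptions made for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def smooth_states(states, radius=3):
    """Majority vote over a sliding window of 2 * radius + 1 frames.

    `states` is a list of per-frame labels (e.g. "playing" / "paused").
    ASSUMPTION: at clip boundaries the window is simply truncated.
    """
    smoothed = []
    for i in range(len(states)):
        window = states[max(0, i - radius): i + radius + 1]
        # most_common(1) returns the label with the highest count in the window
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed


# A single misdetected "paused" frame inside a playing segment is voted away,
# while the genuine pause at the end of the clip is kept.
raw = ["playing"] * 5 + ["paused"] + ["playing"] * 5 + ["paused"] * 6
print(smooth_states(raw))
```
]]></preformat>
        <p>With a radius of three frames at 10 frames per second, detection glitches shorter than roughly 0.4 s are suppressed, at the cost of blurring state changes by a few frames.</p>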
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Feature Calculation</title>
        <p>The primary focus of this study is to investigate user interactions with videos during learning-oriented
web searches (see Section 3.3.1). However, we additionally experiment with video resource features,
analyzing how the selected videos themselves can influence learning (see Section 3.3.2). Table 1 shows
a complete list of extracted features.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Video Interaction Features</title>
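          <p>First, we define features based on accessing the videos (<italic>f</italic><sub>1</sub> to <italic>f</italic><sub>5</sub>) rather than on actions performed on the video (e.g.,
pausing the video). Features <italic>f</italic><sub>1</sub> and <italic>f</italic><sub>2</sub> represent the number of videos and the total dwell time on these
videos. Next, <italic>f</italic><sub>3</sub> to <italic>f</italic><sub>5</sub> indicate how much time the user spends on the video in relation to the video
duration (e.g., 80 % of the videos on average). Features <italic>f</italic><sub>6</sub> and <italic>f</italic><sub>7</sub> reflect when the user interacts with
the videos within the session or clip, which expresses whether a user primarily interacts with the videos
relatively early or late. In contrast, <italic>f</italic><sub>8</sub> captures the timestamp of the earliest interaction across
all watched videos. Features <italic>f</italic><sub>9</sub> to <italic>f</italic><sub>12</sub> count the absolute number of pauses, rewinds, forward seeks, and all video
interactions per user across the whole search session:
            <disp-formula><tex-math><![CDATA[f_{9} = \sum_{v \in V} p_v, \quad f_{10} = \sum_{v \in V} r_v, \quad f_{11} = \sum_{v \in V} s_v, \quad f_{12} = f_{9} + f_{10} + f_{11},]]></tex-math></disp-formula>
where <italic>V</italic> is the set of videos visited by a learner, and <italic>p<sub>v</sub></italic>, <italic>r<sub>v</sub></italic>, and <italic>s<sub>v</sub></italic> denote the number of pauses, rewinds, and forward seeks on video <italic>v</italic>. Additionally, we capture the interaction rate by dividing
the total number of interactions by the total dwell time:
            <disp-formula><tex-math><![CDATA[f_{13} = \frac{f_{9}}{T}, \quad f_{14} = \frac{f_{10}}{T}, \quad f_{15} = \frac{f_{11}}{T}, \quad f_{16} = \frac{f_{12}}{T}, \quad T = \sum_{v \in V} t_v,]]></tex-math></disp-formula>
where <italic>T</italic> is the total dwell time per user on videos and <italic>t<sub>v</sub></italic> is the dwell time on video <italic>v</italic>. These features give insights into how actively the
users interact with the videos. Features <italic>f</italic><sub>17</sub> to <italic>f</italic><sub>19</sub> measure the total duration of pauses and the total
rewind and forward jump distances in seconds. <italic>f</italic><sub>20</sub> is the average pausing length, covering whether a
learner takes short or longer pauses. Further, <italic>f</italic><sub>21</sub> is the average rewind distance and reflects whether a
user rewinds relatively small portions of the video or repeats whole videos. On the other hand, <italic>f</italic><sub>22</sub> is
the average forward jump distance and indicates whether a user searches thoroughly or broadly for
information. Similarly, <italic>f</italic><sub>23</sub> to <italic>f</italic><sub>25</sub> measure the average pausing duration, rewind, and forward jump
distance, independent of the dwell time. Features <italic>f</italic><sub>26</sub> to <italic>f</italic><sub>28</sub> measure the relationship between rewinds (and pauses)
and forward jumps, indicating whether a user is engaging deeply with videos or broadly scanning
(e.g., searching for specific information). Finally, <italic>f</italic><sub>29</sub> to <italic>f</italic><sub>32</sub> reflect at which average percentage of the
video the user performs an action (e.g., on average, a user pauses after 20 % of the video).</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Video Resource Features</title>
          <p>Using the tool yt-dlp [30], we downloaded each accessed video along with its associated metadata (e.g.,
video length, file size). Next, we used Whisper [31] (version large-v3) to get speech transcripts of
the videos. Inspired by text complexity and readability research, we extracted the number of characters,
syllables, words, and complex words from every speech transcript with the Python tool readability [32].
Moreover, similar to Gritz et al. [8], we used file size per frame per pixel as a proxy for visual complexity
based on the principle that more complex visual content typically results in larger compressed file sizes.</p>
          <p>Since every learner can access multiple videos, we calculate the average of the features per search
session, weighted by the dwell time per video length:
            <disp-formula><tex-math><![CDATA[\operatorname{avg}(x) = \frac{\sum_{i=1}^{|V|} w_i \cdot x_i}{\sum_{i=1}^{|V|} w_i},]]></tex-math></disp-formula>
where <italic>V</italic> is the set of videos accessed by a learner, <italic>w<sub>i</sub></italic> is the dwell time on video <italic>i</italic> divided by its length, <italic>x<sub>i</sub></italic> is the value of <italic>x</italic> for video <italic>i</italic>, and <italic>x</italic> ∈ {characters, syllables, words, complex words}.
Subsequently, we define the features as follows:
            <disp-formula><tex-math><![CDATA[f_{33} = \operatorname{avg}(\text{characters}), \quad f_{34} = \operatorname{avg}(\text{syllables}), \quad f_{35} = \operatorname{avg}(\text{words}), \quad f_{36} = \operatorname{avg}(\text{complex words}).]]></tex-math></disp-formula>
We further divide all features by the total dwell time on videos, ensuring that our features capture
interaction patterns rather than time spent on videos. Additionally, we recalculate these features
relative to video duration rather than total word count, yielding measures of speech rate (e.g., words
per minute):
            <disp-formula><tex-math><![CDATA[\operatorname{rate}(x) = \frac{\sum_{i=1}^{|V|} w_i \cdot \frac{x_i}{d_i}}{\sum_{i=1}^{|V|} w_i},]]></tex-math></disp-formula>
where <italic>d<sub>i</sub></italic> is the duration of video <italic>i</italic>. Again, we define the features as follows:
            <disp-formula><tex-math><![CDATA[f_{37} = \operatorname{rate}(\text{characters}), \quad f_{38} = \operatorname{rate}(\text{syllables}), \quad f_{39} = \operatorname{rate}(\text{words}), \quad f_{40} = \operatorname{rate}(\text{complex words}).]]></tex-math></disp-formula>
Finally, we use the file size of the MP4 files, normalized by the number of frames and the number of pixels per frame.</p>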
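          <p>The session-level weighted average can be sketched as follows. This is an illustration under stated assumptions rather than the released pipeline: the weight of a video is taken to be its dwell time divided by its length, and the dictionary keys <monospace>dwell_s</monospace>, <monospace>length_s</monospace>, and <monospace>words_per_min</monospace> are hypothetical names.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def weighted_session_average(videos, key):
    """Average the feature `key` over a learner's videos.

    ASSUMPTION: each video is weighted by dwell time / video length,
    i.e. by the fraction of the video the learner spent time on.
    """
    weights = [v["dwell_s"] / v["length_s"] for v in videos]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # learner opened videos but spent no time on them
    weighted_sum = sum(w * v[key] for w, v in zip(weights, videos))
    return weighted_sum / total_weight


# Hypothetical session: one video watched halfway, one watched completely.
videos = [
    {"dwell_s": 120.0, "length_s": 240.0, "words_per_min": 150.0},
    {"dwell_s": 300.0, "length_s": 300.0, "words_per_min": 180.0},
]
print(weighted_session_average(videos, "words_per_min"))
```
]]></preformat>
          <p>In this example the fully watched video counts twice as much as the half-watched one, so the session value lies closer to 180 than to 150.</p>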
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>In this section, we assess the effectiveness of our features by analyzing their performance for the
task of knowledge gain prediction. We begin by introducing the dataset used in our experiments
(see Section 4.1), followed by a detailed explanation of the knowledge gain (see Section 4.2) and the
evaluation metrics (see Section 4.3). For evaluation, we formulate the knowledge gain prediction as
a classification task. In this regard, we present the experimental setup (see Section 4.4), including
the classifiers, baselines, and feature selection methodology, and conclude with a discussion of the
results (see Section 4.4.4) and an analysis of feature importance (see Section 4.5). This comprehensive
evaluation aims to shed light on the role of video interactions in predicting learning outcomes and to
identify the most relevant features contributing to classification performance.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>For our evaluation, we utilized the publicly available SaL-Lightning dataset, which focuses on exploratory
web searches [12]. A total of 130 university students took part in the study, of which the data of 114
learners remained after filtering by the authors. Participants in this study were instructed to learn as
much as they could about the generation of lightning and thunder within a time limit of 30 minutes. Still,
they were allowed to finish whenever they wanted. Despite the time limit, the search was unrestricted,
meaning that any search engine and any web page on the Internet could be accessed, resulting in 808
unique visited URLs. To assess learning, the participants completed identical 10-question
multiple-choice tests both one week before and immediately following the search sessions. Compensation was
provided to all participants for their participation.</p>
        <p>Since we are interested in the interactions with videos in this study, we filtered the participants
according to the criterion that they accessed at least one YouTube video (N=94). The remaining
participants were predominantly female (79 female, 15 male) and 22.8 ± 2.8 years old. On average,
the participants visited 4.5 ± 2.6 (1-17) videos for a total duration of 598 s ± 338 s (82 s-1728 s). The
participants had 14.5 ± 18.8 (0-135) interactions (pause, rewind, jump forward), while 7 did not interact
with the videos at all.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Definition of Knowledge Gain</title>
        <p>In recent works, knowledge gain was primarily measured as the difference between post-test and
pre-test scores (correctly answered items). However, this does not consider the learner’s confidence
(e.g., guessing the answer). Therefore, we define a new measure to incorporate confidence. In the first
step, we weight the pre- and post-tests with confidence.</p>
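        <p>
          <disp-formula><tex-math><![CDATA[K_{\mathrm{pre}},\; K_{\mathrm{post}} = \sum_{i=1}^{n} (2 \cdot c_i - 1) \cdot \frac{\mathit{conf}_i}{3},]]></tex-math></disp-formula>
where <italic>c<sub>i</sub></italic> ∈ {0, 1} represents whether an answer was correct and <italic>conf<sub>i</sub></italic> ∈ {0, 1, 2, 3}
the submitted confidence in the correctness for item <italic>i</italic> for n=10 items. This results in a score for each
item between −1 (very confident but wrong) and 1 (very confident and correct). We combine <italic>K</italic><sub>pre</sub> and
<italic>K</italic><sub>post</sub> and define the knowledge gain as:
          <disp-formula><tex-math><![CDATA[\mathit{KG} = \max\left(\frac{1 + K_{\mathrm{post}} - K_{\mathrm{pre}}}{1 + n - K_{\mathrm{pre}}},\; 0\right).]]></tex-math></disp-formula></p>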
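        <p>The confidence-weighted test score can be illustrated with a short sketch. It covers only the per-test score, in which each item contributes (2 · correct − 1) · confidence / 3; the function name and the example answers are hypothetical, and the combination of the pre- and post-test scores into the knowledge gain is deliberately left out.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def confidence_weighted_score(correct, confidence):
    """Sum of per-item scores (2 * c - 1) * conf / 3.

    correct:    list of 0/1 flags, one per test item
    confidence: list of confidence ratings in {0, 1, 2, 3}
    Each item contributes a value between -1 (very confident but wrong)
    and 1 (very confident and correct).
    """
    if len(correct) != len(confidence):
        raise ValueError("correct and confidence must have the same length")
    return sum((2 * c - 1) * conf / 3 for c, conf in zip(correct, confidence))


# Hypothetical answers to four items:
# confident and correct, unsure and correct, confident but wrong, pure guess.
score = confidence_weighted_score([1, 1, 0, 0], [3, 1, 3, 0])
print(score)
```
]]></preformat>
        <p>A confidently wrong answer cancels out a confidently correct one, while a pure guess (confidence 0) contributes nothing regardless of correctness.</p>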
        <p>We assume the learner’s knowledge cannot decrease through a web search. Thus, the values can range
between 0 and 1, where 0 means that nothing was learned and 1 that everything was correct in the
post-test with full confidence.</p>
        <p>Based on the literature, the actual values are less important than the classification of whether a web
search leads to low, moderate, or high knowledge gain. Therefore, we define three categories (low,
moderate, and high) and assign participants to these categories based on their KG, using thresholds
derived from the mean <italic>μ</italic> and the standard deviation <italic>σ</italic> of the knowledge gains.
This results in the following distribution:
• low: 33 participants
• moderate: 25 participants
• high: 36 participants</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Metrics</title>
        <p>To evaluate the classification performance of the models and thus the predictive power of the features,
we utilize the metrics Precision (<italic>P</italic>), Recall (<italic>R</italic>), F1-score (<italic>F</italic><sub>1</sub>), and Accuracy (<italic>Acc</italic>). These metrics are
defined as follows:
          <disp-formula><tex-math><![CDATA[P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot P \cdot R}{P + R}, \quad \mathit{Acc} = \frac{TP + TN}{TP + TN + FP + FN},]]></tex-math></disp-formula>
where <italic>TP</italic>, <italic>FP</italic>, <italic>TN</italic>, and <italic>FN</italic> denote the number of true positives, false positives, true negatives, and
false negatives, respectively. A true positive would mean that a model predicted the same KG class (e.g.,
high) for a learner as in ground truth.
        </p>
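        <p>For a single KG class treated as the positive class, these metrics can be computed as in the following sketch. The function name and the example labels are hypothetical, and the sketch does not show how per-class values are aggregated across the three classes, which is not specified here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Precision, recall, F1 and accuracy with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators (e.g. the class is never predicted).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical ground truth and predictions for six learners.
y_true = ["low", "low", "moderate", "high", "high", "high"]
y_pred = ["low", "moderate", "moderate", "high", "high", "low"]
print(one_vs_rest_metrics(y_true, y_pred, positive="high"))
```
]]></preformat>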
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Knowledge Gain Prediction</title>
        <p>We perform a knowledge gain prediction to assess the influence of the video interaction features on the
knowledge gain. We use a recently published evaluation script for the same task [8]. The evaluation
script consists of a stratified 10-fold cross-validation for eight classifiers, including hyperparameter
optimization and feature selection. To increase the robustness of our results, we repeat the 10-fold
cross-validation 5 times with different random states and calculate the metrics based on all predictions.</p>
        <sec id="sec-4-4-1">
          <title>4.4.1. Classifiers</title>
          <p>The classifiers consist of adaboost (ada) [33], decision tree (dt) [34], naive bayes (nb) [35], gradient
boosting (gboost) [36], k-nearest neighbors (knn) [37], multilayer perceptron (mlp) [38], random forest
(rf) [39], and support vector machine (svm) [40].</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>4.4.2. Baselines</title>
          <p>In addition to the two feature sets and eight classifiers, we define three baselines that predict (1) the
majority class, (2) a stratified distribution, and (3) a uniform distribution based on the training set.
These form the lower limit for evaluating the classification results since an equal or lower value would
indicate that the features are not better suited as predictors than guessing.</p>
        </sec>
        <sec id="sec-4-4-3">
          <title>4.4.3. Feature Selection</title>
          <p>When dealing with limited data, feature selection is a fundamental preprocessing step. Feature selection
is part of hyperparameter optimization; the ideal value of <italic>k</italic> is determined based on the validation data in
each cross-validation iteration. Like the authors in [8], we use the top <italic>k</italic> features ranked by their correlation with the
knowledge gain.</p>
        </sec>
        <sec id="sec-4-4-4">
          <title>4.4.4. Results</title>
          <p>The complete classification results are presented in Table 2.</p>
          <p>Initially, we observed that video resource features appear to be poor predictors of knowledge gain.
None of the classifiers significantly outperforms the baselines, which correspond to (weighted) random
guessing. On average, the classifiers achieve a similar F1-score to the stratified baseline, indicating that
these features alone do not provide meaningful predictive power.</p>
          <p>In contrast, classification based on interaction features outperforms all baselines. The F1-score is, on
average, 13.7 % higher than the best baseline, suggesting that user interactions with the videos capture
some aspects of learning success. To confirm the robustness of these results, we conducted significance
tests on the F1-score and accuracy across all classifiers against the baselines for the three different
settings, applying Bonferroni correction for multiple comparisons. For a result to be deemed significant,
the <italic>p</italic>-value must satisfy the condition:
            <disp-formula><tex-math><![CDATA[p < \alpha_{\mathrm{adj}} = \frac{\alpha}{3},]]></tex-math></disp-formula>
where <italic>α</italic> is the uncorrected significance level.
The F1-score and accuracy for interaction features were significantly better than those of the baselines.</p>
          <p>Ultimately, the results for the combined features highlight the positive impact of interaction features
on classification performance. This finding is somewhat surprising since the interaction features were
calculated independently of the videos’ content and design. A plausible hypothesis suggests that
interaction features may indirectly capture video content elements. For example, successful users might
instinctively pause or rewind during challenging or crucial moments, aligning their actions with the
video’s complexity or significance. The following experiment will examine which features contributed
most significantly to achieving the best classification results.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Feature Importance</title>
        <p>In this experiment, we aim to identify which features contribute most to the best classification result.
We use the code provided by Gritz et al. [8] to perform a permutation feature importance analysis.
We choose the random forest trained on the interaction features, which achieved 52.6 % accuracy, and repeat the
evaluation with the corresponding hyperparameters and features from the previous experiment. In every
iteration of the cross-validation, each feature is discarded 100 times, and the decrease in accuracy is
measured. Figure 2 shows the result.</p>
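        <p>The general idea of permutation feature importance can be sketched as follows. This is not the evaluation code from [8]: it assumes that "discarding" a feature means shuffling its column, and it uses a hypothetical toy model and toy data instead of the random forest and the interaction features.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=100, seed=0):
    """Mean drop in accuracy when one feature column is shuffled.

    ASSUMPTION: a feature is "discarded" by permuting its values across
    samples, which breaks its relationship with the target.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    # Hypothetical classifier that only looks at the first feature.
    return int(row[0] > 0.5)


X = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 2.0], [0.3, 4.0], [0.7, 0.0]]
y = [0, 1, 0, 1, 0, 1]
print(permutation_importance(toy_model, X, y))
```
]]></preformat>
        <p>Shuffling the first column lowers the accuracy of the toy model, whereas shuffling the second column, which the model ignores, leaves it unchanged.</p>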
        <p>[Figure 2: Permutation feature importance for the interaction features. Axes: feature name and feature importance.]</p>
        <p>The most important feature is the interaction rate, which reflects how actively a learner engages
with the videos. Surprisingly, we found a weak negative correlation (Pearson <italic>r</italic> = −0.20) between this
feature and the continuous knowledge gain value. The second to fifth most important features (although
not all selected in every iteration) depend on the rewind interactions. We also found a weak negative
correlation (Pearson <italic>r</italic> = −0.29) between the rewind rate and knowledge gain. One possible explanation
is that these interactions indicate challenges in learning with videos rather than deeper engagement.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper, we investigated the relationship between video interaction features and knowledge gain
during learning-oriented web searches on the SaL-Lightning dataset [12]. First, we derived interaction
logs from screen recordings through a semi-automatic procedure. Next, we developed both (1) features
representing the user interactions with the videos seen across the web searches and (2) features
indicating the speech rate and (visual) complexity of the video resources. We performed a knowledge
gain prediction based on the classification framework provided by recent research [8]. Finally, we
analyzed the importance of the individual features in the best classification result.</p>
      <p>Surprisingly, the classification results based on the video resource features did not show better
values than random guessing. These characteristics may be insufficient to capture the diversity of
educational videos and require further research. On the other hand, we observed a significant 13.7 %
increase in F1-score for the video interaction features, showing that the learning outcomes can be
partially explained by the users’ interaction with the videos in the learning sessions. Additionally, we
found that the interaction rate, especially the rewind rate, is the most important predictor for knowledge
gain. These features revealed a weak negative correlation with knowledge gain, suggesting that high
values might indicate difficulties in learning with videos. However, a limitation of the results is that
they were obtained using data from a single study with a single learning task and require verification.
Nevertheless, our analysis provides a basis for further research on video-based learning in real-life
settings.</p>
      <p>These results could be considered when designing assistive tools (e.g., browser add-ons) to support
learners actively experiencing difficulties. Furthermore, video designers could adjust their videos
accordingly when many users exhibit this behavior (if the information is available). Future research
could investigate which aspects of a video lead to increased interaction rates. An additional step could
be to predict specific moments in a video that trigger interactions to provide further support (e.g., extra
information or system pauses).</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Part of this work was financially supported by the Leibniz Association, Germany (Leibniz Competition
2023, funding line "Collaborative Excellence", project VideoSRS [K441/2022]).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT, Grammarly, and DeepL in order
to: Grammar and spelling check, Paraphrase and reword, Improve writing style, and Text Translation.
After using these tools/services, the author(s) reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      <p>[10] T.-C. Liu, Y.-C. Lin, S. Kalyuga, Effects of complexity-determined system pausing on learning from
multimedia presentations, Australasian Journal of Educational Technology 38 (2021) 102–114.
doi:10.14742/ajet.7267.</p>
      <p>[11] I. A. Spanjers, T. van Gog, P. Wouters, J. J. van Merriënboer, Explaining the segmentation effect in
learning from animations: The role of pausing and temporal cueing, Computers &amp; Education 59
(2012) 274–280. doi:10.1016/j.compedu.2011.12.024.</p>
      <p>[12] C. Otto, M. Rokicki, G. Pardi, W. Gritz, D. Hienert, R. Yu, J. von Hoyer, A. Hoppe, S. Dietze, P. Holtz,
Y. Kammerer, R. Ewerth, SaL-Lightning dataset: Search and eye gaze behavior, resource interactions
and knowledge gain during web search, in: D. Elsweiler (Ed.), CHIIR ’22: ACM SIGIR Conference
on Human Information Interaction and Retrieval, Regensburg, Germany, March 14 - 18, 2022,
ACM, 2022, pp. 347–352. URL: https://doi.org/10.1145/3498366.3505835. doi:10.1145/3498366.3505835.</p>
      <p>[13] A. Z. Broder, A taxonomy of web search, SIGIR Forum 36 (2002) 3–10.
URL: https://doi.org/10.1145/792550.792552.</p>
      <p>[14] Y. Ghafourian, A. Hanbury, P. Knoth, Ranking for learning: Studying users’ perceptions of
relevance, understandability, and engagement, in: International Conference on Theory and
Practice of Digital Libraries, TPDL 2023, Zadar, Croatia, volume 14241 of Lecture Notes in Computer
Science, Springer, 2023, pp. 284–291. URL: https://doi.org/10.1007/978-3-031-43849-3_25.</p>
      <p>[15] M. Rokicki, R. Yu, D. Hienert, Learning to rank for knowledge gain, in: Joint Proceedings of
the International Workshop on News Recommendation and Analytics (INRA 2022) and the 3rd
International Workshop on Investigating Learning During Web Search (IWILDS 2022) co-located
with 45th International ACM SIGIR Conference on Research and Development in Information
Retrieval, SIGIR, Madrid, Spain, CEUR-WS.org, 2022, pp. 60–68.
URL: https://ceur-ws.org/Vol-3411/IWILDS-paper2.pdf.</p>
      <p>[16] P. Vakkari, Searching as learning: A systematization based on literature, Journal of Information
Science 42 (2016) 7–18. URL: https://doi.org/10.1177/0165551515615833.</p>
      <p>[17] N. Roy, F. Moraes, C. Hauff, Exploring users’ learning gains within search sessions, in: Conference
on Human Information Interaction and Retrieval, CHIIR, Vancouver, Canada, ACM, 2020, pp.
432–436. URL: https://doi.org/10.1145/3343413.3378012.</p>
      <p>[18] A. Câmara, D. E. Zein, C. da Costa Pereira, RULK: A framework for representing user knowledge in
search-as-learning, in: International Conference on Design of Experimental Search &amp; Information
REtrieval Systems, DESIRES, San Jose, USA, CEUR-WS.org, 2022, pp. 1–13.
URL: https://ceur-ws.org/Vol-3480/paper-01.pdf.</p>
      <p>[19] H. Nasser, D. E. Zein, C. da Costa Pereira, C. Escazut, A. Tettamanzi, RULKKG: estimating
user’s knowledge gain in search-as-learning using knowledge graphs, in: Conference on Human
Information Interaction and Retrieval, CHIIR, Sheffield, United Kingdom, ACM, 2024, pp. 364–369.
URL: https://doi.org/10.1145/3627508.3638331.</p>
      <p>[20] D. E. Zein, A. Câmara, C. da Costa Pereira, A. Tettamanzi, RULKNE: representing user knowledge
state in search-as-learning with named entities, in: Conference on Human Information Interaction
and Retrieval, CHIIR, Austin, USA, ACM, 2023, pp. 388–393.
URL: https://doi.org/10.1145/3576840.3578330.</p>
      <p>[21] D. E. Zein, C. da Costa Pereira, The evolution of user knowledge during search-as-learning sessions:
A benchmark and baseline, in: Conference on Human Information Interaction and Retrieval, CHIIR
2023, Austin, TX, USA, ACM, 2023, pp. 454–458. URL: https://doi.org/10.1145/3576840.3578273.</p>
      <p>[22] R. Syed, K. Collins-Thompson, Exploring document retrieval features associated with improved
short- and long-term vocabulary learning outcomes, in: Conference on Human Information
Interaction and Retrieval, CHIIR, New Brunswick, USA, ACM, 2018, pp. 191–200.
URL: https://doi.org/10.1145/3176349.3176397.</p>
      <p>[23] R. Yu, U. Gadiraju, P. Holtz, M. Rokicki, P. Kemkes, S. Dietze, Predicting user knowledge gain
in informational search sessions, in: International Conference on Research &amp; Development in
Information Retrieval, SIGIR, Ann Arbor, USA, ACM, 2018, pp. 75–84.
URL: https://doi.org/10.1145/3209978.3210064.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kammerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ewerth</surname>
          </string-name>
          ,
          <article-title>Current challenges for studying search as learning processes</article-title>
          ,
          <source>in: Workshop on Learning &amp; Education with Web Data (LILE2018), in conjunction with ACM Web Science</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Collins-Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Rieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Haynes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Syed</surname>
          </string-name>
          ,
          <article-title>Assessing learning outcomes in web search: A comparison of tasks and query strategies</article-title>
          ,
          <source>in: Conference on Human Information Interaction and Retrieval, CHIIR</source>
          , Carrboro, USA, ACM,
          <year>2016</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>172</lpage>
          . URL: https://doi.org/10.1145/2854946.2854972.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <article-title>Lessons from the journey: a query log analysis of within-session learning</article-title>
          ,
          <source>in: International Conference on Web Search and Data Mining, WSDM</source>
          , New York, NY, USA,
          <year>2014</year>
          , ACM,
          <year>2014</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>232</lpage>
          . URL: https://doi.org/10.1145/2556195.2556217.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>U.</given-names>
            <surname>Gadiraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holtz</surname>
          </string-name>
          ,
          <article-title>Analyzing knowledge gain of users in informational search sessions on the web</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Belkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Byström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference on Human Information Interaction and Retrieval, CHIIR</source>
          <year>2018</year>
          , New Brunswick, NJ, USA, March 11-15,
          <year>2018</year>
          , ACM,
          <year>2018</year>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>11</lpage>
          . URL: https://doi.org/10.1145/3176349.3176381. doi:10.1145/3176349.3176381.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Gritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ewerth</surname>
          </string-name>
          ,
          <article-title>On the impact of features and classifiers for measuring knowledge gain during web search - A case study</article-title>
          ,
          <source>in: Workshops co-located with the International Conference on Information and Knowledge Management, CIKM</source>
          , Gold Coast, Australia, CEUR-WS.org,
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-3052/paper6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ghafourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          ,
          <article-title>Readability measures as predictors of understandability and engagement in searching to learn</article-title>
          ,
          <source>in: International Conference on Theory and Practice of Digital Libraries, TPDL</source>
          <year>2023</year>
          , Zadar, Croatia, volume
          <volume>14241</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>181</lpage>
          . URL: https://doi.org/10.1007/978-3-031-43849-3_15.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rokicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Gadiraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <article-title>Topic-independent modeling of user knowledge in informational search sessions</article-title>
          ,
          <source>Information Retrieval Journal</source>
          <volume>24</volume>
          (
          <year>2021</year>
          )
          <fpage>240</fpage>
          -
          <lpage>268</lpage>
          . URL: https://doi.org/10.1007/s10791-021-09391-7.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Gritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ewerth</surname>
          </string-name>
          ,
          <article-title>Unraveling the impact of visual complexity on search as learning</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2501.05289. arXiv:2501.05289.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Mayer</surname>
          </string-name>
          (Ed.),
          <source>The Cambridge Handbook of Multimedia Learning</source>
          , 2nd ed., Cambridge University Press,
          <year>2014</year>
          . doi:10.1017/CBO9781139547369.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>