<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>E. Navarrete); anett.hoppe@tib.eu
(A. Hoppe); ralph.ewerth@tib.eu (R. Ewerth)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Review on Recent Advances in Video-based Learning Research: Video Features, Interaction, Tools, and Technologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Evelyn Navarrete</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anett Hoppe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Ewerth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center, Leibniz University Hannover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TIB - Leibniz Information Centre for Science and Technology</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Human learning is shifting more strongly than ever towards online settings, and especially towards video platforms. There is an abundance of tutorials and lectures covering diverse topics, from fixing a bike to particle physics. While it is advantageous that learning resources are freely available on the Web, the quality of the resources varies considerably. Given the number of available videos, users need algorithmic support in finding helpful and entertaining learning resources. In this paper, we present a review of the recent research literature (2020-2021) on video-based learning. We focus on publications that examine the characteristics of video content, analyze frequently used features and technologies, and, finally, derive conclusions on trends and possible future research directions.</p>
      </abstract>
      <kwd-group>
        <kwd>Video-based Learning</kwd>
        <kwd>web-based learning</kwd>
        <kwd>literature review</kwd>
        <kwd>video features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Web has fundamentally changed the way we learn. Video platforms in particular, such as YouTube, play an increasing role here: of the about 30 million video views a day, about 50% relate to some kind of learning content [1]. Another YouTube-related statistic states that as of May 2019, about 500 hours of new content were uploaded to the platform every minute [2], implying that users are reliant on effective search and recommendation algorithms to find content that is relevant to their learning needs.</p>
      <p>These algorithms need to consider multiple factors to provide suitable rankings and recommendations. Especially in learning contexts, an open question is: When is a video the best way to learn, depending on the individual user, the learning objective, and context factors? This question has been investigated in several studies in recent years, examining characteristics of the videos and the surrounding platforms to determine when video-based learning (VBL) processes are especially successful. The resulting insights are of interest for all the involved stakeholders: (a) platform providers who wish to provide their users with the best possible content from their database, but also (b) content producers who aim to develop efficient and entertaining learning resources, (c) teachers who seek to enhance their teaching by integrating adapted multimedia resources, and finally, (d) students who pursue a more effective learning process and experience.</p>
      <p>In this paper, we provide a review of recent research on the analysis of video-based learning, with a focus on studies that examine video characteristics. For this purpose, we performed a systematic literature search. Here, we present a preliminary summary of our findings from the most recent publications, covering the years 2020 and 2021. We systematize the contents of 41 reviewed papers and provide an overview on (a) the chosen research approach (e.g., empirical study, tool prototype, production guidelines), (b) considered video characteristics, and (c) tasks and technologies used to develop VBL tools and frameworks. From these dimensions, we derive recent research trends and identify gaps that provide directions for future research.</p>
      <p>The rest of this paper is structured as follows: Section 2 presents the related work; Section 3 describes our methodology; Section 4 provides an overview of the video-based learning field, its benefits, potentials, and the current challenges, and explains why identifying relevant features in videos is important. Section 5 presents the results of the review. Finally, we summarize our findings, point out the limitations, and conclude with indications for future research in Section 6.</p>
      <p>Table fragment: Ref. [3] (2013), [4] (2014), [5] (2018), [6] (2020); 2008-2019.</p>
      <sec id="sec-1-1">
        <p>This literature review covers studies extracted from three academic databases: (a) the Digital Bibliography and Library Project (DBLP), (b) the Association for Computing Machinery (ACM), and (c) SpringerLink.</p>
        <p>These are specialized in the computer science field and,
thus, provide suitable domain specificity not only for our
review of technological systems surrounding video-based
learning but also to identify the state of the art in this
specific research area.</p>
        <p>We iteratively refined our set of query terms, to include
new vocabulary from the research papers. Whenever we
encountered a new term to describe research around VBL
in an article, we went back to the databases and re-ran</p>
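        <p>The iterative refinement of query terms described above can be pictured as a loop that runs until no new vocabulary turns up. The following is only a minimal sketch under stated assumptions: the function <monospace>refine_query_terms</monospace>, its two callbacks, and the toy index are hypothetical stand-ins for the manual database searches and term spotting, and are not taken from the authors' procedure.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def refine_query_terms(seed_terms, search, extract_terms, max_rounds=10):
    """Expand a set of query terms until a round yields no new term.

    search(term) -> iterable of paper ids      (hypothetical database lookup)
    extract_terms(paper) -> iterable of terms  (hypothetical vocabulary spotting)
    """
    known = set(seed_terms)      # every term queried so far
    frontier = set(seed_terms)   # terms that still have to be queried
    papers = set()               # all papers retrieved so far
    for _ in range(max_rounds):
        if not frontier:
            break                # fixed point: nothing new to query
        new_papers = set()
        for term in sorted(frontier):
            new_papers |= set(search(term)) - papers
        papers |= new_papers
        discovered = set()
        for paper in new_papers:
            discovered |= set(extract_terms(paper))
        frontier = discovered - known   # only unseen terms trigger a re-run
        known |= frontier
    return known, papers


# Toy data (invented for illustration only).
TOY_INDEX = {
    "video-based learning": ["p1", "p2"],
    "lecture video": ["p2", "p3"],
    "educational video": ["p4"],
}
TOY_VOCAB = {
    "p1": ["video-based learning"],
    "p2": ["video-based learning", "lecture video"],
    "p3": ["lecture video", "educational video"],
    "p4": ["educational video"],
}

terms, papers = refine_query_terms(
    ["video-based learning"],
    lambda term: TOY_INDEX.get(term, []),
    lambda paper: TOY_VOCAB.get(paper, []),
)
```
]]></preformat>
        <p>In this toy run, the single seed term leads to two further terms and four papers before the loop stops. The <monospace>max_rounds</monospace> cap is a design choice of the sketch that guards against vocabularies that keep growing.</p>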
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Video-based Learning</title>
      <p>learner can be supported in her video-based learning trajectory. It is widely recognized that learners have diverse needs [8], depending on their current learning objective, context, and general preferences. How to analyze and index video resources to satisfy these individual requirements is still an open research question. As we will show below, exploring possible features for the automatic determination of video content and quality is one active area of research. These explorations are indispensable in the quest to allow diverse learners a satisfying video-based learning experience.</p>
      <p>Lastly, MOOCs (Massive Open Online Courses) and similar educational courses are a central research area, mainly because they have enabled data analysis at scale. But it is precisely the large-scale nature of MOOCs that can cause issues in handling and processing such a massive amount of information. As stated by Guo et al. “The scale of data from MOOC interaction logs hundreds of thousands of students from around the world and millions of video watching sessions is four orders of magnitude larger than those available in prior studies” [8, p. 42]. Moreover, criticism directed at the research community has been raised as well. For example, a challenge that needs to be confronted in these large-scale scenarios is finding how to incorporate controlled experiments [4] and methods or instruments from other fields, such as the field of psychology, instead of relying only on data analysis [5].</p>
      <p>Importance of features: Research on VBL poses a number of relevant questions: How can efficient learning be facilitated and improved? How to enable long-term learning success? How to ensure the learner’s engagement and satisfaction? The answer to these questions strongly relies on one base research question: What characterizes an effective learning video? [8, 4, 5] The search for relevant video features has been widely studied from the perspective of different research fields (e.g., [9]), and in an exploration of what is possible with current technologies (e.g., [10]). In consequence, there is a multitude of variables that have been extracted, studied, and manipulated. Video is a complex medium combining audio, text, and visual information, from all of which features can be extracted and combined. These can range from low-level features, such as audio quality features (e.g., [11]), to high-level features that try to capture semantics (e.g., [12]). Features need not come directly from the video but can also result from the interaction with a video, e.g., views and likes (e.g., [13]). In addition to the characteristics of the video, it is vital to study its environment. The success of a VBL process is further influenced by functionalities of the surrounding platform, the learner’s context, and the instructor captured in the video.</p>
      <p>In our review, we focus on those features directly related to the learning video, with the goal to outline current insights and research trends. The discussion of learner-related features, such as previous knowledge and skills, is out of the scope of this review (see, for instance, [14] for a recent review).</p>
      <p><bold>5. Literature Review</bold></p>
      <p>This section summarizes our findings from the analysis of 41 publications on video-based learning in the period 2020-2021. We analyze (a) the principal research approach (Section 5.1); (b) the target task and the used technologies (Section 5.2); (c) the video features explored in the studies (Section 5.3); and, finally, (d) verify former surveys’ finding that there is a specific focus on STEM subjects in VBL studies (Section 5.4).</p>
      <p><bold>5.1. Research Approaches</bold></p>
      <p>Four main research approaches have been identified in the reviewed literature: (1) Controlled experiments, which in this study are defined as experiments where one or a few variables are manipulated (according to the learning context, video characteristics, or content) and where the impact of this manipulation is measured by some target metric. This group includes experiments run in laboratory settings, but also those performed in realistic academic environments such as university courses. This category accounts for 22 instances (54%) in the reviewed papers. (2) Novel tools, architectures, or analysis pipelines surrounding video-based learning scenarios are covered in 16 (39%) articles. (3) Design principles and guidelines for the creation of educational videos are presented in two publications (5%). (4) Results of the analysis of data generated by users in a video-based online learning platform are presented in one paper (2%).</p>
      <p>Table 4 lists all the publications in each category. The high number of empirical studies matches Poquet et al.’s [5] statement that, especially after 2016, empirical works are on the rise. However, an extension of the reviewed time period is necessary to see if our findings concord with the authors’ temporal placement of the change. Within this category (controlled experiments), 21 out of 22 studies reported the number of participants in the experiment (sample size). The smallest study reported 12 participants [15] and the largest recruited a group of 229 learners [16]. The average number of participants was 85 learners (SD=56). Studies that focused on the development of tools (group (2)) or data analysis of learning platforms such as MOOCs (group (4)) often do not report a sample size.</p>
      <p>Some of the subsequent sections focus on a subset of the groups since not all dimensions can be meaningfully extracted from all research objectives. The following review of used technologies, for instance, only makes sense in the context of works that have a technology component, and will, consequently, mostly cover works of group (2).</p>
      <p><bold>5.2. Target Task and Used Technologies</bold>
</p>
      <p>This section focuses on the 16 studies which present a technology component. Roughly a third of these studies (N=5, 31%) present recommender systems that aim to support learners in finding suitable learning material. One of the systems does not only recommend video resources but also learning materials in other modalities (e.g., text articles such as Wikipedia pages [39]). Another two systems provide the user with a video sequence (playlist) instead of single learning contents [41, 42].</p>
      <p>A popular approach to recommend the material is based on the extraction of topical keywords from the video’s speech transcript [39, 40, 42], especially by using the tf-idf algorithm. Those are then used to compute the similarity of the currently viewed video to the other ones in the database [39, 40]. Tavakoli et al. [13] suggest using not only the current video for this but all resources previously viewed by the user. They combine this information with a profile of the learner’s skill set and predict relevant videos using a Random Forest classifier. Another technique is to detect prerequisite relations between the videos to generate sequences organized by difficulty and complexity [41, 42]. Lastly, Tang et al. [42] additionally analyze viewer comments with sentiment analysis to generate a playlist, assuming that positive-sounding comments point to engaging learning material.</p>
      <p>Other common use cases are systems for the forecasting of learning success (N=3, 19%). “Success” can be operationalized in different ways: In two studies, the objective is to predict the final score of the learner as “pass” or “fail”. The first case is in the context of passing a university MOOC course [36], and the second case is about passing video quizzes presented by videos in an e-learning platform [38]. In a third study, the target is to forecast the probability of a student dropping out of a Coursera course (a MOOC provider) [37]. All these recent examples make use of deep learning technologies, specifically, neural networks with gated recurrent units [37], recurrent neural networks (RNN) [38], or long short-term memory neural networks (LSTM) [36]. These classifiers have been mainly fed with real-time features [36, 37], especially play, pause, and speed rate; however, aggregated data [37] and results of quizzes have also been explored [38].</p>
      <p>Two of the 16 reviewed papers in this section (13%) focus on video segmentation. Video as a learning medium has the downside that all information is presented sequentially. Unlike text, videos cannot be easily skimmed to find relevant information. This can be alleviated by automatic video segmentation. In our sample, one paper aims to automatically determine important segments in the educational content using a bidirectional LSTM network classifier built on pre-trained models to extract text and visual features [11]. Das &amp; Das [12] improve on topical segmentation of lecture videos by identifying concepts in the speech transcripts. To achieve this, speech-to-text technologies are used, pre-trained neural network models are used to analyze the resulting text, and the cluster centroid algorithm is used to group the resulting concepts.</p>
      <p>The remaining six studies (38%) focus on diverse tasks such as video indexing [46], video ranking [43], video customization according to the type of learner [47], prediction of instructors’ enthusiasm [48], matching videos with practical exercises [44] and helping educators with the creation of teaching materials [45].</p>
      <p><bold>5.3. Features</bold></p>
      <p>The core focus of this review is the collection of video characteristics that have been studied in the literature (2020 and 2021). The objective is to provide an account of how video features have been investigated, extracted, and varied in the context of studies on video-based learning in the computer science field. We developed a feature taxonomy that groups the video features into categories, which allows for a structured presentation of our findings.</p>
      <p>We identified eight high-level categories: (a) audio features, (b) visual features, (c) textual features, (d) features related to the instructor’s behavior, (e) features resulting from the interaction of the learner with the video (e.g., likes), (f) interactive features, (g) production style (e.g., Khan style), and (h) features related to instructional design principles. All of these will be defined and their appearances reviewed in separate paragraphs.</p>
      <p>Audio features are all features that relate to the actual video’s audio stream and also the features that can be extracted from it. This includes audio records (e.g., [39]), and quality-related features, such as energy, entropy, and spectral features which allow making predictions on how well the auditory information can be perceived, e.g., [9, 11, 45]. A second big cluster is person-related audio features, which groups characteristics directly related to the instructor’s expression [24, 45, 48]. This includes, for instance, metrics for the teacher’s voice [45, 48], speech rate [50], used vocabulary [50], and emphasis on specific words that guide the listener’s attention [24].</p>
      <p>The category visual features contains analyses or adaptations related to the video’s image information. This includes the analysis of information in video frames as images [11, 12, 44]. Particular to video, in contrast to text and still image, is that movement can guide the learner’s attention. These types of features are captured in the category visual representation features where we can find, for example, highlighting and visual cues (e.g., animated and appearing elements) [24, 25, 26, 27, 19, 45]. There is also work investigating visual quality features [9]. Finally, some works explore enhanced visual representations by including 360-degree scene representations [20, 10, 21], augmented [23], and virtual reality features [22].</p>
      <p>In the category text features, we group every text-based information which is available about the video or can be extracted from it. It is common practice to generate a speech transcript from the spoken content in a learning video. This pre-processing step transforms hard-to-analyze speech into written text, for which sophisticated and efficient analysis methods are available. Indeed, in our sample, speech transcripts are widely used [11, 12, 13, 39, 40, 41, 42, 43, 44, 46], and are referenced in all the works on video segmentation [11, 12], and on recommending subsequent videos [36, 37, 38].</p>
      <p>Many video platforms provide metadata about the hosted videos, which are widely used as an easily available data source. These include structured information about the video’s title, attributed keywords, tags, and similar [13, 40, 42]. Video length has been used, for instance, by Mubarak et al. [36] as a factor to predict the learning outcome, and by Tavakoli et al. [13] to suggest similar videos. Finally, text, especially in learning videos, also appears to support and visualize the speakers’ message in the form of superimposed scene text, often in the form of presentation slides (visual text) [24, 44].</p>
      <p>In every learning process, the teacher is a central facilitator; there is thus a number of features related to instructor behavior [16, 25, 45, 48]. This group captures gesturing [16, 25] and facial emotions [48], features related to the body such as volume and pose [48], and also information about the speaker’s personality [45], which have been used as base techniques to determine higher-level information.</p>
      <p>Several studies use learner interactions with the video as a feature. This includes real-time features of a user interacting with a certain instructional video [36, 37, 33, 26, 34, 35, 47], e.g., playing and pausing a video, skipping video content or interrupting; and aggregated measures, such as the number and ratio of videos watched in a time period, or the aggregated behavior of several users on a certain video (e.g., [47]). Besides, user interactions are used as indicators of video popularity [13, 9], in the form of the number of views and ratings.</p>
      <p>Interactive functionalities surrounding the video: A critical challenge in videos is that information is transient [51] and consumed in a rather passive way [52, 53]. This invites superficial processing and leads to challenges in knowledge acquisition and integration. In consequence, several studies investigate functionalities to actively involve the user [38, 33, 30, 28, 26, 32, 29, 31] by adding functionalities such as interactive quizzes, collaborative elements, or note-taking areas.</p>
      <p>The production style category classifies characteristic types of video design [54]. It includes rough sub-categories of how the learning setting is composed, e.g., if people are visible (talking-head video, lecture setting with or without an audience) or not (voice-over-slides, Khan-style lecture, animations, and films of real-world phenomena), if there is a single speaker (monologue) or several (dialogue, interview), and in which perspective the content is presented (upfront, or over-the-shoulder view in tutorials). Production style was mentioned several times as a feature in our sample [18, 32, 15, 17, 19].</p>
      <p>There is a number of psychological and educational theories on how to efficiently support learners with multimedia resources; references to those are collected in the category instructional design principles. The only theoretical element referenced in our sample is the set of Multimedia Learning Principles introduced by Mayer [55], which was used by Eradze et al. [9]. They collected manual annotations that state whether a video follows those design principles (see Table 5 for examples). The aim was to find a correlation between the principles and students’ perceptions regarding the quality of the video.</p>
      <p>As shown by the taxonomy in Table 5, from all the studies reviewed in this research work (2020-2021), the most explored video features are text-related features, especially transcripts. The next most investigated are interactive features, which study additional functionalities (quizzes, note-taking) that impact the learning success. In third place, we find real-time features. Other commonly explored features are visual features, especially features that direct attention (e.g., animation), enhanced visuals (e.g., 360-degree features), and production style (e.g., talking head). These findings are similar to the results of Poquet et al. [5] who show that text, animation, audio, and production style are the most explored features.</p>
      <p>Lastly, we found that the impact of the above mentioned features has been measured mainly through the learning outcome using metrics such as recall and transfer [16, 33, 30, 23, 24, 25, 28, 26, 27, 10, 32, 15, 29, 22, 34, 17, 35, 19], which, again, conforms to the results of Poquet et al. [5]. Other frequently used metrics are: cognitive load [16, 25, 32, 22, 34, 17, 31, 19, 33, 23, 26, 18, 32, 22], motivation [30, 23, 20, 10, 22, 17], self-efficacy [28, 22, 17], mental effort [16, 32, 19], eye-gaze direction [25, 15, 17], engagement [20, 26], comprehension (understanding) [21, 29], and social presence [16, 17]. Of these, motivation, cognitive load, mental effort, and the results of eye-tracking were also included in the findings of Poquet et al. [5] as often used metrics. On the other hand, the less frequently used metrics encompass: enjoyment [10], agent-persona (credibility, engagement, and learning facilitating) [16], affective rating [16], para-social interaction [16], self-regulated learning [33], sense of presence (feeling of being present in the environment) [10], perceived benefits [21], confidence [19], creative thinking [22], application [21], analysis [21], synthesis [21], evaluation [21], attention span [32], views about the teaching technique [27], facial temperature [19], and challenge [19].</p>
      <p><bold>5.4. Covered Subject Domain</bold></p>
      <p>Previous literature reviews stated that STEM (Science, Technology, Engineering &amp; Mathematics) areas are over-represented in the investigation of VBL [3, 5]. This is confirmed in our sample of research papers from 2020-2021. In our first category of controlled experiments, 14 of 22 studies (64%) use videos on some STEM domain. Specifically, computer science [28, 20, 17, 35] and physics [30, 18] are often targeted. In the remaining eight papers, teaching English as a foreign language is the most common discipline [23, 29], followed by other social and humanities domains.</p>
      <p>In the research categories “Data analysis” and “Design principles &amp; guidelines” the main focus is also on STEM video content. However, the small sample size in these categories does not lead to further interpretation. Papers in the “Tools” research category do not usually focus on specific disciplines but aim at offering general solutions for VBL. This is why no specific analysis regarding the subject domain is provided.</p>
      <sec id="sec-2-1">
        <title>Table 5 (fragment): example entries of the feature taxonomy</title>
        <list list-type="simple">
          <list-item><p>Titles, keywords, tags, video length</p></list-item>
          <list-item><p>Visual text (text extracted from frames)</p></list-item>
          <list-item><p>Textual cues (guiding attention to a key location in a slide)</p></list-item>
          <list-item><p>Beat gestures (e.g., hand strokes, natural hand-waving, and pointing gestures that aim at highlighting the speech), facial emotions</p></list-item>
          <list-item><p>Volume (expansion of the teacher’s body), pose</p></list-item>
          <list-item><p>Play, pause, rate-change (speed), seek forward, seek backward</p></list-item>
          <list-item><p>Percentage of the video that was watched</p></list-item>
          <list-item><p>Average proportion of videos watched per week</p></list-item>
          <list-item><p>Standard deviation of the proportion of videos watched per week</p></list-item>
          <list-item><p>Number of views, user ratings, relevancy score (rank in search results)</p></list-item>
          <list-item><p>Quizzes, annotation, feedback, exchange (actions to interact with other learners)</p></list-item>
          <list-item><p>Tutorial, lecture, Khan style, talking head, dialogue and monologue, voice over slides, only-text, animations included</p></list-item>
          <list-item><p>Coherence, signalling, spatial contiguity, segmentation (information in segments, rather than a long stream), pre-training (present first the basic information), modality (graphics and spoken words), multimedia (words together with pictures), personalization (informal conversational style), voice (human rather than a computer voice), image (animations)</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-2">
        <title>6. Conclusions</title>
        <p>In this paper, we have presented a survey on 41 papers in the field of video-based learning (2020-2021) to provide a structured account of current tendencies in the field. Specifically, we identified (a) common research approaches, (b) target applications and the used technologies, (c) investigated video features and developed a taxonomy which allows structuring the related work, and, finally, (d) collected information on the covered subject domains of the learning environments.</p>
        <p>Research approaches: The results show that more than half of the studies are dedicated to controlled experiments (54% of the studies). A significant proportion focused on tool development (39% of the studies), while a very small proportion of studies dedicated effort to analyzing data from learning platforms such as MOOCs (2% of the studies) or proposed design principles and guidelines (5% of the studies).</p>
        <p>Tools and technologies: The main applications in our sample are three: (1) recommender systems (31% of the studies in the “Tools” category), (2) predictors of learning outcome (19%), and (3) video segmentation techniques (13%). Technology-wise, keyword-based analyses (e.g., using tf-idf to detect pertinent keywords) are the most commonly adopted. Approaches that aim to provide video playlists (as opposed to singular video recommendations) often refer to techniques from the area of prerequisite detection. Recent approaches to forecasting learning success mainly build upon deep learning techniques (e.g., multilayer perceptrons, gated recurrent units, RNNs, and LSTMs). A similar tendency towards deep learning approaches can be seen in methods aiming to improve video segmentation for educational videos; the approaches widely rely on pre-trained models for the extraction of textual and visual features.</p>
        <p>Video features: In general, the most explored video features are text-related, especially metadata and speech transcripts. The second most studied features are interactive (such as interactive quizzes and other additional functionalities), followed by click-stream features, visual features, and production style. Specifically, in the controlled experiments category, the main features manipulated are related to interactive features (32% of the studies), principally quizzes and annotations. Other often explored features in this category are enhanced visual features, mainly 360-degree features, production style features (18% of the studies) with a special emphasis on dialogue and monologue, and features that direct the attention of the learner (18% of the studies) (e.g., animations).</p>
        <p>Subject domain: Based on our sample, we found that STEM domains are prevalent in the targeted subject areas.</p>
        <p>Tendencies and directions: In comparison to previous reviews, the tendencies that could be reaffirmed in this study are mainly related to the type of research that was performed, the type of features that have been manipulated, and the subject domain on which the educational videos have focused. Specifically, the trends discovered by Poquet et al. [5] could be confirmed by our review: (1) It was observed that the research community directed effort principally towards controlled experiments (54% of the reviewed papers). (2) Text, animation, audio, and production style are situated among the most explored features. (3) Recall and transfer are the most-used metrics to measure learning outcome and effectiveness. (4) Cognitive load, mental effort, motivation, and the results of eye-tracking are considered among the most popular metrics for measuring learning experience. (5) Target disciplines of the videos were primarily addressing STEM topics.</p>
        <p>Research on video, text, and image analysis has been dominated by deep learning approaches in recent years. Increasingly, these also find application in the analysis of educational data, exemplified by their dominance in learning outcome prediction, educational video segmentation, and feature extraction. In most cases, the developed systems make use of general-purpose pre-trained models, even though first examples of models specifically trained on educational data exist, e.g., EduBERT [56], a word embedding trained on educational materials. It will be exciting to see the potential of deep learning techniques properly adapted to educational media, and how they will enhance educational applications.</p>
        <p>There are various studies that investigate learning on the web, mainly focusing on features of user behavior [57] or textual materials. Only recent publications such as [58] explore the role of videos in such web-based learning processes. However, as shown here, the analysis of video as a learning resource is a highly active research area.</p>
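        <p>The keyword-based analyses mentioned under “Tools and technologies” (tf-idf keywords from speech transcripts, followed by a similarity comparison between videos) can be illustrated with a small, self-contained sketch. It is not the implementation of any reviewed system: the tf-idf variant (length-normalized term frequency, natural-log inverse document frequency), the cosine similarity measure, the function names, and the toy transcripts are all assumptions made for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer that keeps alphabetic tokens only."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_vectors(transcripts):
    """Map each video id to a sparse tf-idf vector (dict: term -> weight)."""
    tokens = {vid: tokenize(text) for vid, text in transcripts.items()}
    n_docs = len(tokens)
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))          # count each term once per video
    vectors = {}
    for vid, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1              # avoid division by zero
        vectors[vid] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(current_id, transcripts, top_k=2):
    """Rank the other videos by similarity to the currently viewed one."""
    vectors = tfidf_vectors(transcripts)
    scored = [
        (cosine(vectors[current_id], vec), vid)
        for vid, vec in vectors.items()
        if vid != current_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by id
    return [vid for _, vid in scored[:top_k]]


# Toy transcripts (invented for illustration only).
TRANSCRIPTS = {
    "v1": "gradient descent minimizes a loss function",
    "v2": "stochastic gradient descent for neural networks",
    "v3": "how to fix a flat bike tire",
}
```
]]></preformat>
        <p>With these toy transcripts, <monospace>recommend("v1", TRANSCRIPTS, top_k=1)</monospace> ranks the second optimization video ahead of the bike-repair video. A practical system would add stop-word removal and work on automatically generated transcripts, which the sketch leaves out.</p>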
        <p>The related work outlined here can be a guide for future
works which aim to integrate video-based resources into
individualized online-learning trajectories, and will
provide interested scientists with a starting point for their
research.</p>
        <p>The reviewed papers investigated a wide range of video
features, including information from the visual content,
be it textual or image, and audio information. However,
the modalities were mainly considered separately,
without regard to their interactions. Research on VBL
efficiency could be advanced by a truly multi-modal
analysis of educational videos, which examines how
spoken language, displayed text and images, and speaker
behavior jointly convey a message, as explored, for
instance, by Shi et al. [59]. This is in line with psychological
research on instructional design, and might yield new
insights for educational research by quantifying formal
relationships between image, text, and speech and their
impact on learning success.</p>
        <p>Limitations and future work: This review, with a
focus on publications from the years 2020 and 2021,
covers only a short period of time. We plan to extend the
study in the future to provide a more thorough analysis
of tendencies and directions in VBL. Moreover, although
the keyword set used in the paper retrieval process is
extensive, we plan to broaden this set to ensure a more
comprehensive literature review. As pointed out by the
reviewers of this paper, there is a surprisingly low
number of MOOC-related studies included in our dataset. This
will be considered in the extension of our keyword set.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <p>This work has been partly supported by the Ministry
of Science and Education of Lower Saxony, Germany,
through the Graduate training network “LernMINT:
Data-assisted classroom teaching in the MINT subjects”.</p>
        <p>Also, this study has been partly supported by the
Leibniz Association, Germany (Leibniz Competition 2018,
funding line “Collaborative Excellence”, project SALIENT
[K68/2017]). Finally, we would like to thank the reviewers
for their valuable feedback.</p>
        <p>compedu.2020.104096. doi:10.1016/j.compedu.2020.104096.</p>
        <p>[11] J. A. Ghauri, S. Hakimov, R. Ewerth, Classification of important segments in educational videos using multimodal features, in: S. Conrad, I. Tiddi (Eds.), Proceedings of the CIKM 2020 Workshops co-located with 29th ACM International Conference on Information and Knowledge Management (CIKM 2020), Galway, Ireland, October 19-23, 2020, volume 2699 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2699/paper15.pdf.</p>
        <p>[12] A. Das, P. P. Das, Incorporating domain knowledge to improve topic segmentation of long MOOC lecture videos, CoRR abs/2012.07589 (2020). URL: https://arxiv.org/abs/2012.07589. arXiv:2012.07589.</p>
        <p>[13] M. Tavakoli, S. Hakimov, R. Ewerth, G. Kismihók, A recommender system for open educational videos based on skill requirements, in: 20th IEEE International Conference on Advanced Learning Technologies, ICALT 2020, Tartu, Estonia, July 6-9, 2020, IEEE, 2020, pp. 1–5. URL: https://doi.org/10.1109/ICALT49669.2020.00008. doi:10.1109/ICALT49669.2020.00008.</p>
        <p>[14] A. Abyaa, M. Khalidi Idrissi, S. Bennani, Learner modelling: systematic review of the literature from the last 5 years, Educational Technology Research and Development 67 (2019) 1105–1143. doi:10.1007/s11423-018-09644-1.</p>
        <p>[15] A. Nugraha, I. A. Wahono, J. Zhanghe, T. Harada, T. Inoue, Creating dialogue between a tutee agent and a tutor in a lecture video improves students’ attention, in: A. Nolte, C. Alvarez, R. Hishiyama, I. Chounta, M. J. Rodríguez-Triana, T. Inoue (Eds.), Collaboration Technologies and Social Computing - 26th International Conference, CollabTech 2020, Tartu, Estonia, September 8-11, 2020, Proceedings, volume 12324 of Lecture Notes in Computer Science, Springer, 2020, pp. 96–111. URL: https://doi.org/10.1007/978-3-030-58157-2_7. doi:10.1007/978-3-030-58157-2_7.</p>
        <p>[16] M. Beege, M. Ninaus, S. Schneider, S. Nebel, J. Schlemmel, J. Weidenmüller, K. Moeller, G. D. Rey, Investigating the effects of beat and deictic gestures of a lecturer in educational videos, Comput. Educ. 156 (2020) 103955. URL: https://doi.org/10.1016/j.compedu.2020.103955. doi:10.1016/j.compedu.2020.103955.</p>
        <p>[17] B. Lee, K. Muldner, Instructional video design: Investigating the impact of monologue- and dialogue-style presentations, in: R. Bernhaupt, F. F. Mueller, D. Verweij, J. Andres, J. McGrenere, A. Cockburn, I. Avellino, A. Goguey, P. Bjøn, S. Zhao, B. P. Samson, R. Kocielnik (Eds.), CHI ’20: CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, April 25-30, 2020, ACM, 2020, pp. 1–12. URL: https://doi.org/10.1145/3313831.3376845. doi:10.1145/3313831.3376845.</p>
        <p>[18] A. Pérez-Navarro, V. Garcia, J. Conesa, Students perception of videos in introductory physics courses of engineering in face-to-face and online environments, Multim. Tools Appl. 80 (2021) 1009–1028. URL: https://doi.org/10.1007/s11042-020-09665-0. doi:10.1007/s11042-020-09665-0.</p>
        <p>[19] N. Srivastava, S. Nawaz, J. M. Lodge, E. Velloso, S. M. Erfani, J. Bailey, Exploring the usage of thermal imaging for understanding video lecture designs and students’ experiences, in: C. Rensing, H. Drachsler (Eds.), LAK ’20: 10th International Conference on Learning Analytics and Knowledge, Frankfurt, Germany, March 23-27, 2020, ACM, 2020, pp. 250–259. URL: https://doi.org/10.1145/3375462.3375514. doi:10.1145/3375462.3375514.</p>
        <p>[20] J. C. Muñoz-Carpio, M. A. Cowling, J. R. Birt, Doctoral colloquium - exploring the benefits of using 360° video immersion to enhance motivation and engagement in system modelling education, in: D. Economou, A. Klippel, H. Dodds, A. Peña-Ríos, M. J. W. Lee, D. Beck, J. Pirker, A. Dengel, T. M. Peres, J. Richter (Eds.), 6th International Conference of the Immersive Learning Research Network, iLRN 2020, San Luis Obispo, CA, USA, June 21-25, 2020, IEEE, 2020, pp. 403–406. URL: https://doi.org/10.23919/iLRN47897.2020.9155100. doi:10.23919/iLRN47897.2020.9155100.</p>
        <p>[21] W. Daher, H. Sleem, Middle school students’ learning of social studies in the video and 360-degree videos contexts, IEEE Access 9 (2021) 78774–78783. URL: https://doi.org/10.1109/ACCESS.2021.3083924. doi:10.1109/ACCESS.2021.3083924.</p>
        <p>[22] H. Huang, G. Hwang, C. Chang, Learning to be a writer: A spherical video-based virtual reality approach to supporting descriptive article writing in high school Chinese courses, Br. J. Educ. Technol. 51 (2020) 1386–1405. URL: https://doi.org/10.1111/bjet.12893. doi:10.1111/bjet.12893.</p>
        <p>[23] C.-H. Chen, AR videos as scaffolding to foster students’ learning achievements and motivation in EFL learning, British Journal of Educational Technology 51 (2020) 657–672.</p>
        <p>[24] X. Wang, L. Lin, M. Han, J. M. Spector, Impacts of cues on learning: Using eye-tracking technologies to examine the functions and designs of added cues in short instructional videos, Comput. Hum. Behav. 107 (2020) 106279. URL: https://doi.org/10.1016/j.chb.2020.106279. doi:10.1016/j.chb.2020.106279.</p>
        <p>[25] J. Moon, J. Ryu, The effects of social and cognitive cues on learning comprehension, eye-gaze pattern, and cognitive load in video instruction, J. Comput. High. Educ. 33 (2021) 39–63. URL: https://doi.org/10.1007/s12528-020-09255-x. doi:10.1007/s12528-020-09255-x.</p>
        <p>[26] A. Cookson, D. Kim, T. Hartsell, Enhancing student achievement, engagement, and satisfaction using animated instructional videos, Int. J. Inf. Commun. Technol. Educ. 16 (2020) 113–125. URL: https://doi.org/10.4018/IJICTE.2020070108. doi:10.4018/IJICTE.2020070108.</p>
        <p>[27] M. A. Al-Khateeb, A. M. Alduwairi, Effect of teaching geometry by slow-motion videos on the 8th graders’ achievement, Int. J. Interact. Mob. Technol. 14 (2020) 57–67. URL: https://www.online-journals.org/index.php/i-jim/article/view/12985.</p>
        <p>[28] M. C. Sözeri, S. B. Kert, Ineffectiveness of online interactive video content developed for programming education, Int. J. Comput. Sci. Educ. Sch. 4 (2021) 49–69. URL: https://doi.org/10.21585/ijcses.v4i3.99. doi:10.21585/ijcses.v4i3.99.</p>
        <p>[29] A. A. Kuhail, M. S. Aqel, Interactive digital videos and their impact on sixth graders’ English reading and vocabulary skills and retention, Int. J. Inf. Commun. Technol. Educ. 16 (2020) 42–56. URL: https://doi.org/10.4018/IJICTE.2020070104. doi:10.4018/IJICTE.2020070104.</p>
        <p>[30] D. Leisner, C. G. Zahn, A. Ruf, A. A. P. Cattaneo, Different ways of interacting with videos during learning in secondary physics lessons, in: C. Stephanidis, M. Antona (Eds.), HCI International 2020 - Posters - 22nd International Conference, HCII 2020, Copenhagen, Denmark, July 19-24, 2020, Proceedings, Part II, volume 1225 of Communications in Computer and Information Science, Springer, 2020, pp. 284–291. URL: https://doi.org/10.1007/978-3-030-50729-9_40. doi:10.1007/978-3-030-50729-9_40.</p>
        <p>[31] S. Chen, D. Wang, Y. Huang, Exploring the complementary features of audio and text notes for video-based learning in mobile settings, in: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–7.</p>
        <p>[32] X. Lu, Q. Li, X. Wang, Research on the impacts of feedback in instructional videos on college students’ attention and learning effects, in: W. Shen, J. A. Barthès, J. Luo, Y. Shi, J. Zhang (Eds.), 24th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD 2021, Dalian, China, May 5-7, 2021, IEEE, 2021, pp. 513–516. URL: https://doi.org/10.1109/CSCWD49262.2021.9437774. doi:10.1109/CSCWD49262.2021.9437774.</p>
        <p>[33] D. C. D. van Alten, C. Phielix, J. Janssen, L. Kester, Self-regulated learning support in flipped learning videos enhances learning outcomes, Comput. Educ. 158 (2020) 104000. URL: https://doi.org/10.1016/j.compedu.2020.104000. doi:10.1016/j.compedu.2020.104000.</p>
        <p>[34] N. Garrett, Segmentation’s failure to improve software video tutorials, Br. J. Educ. Technol. 52 (2021) 318–336. URL: https://doi.org/10.1111/bjet.13000. doi:10.1111/bjet.13000.</p>
        <p>[35] D. Lang, G. Chen, K. Mirzaei, A. Paepcke, Is faster better?: a study of video playback speed, in: C. Rensing, H. Drachsler (Eds.), LAK ’20: 10th International Conference on Learning Analytics and Knowledge, Frankfurt, Germany, March 23-27, 2020, ACM, 2020, pp. 260–269. URL: https://doi.org/10.1145/3375462.3375466. doi:10.1145/3375462.3375466.</p>
        <p>[36] A. A. Mubarak, H. Cao, S. A. M. Ahmed, Predictive learning analytics using deep learning model in MOOCs’ courses videos, Educ. Inf. Technol. 26 (2021) 371–392. URL: https://doi.org/10.1007/s10639-020-10273-6. doi:10.1007/s10639-020-10273-6.</p>
        <p>[37] B. Jeon, N. Park, Dropout prediction over weeks in MOOCs by learning representations of clicks and videos, CoRR abs/2002.01955 (2020). URL: https://arxiv.org/abs/2002.01955. arXiv:2002.01955.</p>
        <p>[38] H. E. Aouifi, Y. Es-Saady, M. E. Hajji, M. Mimis, H. Douzi, Toward student classification in educational video courses using knowledge tracing, in: M. Fakir, M. Baslam, R. E. Ayachi (Eds.), Business Intelligence - 6th International Conference, CBI 2021, Beni Mellal, Morocco, May 27-29, 2021, Proceedings, volume 416 of Lecture Notes in Business Information Processing, Springer, 2021, pp. 73–82. URL: https://doi.org/10.1007/978-3-030-76508-8_6. doi:10.1007/978-3-030-76508-8_6.</p>
        <p>[39] C. Schulten, S. Manske, A. Langner-Thiele, H. U. Hoppe, Digital value-adding chains in vocational education: Automatic keyword extraction from learning videos to provide learning resource recommendations, in: C. Alario-Hoyos, M. J. Rodríguez-Triana, M. Scheffel, I. A. Sánchez, S. Dennerlein (Eds.), Addressing Global Challenges and Quality Education - 15th European Conference on Technology Enhanced Learning, EC-TEL 2020, Heidelberg, Germany, September 14-18, 2020, Proceedings, volume 12315 of Lecture Notes in Computer Science, Springer, 2020, pp. 15–29. URL: https://doi.org/10.1007/978-3-030-57717-9_2. doi:10.1007/978-3-030-57717-9_2.</p>
        <p>[40] J. Jordán, S. Valero, C. Turró, V. J. Botti, Recommending learning videos for MOOCs and flipped classrooms, in: Y. Demazeau, T. Holvoet, J. M. Corchado, S. Costantini (Eds.), Advances in Practical Applications of Agents, Multi-Agent Systems, and Trustworthiness. The PAAMS Collection - 18th International Conference, PAAMS 2020, L’Aquila, Italy, October 7-9, 2020, Proceedings, volume 12092 of Lecture Notes in Computer Science, Springer, 2020, pp. 146–157. URL: https://doi.org/10.1007/978-3-030-49778-1_12. doi:10.1007/978-3-030-49778-1_12.</p>
        <p>[41] M. C. Aytekin, S. Räbiger, Y. Saygin, Discovering the prerequisite relationships among instructional videos from subtitles, in: A. N. Rafferty, J. Whitehill, C. Romero, V. Cavalli-Sforza (Eds.), Proceedings of the 13th International Conference on Educational Data Mining, EDM 2020, Fully virtual conference, July 10-13, 2020, International Educational Data Mining Society, 2020. URL: https://educationaldatamining.org/files/conferences/EDM2020/papers/paper_99.pdf.</p>
        <p>[42] C. Tang, J. Liao, H. Wang, C. Sung, W. Lin, Conceptguide: Supporting online video learning with concept map-based recommendation of learning path, in: J. Leskovec, M. Grobelnik, M. Najork, J. Tang, L. Zia (Eds.), WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, ACM / IW3C2, 2021, pp. 2757–2768. URL: https://doi.org/10.1145/3442381.3449808. doi:10.1145/3442381.3449808.</p>
        <p>[43] D. I. Bleoanca, S. Heras, J. Palanca, V. Julián, M. C. Mihaescu, LSI based mechanism for educational videos retrieval by transcripts processing, in: C. Analide, P. Novais, D. Camacho, H. Yin (Eds.), Intelligent Data Engineering and Automated Learning - IDEAL 2020 - 21st International Conference, Guimaraes, Portugal, November 4-6, 2020, Proceedings, Part I, volume 12489 of Lecture Notes in Computer Science, Springer, 2020, pp. 88–100. URL: https://doi.org/10.1007/978-3-030-62362-3_9. doi:10.1007/978-3-030-62362-3_9.</p>
        <p>[44] X. Wang, W. Huang, Q. Liu, Y. Yin, Z. Huang, L. Wu, J. Ma, X. Wang, Fine-grained similarity measurement between educational videos and exercises, in: C. W. Chen, R. Cucchiara, X. Hua, G. Qi, E. Ricci, Z. Zhang, R. Zimmermann (Eds.), MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, ACM, 2020, pp. 331–339. URL: https://doi.org/10.1145/3394171.3413783. doi:10.1145/3394171.3413783.</p>
        <p>[45] A. Hartholt, A. Reilly, E. Fast, S. Mozgai, Introducing canvas: Combining nonverbal behavior generation with user-generated content to rapidly create educational videos, in: S. Marsella, R. Jack, H. H. Vilhjálmsson, P. Sequeira, E. S. Cross (Eds.), IVA ’20: ACM International Conference on Intelligent Virtual Agents, Virtual Event, Scotland, UK, October 20-22, 2020, ACM, 2020, pp. 25:1–25:3. URL: https://doi.org/10.1145/3383652.3423880. doi:10.1145/3383652.3423880.</p>
        <p>[46] S. Horovitz, Y. Ohayon, Boocture: Automatic educational videos hierarchical indexing with ebooks, in: H. Mitsuhara, Y. Goda, Y. Ohashi, M. M. T. Rodrigo, J. Shen, N. Venkatarayalu, G. Wong, M. Yamada, C. Lei (Eds.), IEEE International Conference on Teaching, Assessment, and Learning for Engineering, TALE 2020, Takamatsu, Japan, December 8-11, 2020, IEEE, 2020, pp. 482–489. URL: https://doi.org/10.1109/TALE48869.2020.9368461. doi:10.1109/TALE48869.2020.9368461.</p>
        <p>[47] S. Lallé, C. Conati, A data-driven student model to provide adaptive support during video watching across MOOCs, in: I. I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, E. Millán (Eds.), Artificial Intelligence in Education - 21st International Conference, AIED 2020, Ifrane, Morocco, July 6-10, 2020, Proceedings, Part I, volume 12163 of Lecture Notes in Computer Science, Springer, 2020, pp. 282–295. URL: https://doi.org/10.1007/978-3-030-52237-7_23. doi:10.1007/978-3-030-52237-7_23.</p>
        <p>[48] Y. Chen, C. Wang, Z. Jian, Research on evaluation algorithm of teacher’s teaching enthusiasm based on video, in: ICRAI 2020: 6th International Conference on Robotics and Artificial Intelligence, Singapore, November 20-22, 2020, ACM, 2020, pp. 184–191. URL: https://doi.org/10.1145/3449301.3449333. doi:10.1145/3449301.3449333.</p>
        <p>[49] T. Weinert, M. T. de Gafenco, M. S. Billert, N. Boerner, Fostering interaction in higher education with deliberate design of interactive learning videos, in: J. F. George, S. Paul, R. De’, E. Karahanna, S. Sarker, G. Oestreicher-Singer (Eds.), Proceedings of the 41st International Conference on Information Systems, ICIS 2020, Making Digital Inclusive: Blending the Local and the Global, Hyderabad, India, December 13-16, 2020, Association for Information Systems, 2020. URL: https://aisel.aisnet.org/icis2020/digital_learning_env/digital_learning_env/12.</p>
        <p>[50] J. Ge, X. Li, Design strategies of EFL learning videos: Exampled by a China MOOC, in: ICEIT 2020, Proceedings of the 9th International Conference on Educational and Information Technology, Oxford, UK, February 11-13, 2020, ACM, 2020, pp. 68–71. URL: https://doi.org/10.1145/3383923.3383927. doi:10.1145/3383923.3383927.</p>
        <p>[51] R. E. Mayer, C. Pilegard, Principles for managing essential processing in multimedia learning: Segmenting, pretraining, and modality principles, The Cambridge handbook of multimedia learning (2005) 169–182.</p>
        <p>[52] R. Ramachandran, E. M. Sparck, M. Levis-Fitzgerald, Investigating the effectiveness of using application-based science education videos in a general chemistry lecture course, J. Chem. Educ. 96 (2019) 479–485. URL: https://doi.org/10.1021/acs.jchemed.8b00777. doi:10.1021/acs.jchemed.8b00777.</p>
        <p>[53] G. Salomon, Television is "easy" and print is "tough": The differential investment of mental effort in learning as a function of perceptions and attributions, Journal of Educational Psychology 76 (1984) 647–658. doi:10.1037/0022-0663.76.4.647.</p>
        <p>[54] K. Chorianopoulos, A taxonomy of asynchronous instructional video styles, The International Review of Research in Open and Distributed Learning 19 (2018) 294–311.</p>
        <p>[55] R. Mayer, R. E. Mayer, The Cambridge handbook of multimedia learning, Cambridge University Press, 2005.</p>
        <p>[56] B. Clavié, K. Gal, EduBERT: Pretrained deep language models for learning analytics, CoRR abs/1912.00690 (2019). URL: http://arxiv.org/abs/1912.00690. arXiv:1912.00690.</p>
        <p>[57] U. Gadiraju, R. Yu, S. Dietze, P. Holtz, Analyzing knowledge gain of users in informational search sessions on the web, in: C. Shah, N. J. Belkin, K. Byström, J. Huang, F. Scholer (Eds.), Proceedings of the 2018 Conference on Human Information Interaction and Retrieval, CHIIR 2018, New Brunswick, NJ, USA, March 11-15, 2018, ACM, 2018, pp. 2–11. URL: https://doi.org/10.1145/3176349.3176381. doi:10.1145/3176349.3176381.</p>
        <p>[58] C. Otto, R. Yu, G. Pardi, J. von Hoyer, M. Rokicki, A. Hoppe, P. Holtz, Y. Kammerer, S. Dietze, R. Ewerth, Predicting knowledge gain during web search based on multimedia resource consumption, in: I. Roll, D. S. McNamara, S. A. Sosnovsky, R. Luckin, V. Dimitrova (Eds.), Artificial Intelligence in Education - 22nd International Conference, AIED 2021, Utrecht, The Netherlands, June 14-18, 2021, Proceedings, Part I, volume 12748 of Lecture Notes in Computer Science, Springer, 2021, pp. 318–330. URL: https://doi.org/10.1007/978-3-030-78292-4_26. doi:10.1007/978-3-030-78292-4_26.</p>
        <p>[59] J. Shi, C. Otto, A. Hoppe, P. Holtz, R. Ewerth, Investigating correlations of automatically extracted multimodal features and lecture video quality, in: Proceedings of the 1st International Workshop on Search as Learning with Multimedia Information, SALMM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 11–19. URL: https://doi.org/10.1145/3347451.3356731. doi:10.1145/3347451.3356731.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>