<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classification of Important Segments in Educational Videos using Multimodal Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Junaid Ahmed Ghauri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sherzod Hakimov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Ewerth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center, Leibniz University Hannover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TIB - Leibniz Information Centre for Science and Technology</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Videos are a commonly used type of content for learning during Web search. Many e-learning platforms provide quality content, but educational videos are sometimes long and cover many topics. Humans are good at extracting important sections from videos, but this remains a significant challenge for computers. In this paper, we address the problem of assigning importance scores to video segments, that is, how much information they contain with respect to the overall topic of an educational video. We present an annotation tool and a new dataset of annotated educational videos collected from popular online learning platforms. Moreover, we propose a multimodal neural architecture that utilizes state-of-the-art audio, visual, and textual features. Our experiments investigate the impact of visual and temporal information, as well as the combination of multimodal features, on importance prediction.</p>
      </abstract>
      <kwd-group>
        <kwd>educational videos</kwd>
        <kwd>importance prediction</kwd>
        <kwd>video analysis</kwd>
        <kwd>video summarization</kwd>
        <kwd>MOOC</kwd>
        <kwd>deep learning</kwd>
        <kwd>e-learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the era of e-learning, videos are one of the most important media for conveying information to learners, and they are also used intensively during informal learning on the Web [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Many academic institutions have started to host recordings of their educational content, while various platforms such as Massive Open Online Courses (MOOCs) have emerged where a large part of the available educational content consists of videos.
      </p>
      <p>
        Such educational videos on MOOC platforms are also exploited in search as learning scenarios; their potential advantages compared with informal Web search have been investigated by Moraes et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Although many platforms pay a lot of attention to the quality of the video content, the length of videos is not always considered a major factor. Many academic institutions provide content where the whole lecture is recorded without any breaks. Such lengthy content can be difficult for learners to follow in distance learning. As mentioned by Guo et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], shorter videos are more engaging than pre-recorded classroom lectures split into smaller pieces for MOOCs. Moreover, pre-planned educational videos, a talking head, illustrations using hand drawings on a board or table, and speech tempo are other key factors for engagement in a video lecture, as described by Zolotykhin and Mashkina [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In this paper, we introduce computational models that predict the importance of segments in (lengthy) videos. Our model architectures incorporate visual, audio, and text (transcription of audio) information to predict importance scores for each segment of an educational video. A sample video and its importance scores for each segment are shown in Figure 1. A value between 1 and 10 is assigned to each segment, indicating whether it contains important information regarding the overall topic of a video. We refer to this value as the importance score of a video segment in the educational domain, similar to the annotations provided by the TVSum dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] on various Web videos. We have developed an annotation tool that allows annotators to assign importance scores to video segments, and we created a new dataset for this task (see Section 4).
      </p>
      <p>The contributions of this paper are summarized as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>a video annotation tool and an annotated dataset,</p>
        </list-item>
        <list-item>
          <p>an analysis of the influence of multimodal features and parameters (history window size) on educational video summarization,</p>
        </list-item>
        <list-item>
          <p>multimodal neural architectures for the prediction of importance scores for video segments,</p>
        </list-item>
        <list-item>
          <p>the source code of the defined deep learning models, the annotation tool, and the newly created dataset, which are shared publicly with the research community (https://github.com/VideoAnalysis/EDUVSUM).</p>
        </list-item>
      </list>
      <p>The remaining sections of the paper are organized as follows. Section 2 presents an overview of related work on video-based e-learning and on computational architectures covering multiple modalities in the educational domain. In Section 3, we provide a detailed description of the model architectures. Section 4 presents the annotation tool and the created dataset. Section 5 covers the experimental results and discusses the findings, and Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related Work</title>
      <p>
        Various studies have been conducted that address the quality of online education, create personalized recommendations for learners, or focus on highlighting the most important parts of lecture videos. Student interaction with lecture videos offers new opportunities to understand the performance of students or to analyze their learning progress. Recently, Mubarak et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed an architecture that uses features from e-learning platforms, such as watch time, plays, pauses, and forward and backward jumps, to train deep learning models for predictive learning analytics. In a similar way, Shukor and Abdullah [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used watch time, clicks, and the number of completed assignments for the same purpose. Another method, by Tang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], is a concept-map based approach that analyzes the transcripts of videos collected from YouTube and provides visual recommendations to improve the learning path and offer personalized content. In order to improve student performance and enhance the learning paradigm, high-tech devices are recommended for the classroom setting and content presentation. For instance, instructors or presenters can highlight important sections, which can be saved along with the video data and later be used by students when they are going through the video lectures.
      </p>
      <p>
        Research in the field of video summarization addresses a similar problem, where important and relevant content from videos is classified to generate summaries (for instance, [10, 11] and [12]). All of these methods are based on the TVSum [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and SumMe [13] datasets, which consist of Web videos. The nature of these datasets is very different from videos in the educational domain: they can be a good source of visual features, but spoken words or textual content are relatively rare or not present at all. Inspired by video summarization work, Davila and Zanibbi [14] presented a method to detect written content in videos, e.g. on whiteboards. This research focuses on a sub-task that only takes into account lectures in which written content is available, and it addresses only the topic of mathematics. Xu et al. [15] focused on another kind of technique, where speaker pose information helps in action classification, e.g. writing, explaining, or erasing. Here, the explaining actions are the most relevant ones, which could be an indication of important segments in educational videos.
      </p>
      <p>
        Another important aspect of e-learning is student engagement with different types of online resources. Guo et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] analyzed various aspects of MOOC videos and provided a number of related recommendations. Shi et al. [16] analyzed the correlation of features and lecture quality by considering visual features from slides, linguistic elements, and audio features such as energy, frequency, and pitch to highlight important and emphasized statements in a lecture video. As suggested by Ichimura et al. [17], one of the best practices in MOOCs is to offer information on which parts of a lecture video are difficult or need more attention, which could potentially lead to a more flexible and personalized learning experience. In order to perform such tasks, machines need to incorporate multimodal information from educational content. Dealing with multimodal data is not easy, and this is also true for multimodal learning, as explained by Wang et al. [18]. If user interaction data are available for videos along with visual and textual information, then the task can be solved by multimodal deep learning models.
      </p>
    </sec>
      <sec id="sec-1-1">
        <title>3. Multimodal Architecture</title>
        <p>In this section, we describe the proposed model architecture that predicts importance scores for each video segment by fusing audio, visual, and textual features.</p>
        <p>
          Each video contains audio, visual, and textual (subtitle) content, i.e., three different modalities. To join the different modalities, we adapt and extend ideas from Majumder et al. [19], who apply fusion to the three kinds of modalities available in videos: visual, audio, and text. The overall architecture is depicted in Fig. 2. In order to deal with the temporal aspect of videos, we use Bidirectional Long Short-Term Memory (BiLSTM) layers to incorporate information from each modality [
          <xref ref-type="bibr" rid="ref7">7, 12, 20</xref>
          ]. We use state-of-the-art pre-trained models to encode each modality in order to extract features. After the extraction of the feature embeddings for each modality, they are fed into separate BiLSTM layers. The outputs of these layers are concatenated in a time-oriented way and then fed into another BiLSTM layer, which has 64 units. The output is fed into two dense layers of sizes 32 and 16, respectively. Lastly, the output of the last dense layer is fed into a softmax layer that outputs a 10-dimensional vector indicating the importance score of a given input video frame belonging to a certain segment. In addition to the current frame, the model also includes history information consisting of the h preceding frames, according to the setting of the history window size parameter. Our experiments evaluate different history window sizes and report the corresponding results. Next, we describe the feature embeddings for each modality and the corresponding models used to extract them.
        </p>
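        <p>As an illustration of this fusion scheme, the sketch below builds such a model with TensorFlow/Keras. The layer sizes (64 BiLSTM units per modality, a 64-unit fusion BiLSTM, dense layers of 32 and 16, and a 10-way softmax) follow the description above; the feature dimensions, the ReLU activations, the feature-wise concatenation at each time step, and the loss function are illustrative assumptions.</p>
        <preformat>
# Minimal sketch of the described fusion architecture (TensorFlow/Keras).
# Feature dimensions and activations other than the final softmax are assumptions.
from tensorflow.keras import layers, Model

HISTORY = 3        # history window size h
TEXT_DIM = 768     # BERT embedding size (see Textual Features)
AUDIO_DIM = 68     # pyAudioAnalysis features plus deltas (see Audio Features)
VISUAL_DIM = 512   # e.g. globally pooled VGG-16 features (assumption)

def build_model(history=HISTORY):
    steps = history + 1  # current frame plus h preceding frames
    text_in = layers.Input(shape=(steps, TEXT_DIM), name="text")
    audio_in = layers.Input(shape=(steps, AUDIO_DIM), name="audio")
    visual_in = layers.Input(shape=(steps, VISUAL_DIM), name="visual")

    # One 64-unit BiLSTM per modality, returning sequences so that the
    # outputs can be concatenated in a time-oriented way.
    branches = []
    for branch_input in (text_in, audio_in, visual_in):
        branches.append(layers.Bidirectional(
            layers.LSTM(64, return_sequences=True, dropout=0.2))(branch_input))

    fused = layers.Concatenate(axis=-1)(branches)
    fused = layers.Bidirectional(layers.LSTM(64, dropout=0.2))(fused)
    fused = layers.Dense(32, activation="relu")(fused)
    fused = layers.Dense(16, activation="relu")(fused)
    out = layers.Dense(10, activation="softmax", name="importance")(fused)

    model = Model(inputs=[text_in, audio_in, visual_in], outputs=out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
        </preformat>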
        <p>Textual Features: The textual content is based on the subtitles provided for each video. The text features are extracted by encoding the words in the subtitles using BERT (Bidirectional Encoder Representations from Transformers) [21] embeddings. BERT is a pre-trained transformer that takes the sentence context into account in order to assign a dense vector representation to each word in a sentence. The textual features are 768-dimensional vectors extracted by encoding the subtitles of the videos. Later, these features are passed to a layer with 64 BiLSTM cells.</p>
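        <p>A minimal sketch of this encoding step is given below, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint (the exact checkpoint is an assumption); it yields one 768-dimensional contextual vector per token of a subtitle line.</p>
        <preformat>
# Minimal sketch: 768-dimensional contextual BERT embeddings for a subtitle line.
# The checkpoint and the example sentence are assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

subtitle = "Gradient descent updates the model weights iteratively."
inputs = tokenizer(subtitle, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = bert(**inputs)

# One contextualized 768-dimensional vector per (sub-)word token.
token_embeddings = outputs.last_hidden_state.squeeze(0)
print(token_embeddings.shape)  # (num_tokens, 768)
        </preformat>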
        <p>Audio Features: The audio content is represented by various features such as the zero crossing rate, energy, entropy of energy, spectral features (centroid, spread, flux, roll-off), and others. In total, 34 such features are computed per analysis window, where the number of windows depends on the window and step sizes, which are 0.05 and 0.025% of the audio track length of a video. Adding the rate of change of all these features yields a total of 68 features. We use the pyAudioAnalysis [22] toolkit to extract these features. They are fed into a layer with 64 BiLSTM units; we keep the same number of units in the BiLSTM layers of all modalities.</p>
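        <p>The sketch below shows how such features can be obtained with pyAudioAnalysis, assuming the audio track has already been exported to a WAV file; the 50 ms window and 25 ms step used here are common settings for the toolkit and are assumptions rather than the exact values stated above.</p>
        <preformat>
# Minimal sketch of short-term audio feature extraction with pyAudioAnalysis.
# The file name and the window/step settings are assumptions.
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

sampling_rate, signal = audioBasicIO.read_audio_file("lecture_audio.wav")
signal = audioBasicIO.stereo_to_mono(signal)

# 34 short-term features per window; deltas=True adds their rate of change,
# which yields the 68 features mentioned above.
features, feature_names = ShortTermFeatures.feature_extraction(
    signal, sampling_rate,
    window=0.050 * sampling_rate,
    step=0.025 * sampling_rate,
    deltas=True)

print(len(feature_names), features.shape)  # 68, (68, num_windows)
        </preformat>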
        <p>Visual Features: We explored different visual models, namely Xception [23], ResNet-50 [24], VGG-16 [25], and Inception-v3 [26], all pre-trained on the ImageNet dataset. The visual content of the videos is encoded using one of these visual descriptors. Our ablation study in Section 5 provides further details on the importance of the choice of visual descriptor. Once the features are extracted, they are fed into a BiLSTM layer with a size of 64.</p>
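        <p>A minimal sketch of encoding a single sampled frame with an ImageNet-pretrained backbone (VGG-16 as an example) is shown below; which network layer the descriptor is taken from is not specified above, so globally averaged convolutional features and the frame file name are assumptions.</p>
        <preformat>
# Minimal sketch: encode one sampled frame with an ImageNet-pretrained CNN.
# Global average pooling over the last conv block is an assumption.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")

frame = image.load_img("frame_0001.jpg", target_size=(224, 224))
x = image.img_to_array(frame)
x = preprocess_input(np.expand_dims(x, axis=0))

visual_descriptor = backbone.predict(x)[0]  # 512-dimensional for VGG-16
print(visual_descriptor.shape)
        </preformat>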
        <p>Consider a video input of T sampled frames, i.e., V = (v<sub>t</sub>)<sub>t=1,...,T</sub>, where v<sub>t</sub> is the visual frame at point in time t. The variable T depends on the number of selected frames per second of a video. The original frame rate of a video is 30 frames per second (fps). The input video is split into uniform segments of 5 seconds, from which we select 3 frames per second as the sampling rate. The inputs of the model are the current frame v<sub>t</sub> at time step t and the preceding frames v<sub>t-1</sub>, v<sub>t-2</sub>, ..., v<sub>t-h</sub>, according to the selected history window size h. The features of each modality are extracted as defined above and passed to the respective layers. The model outputs an importance score for the given input frame v<sub>t</sub>.</p>
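        <p>The sketch below illustrates this sampling and history-window scheme; the use of OpenCV and the file name are assumptions made for illustration.</p>
        <preformat>
# Minimal sketch of the frame sampling described above: keep 3 frames per
# second of a 30 fps video and group them into 5-second segments, then build
# inputs consisting of the current frame and its h preceding frames.
import cv2

def sample_frames(path, keep_per_second=3, segment_seconds=5):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    stride = int(round(fps / keep_per_second))   # every 10th frame at 30 fps
    frames, segment_ids = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
            segment_ids.append(int(idx / (fps * segment_seconds)))
        idx += 1
    cap.release()
    return frames, segment_ids

def history_window(frames, t, h):
    # Current frame plus the h preceding sampled frames (shorter at the start).
    start = max(0, t - h)
    return frames[start:t + 1]

frames, segment_ids = sample_frames("lecture.mp4")
window = history_window(frames, t=10, h=3)
        </preformat>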
      </sec>
    <sec id="sec-2">
      <title>4. Dataset and Annotation Tool</title>
      <p>We present a Web-based tool to annotate video data for various tasks. Each annotator is required to provide a value between 1 and 10 for every 5-second segment of a video. A sample screenshot of the annotation tool is shown in Figure 3. Higher values indicate a higher importance of the specific segment in terms of the information it contains related to the topic of the video.</p>
      <p>We present a new dataset called EDUVSUM
(Educational Video Summarization) to train video
summarization methods for the educational domain. We
have collected educational videos with subtitles from three popular e-learning platforms: edX, YouTube, and the TIB AV-Portal (https://av.tib.eu/). They cover the following topics, with the corresponding number of videos in parentheses: computer science and software engineering (18), Python and Web
programming (18), machine learning and computer
vision (18), crash course on history of science and
engineering (23), and Internet of things (IoT) (21). In total,
the current version of the dataset contains 98 videos
with ground truth values annotated by the main
author who has an academic background in computer
science. In the future, we plan to provide annotation
instructions and guidance via tutorials on how to use
the software for human annotators.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Experimental Results</title>
      <sec id="sec-3-1">
        <title>In this section, we describe the experimental config</title>
        <p>urations and the obtained results. We use our newly</p>
      </sec>
      <sec id="sec-3-2">
        <title>2https://av.tib.eu/</title>
        <p>
          created dataset consisting of 98 videos for the
experimental evaluation of model architectures. The dataset
is randomly shuffled before dividing it into disjoint
train and test splits using 84.7% (83 videos) and 15.3%
(15 videos), respectively. The videos are equally
distributed among the topics of the dataset. The dataset
splits and frame sampling strategy are compliant with
previous work in the field of video summarization (Zhang
et al. [10], Gygli et al. [13] and Song et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]).
        </p>
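      <p>A minimal sketch of such a split is shown below; the fixed random seed and the placeholder video identifiers are assumptions.</p>
      <preformat>
# Minimal sketch of the random train/test split over the 98 annotated videos.
import random

videos = [f"video_{i:03d}" for i in range(98)]  # placeholder identifiers
random.Random(42).shuffle(videos)               # fixed seed is an assumption

test_videos = videos[:15]    # 15 videos, about 15.3%
train_videos = videos[15:]   # 83 videos, about 84.7%
      </preformat>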
      <p>We evaluated different configurations of model architectures as classification and regression tasks. The experimental configurations include varying visual feature extractors, history window sizes, audio features, and textual features. In our experiments, we sampled 3 frames per second in order not to include too much redundant information, since the variation between consecutive frames is low. This sampling rate corresponds to 10% of the original frame rate of the videos, which is 30 frames per second. Additionally, we analyzed the effects of multimodal information by including or excluding one of the modalities. The results are
given in Table 1. All models are trained for 50 epochs over the training split of the dataset using the Adam optimizer. To avoid over-fitting, we applied dropout of 0.2 on the BiLSTM layers. Due to the many configurations of experimental variables, we list the four best-performing models for each visual descriptor along with the respective history window sizes and the input features from specific modalities or from all of them.</p>
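      <p>The sketch below illustrates this training setup, reusing the build_model function from the architecture sketch in Section 3; the batch size and the randomly generated placeholder feature arrays are assumptions.</p>
      <preformat>
# Minimal sketch of the training configuration: 50 epochs, Adam optimizer,
# dropout of 0.2 already set inside the BiLSTM layers of build_model.
import numpy as np

model = build_model(history=3)  # fusion model sketched in Section 3

# Placeholder arrays of shape (num_frames, h + 1, feature_dim) per modality;
# labels are importance scores 1..10 shifted to class indices 0..9.
n = 256
x_text = np.random.rand(n, 4, 768).astype("float32")
x_audio = np.random.rand(n, 4, 68).astype("float32")
x_visual = np.random.rand(n, 4, 512).astype("float32")
y = np.random.randint(0, 10, size=n)

model.fit([x_text, x_audio, x_visual], y, epochs=50, batch_size=32, shuffle=True)
      </preformat>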
        <p>Each trained model outputs an importance score for
every frame in a video. We computed Top-1, Top-2 and
Top-3 accuracy on the predicted importance scores of
each frame by treating it as a classification task. The
best performing model in terms of Top-1 accuracy is VGG-16 with a history window size of 2, achieving an accuracy of 26.3, where only visual and textual features are used for training. The best model in terms of Top-2 accuracy is ResNet-50 with a history window of 3, trained on visual, audio, and textual features; it achieves an accuracy of 47.3. The best performing Top-3 model is again VGG-16, with a history window of 3 and visual and audio features, and it achieves an accuracy of 67.9.</p>
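      <p>A small sketch of the frame-level Top-k accuracy computation is given below; the randomly generated predictions merely illustrate the interface.</p>
      <preformat>
# Minimal sketch of Top-k accuracy over the 10-dimensional softmax outputs:
# a frame counts as correct if its ground-truth score is among the k classes
# with the highest predicted probability.
import numpy as np

def top_k_accuracy(probabilities, labels, k):
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = [int(labels[i] in top_k[i]) for i in range(len(labels))]
    return float(np.mean(hits))

probs = np.random.rand(1000, 10)
probs = probs / probs.sum(axis=1, keepdims=True)
labels = np.random.randint(0, 10, size=1000)
for k in (1, 2, 3):
    print(k, top_k_accuracy(probs, labels, k))
      </preformat>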
      <p>In addition, we compute the Mean Absolute Error (MAE) for each trained model by treating the problem as a regression task. Each model listed in Table 1 includes an average MAE value computed either per frame or per segment. We performed the following post-processing in order to compare the predictions against the ground truth, where every segment (5-second window) of a video has an importance score between 1 and 10. As explained above, the trained models output an importance score for each frame of a video. For the calculation of the frame-based MAE, every frame that belongs to the same segment is assigned that segment's score in the ground truth. For the calculation of the segment-based MAE, the predicted importance scores of all frames belonging to the same segment are averaged, and this average is assigned as the predicted value of the segment; the segment-based MAE is then the average absolute difference between the predicted and the ground-truth score of a segment. Based on the results presented in Table 1, the model that uses VGG-16 for visual features together with audio features and a history window of 3 yields the lowest error for both the frame-based and the segment-based calculation of the average MAE.</p>
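      <p>The following sketch illustrates the two post-processing variants on toy data; the variable names are illustrative only.</p>
      <preformat>
# Minimal sketch of the frame-based and segment-based MAE described above.
import numpy as np

def frame_mae(frame_preds, frame_segments, segment_gt):
    # Every frame inherits the ground-truth score of its segment.
    gt_per_frame = np.array([segment_gt[s] for s in frame_segments])
    return float(np.mean(np.abs(frame_preds - gt_per_frame)))

def segment_mae(frame_preds, frame_segments, segment_gt):
    # Average the frame predictions within each segment before comparing.
    errors = []
    segments = np.array(frame_segments)
    for seg, gt in enumerate(segment_gt):
        mask = segments == seg
        if mask.any():
            errors.append(abs(frame_preds[mask].mean() - gt))
    return float(np.mean(errors))

# Example: 3 segments with ground-truth scores, 15 sampled frames.
segment_gt = [3, 8, 5]
frame_segments = [0] * 5 + [1] * 5 + [2] * 5
frame_preds = np.random.uniform(1, 10, size=15)
print(frame_mae(frame_preds, frame_segments, segment_gt))
print(segment_mae(frame_preds, frame_segments, segment_gt))
      </preformat>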
      <sec id="sec-3-1">
        <title>5.1. Discussion</title>
        <p>
For a deeper analysis of errors made by the trained
models, we plot ground truth labels along with
predictions and select two videos with relatively low (left
video) and high (right video) accuracy. These plots are
shown in Figure 4. The video on the left side has low
accuracy (18%) because the predicted values are far off from the ground truth. One reason could be that the frames of this video show little visual variation, so the model predicts the same or similar values for those frames. Another reason could be that the visual features are not well suited for the educational domain, since we use models pre-trained on the ImageNet dataset, where the task is to recognize 1,000 distinct object classes.
On the other hand, the video on the right side has
relatively high accuracy (34%). Even though the
importance scores for frames are not exact, we can observe
that the model predicts lower importance scores when
ground truth values are lower, and the same pattern is observed when the importance scores increase. As shown in Table 1, the best model obtains
an error of 1.49 (MAE) on average, but it is observable
that most of the important segments (regardless of the
predicted values) are detected by the trained model.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>In this paper, we have presented an approach to
predict the importance of segments in educational videos
by fusing multimodal information. This study presents
and validates a working pipeline that consists of
lecture video annotation and, based on that, a supervised
(machine) learning task to predict importance scores
for the content throughout the video. The results show
the importance of each individual modality and
the limitations of each model configuration. It also highlights that it is not straightforward to exploit the full potential of heterogeneous sources of features, i.e., using all modalities does not guarantee a better result.</p>
      <p>One further direction of research is to enhance the
architecture for binary and ternary fusion where
modalities are fused at different levels. As a second future
direction, we will focus on the release of another
version of the dataset that covers more topics and videos.
Finally, we will investigate other types of visual
descriptors that better fit the educational domain.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>Part of this work is financially supported by the Leibniz Association, Germany (Leibniz Competition 2018, funding line "Collaborative Excellence", project SALIENT [K68/2017]).</title>
        <p>CHI Conference on Human Factors in Comput- tional Workshop on Search as Learning with
ing Systems, CHI 2020, Honolulu, HI, USA, April Multimedia Information, SALMM ’19,
Associa25-30, 2020, ACM, 2020, pp. 1–8. URL: https:// tion for Computing Machinery, New York, NY,
doi.org/10.1145/3334480.3382943. doi:10.1145/ USA, 2019, p. 11–19. URL: https://doi.org/10.
3334480.3382943. 1145/3347451.3356731. doi:10.1145/3347451.
[10] K. Zhang, W. Chao, F. Sha, K. Grauman, Video 3356731.</p>
        <p>summarization with long short-term memory, [17] H. N. K. S. YukiIchimura, Keiko Noda,
Prein: Computer Vision - ECCV 2016 - 14th Eu- scriptive analysis on instructional structure of
ropean Conference, Amsterdam, The Nether- moocs:toward attaining learning objectives for
lands, October 11-14, 2016, Proceedings, Part VII, diverse learners, The Journal of Information
volume 9911 of Lecture Notes in Computer Sci- and Systems in Education 19 N0. 1 (2019) 32–37.
ence, Springer, 2016, pp. 766–782. URL: https: doi:10.12937/ejsise.19.32.
//doi.org/10.1007/978-3-319-46478-7_47. doi:10. [18] W. Wang, D. Tran, M. Feiszli, What makes
1007/978-3-319-46478-7\_47. training multi-modal networks hard?, CoRR
[11] H. Yang, C. Meinel, Content based lecture abs/1905.12681 (2019).</p>
        <p>video retrieval using speech and video text in- [19] N. Majumder, D. Hazarika, A. F. Gelbukh,
formation, IEEE Trans. Learn. Technol. 7 (2014) E. Cambria, S. Poria, Multimodal
senti142–154. URL: https://doi.org/10.1109/TLT.2014. ment analysis using hierarchical fusion with
2307305. doi:10.1109/TLT.2014.2307305. context modeling, Knowl. Based Syst. 161
[12] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, (2018) 124–133. URL: https://doi.org/10.1016/
T. Tan, Stacked memory network for video sum- j.knosys.2018.07.041. doi:10.1016/j.knosys.
marization, in: Proceedings of the 27th ACM In- 2018.07.041.
ternational Conference on Multimedia, MM 2019, [20] K. Zhang, K. Grauman, F. Sha, Retrospective
enNice, France, October 21-25, 2019, ACM, 2019, pp. coders for video summarization, in: Computer
836–844. doi:10.1145/3343031.3350992. Vision - ECCV 2018 - 15th European Conference,
[13] M. Gygli, H. Grabner, H. Riemenschneider, Munich, Germany, September 8-14, 2018,
ProL. V. Gool, Creating summaries from user ceedings, Part VIII, volume 11212 of Lecture Notes
videos, in: Computer Vision - ECCV 2014 in Computer Science, Springer, 2018, pp. 391–408.
- 13th European Conference, Zurich, Switzer- doi:10.1007/978-3-030-01237-3\_24.
land, September 6-12, 2014, Proceedings, Part [21] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT:
VII, volume 8695 of Lecture Notes in Computer pre-training of deep bidirectional transformers
Science, Springer, 2014, pp. 505–520. URL: https: for language understanding, in: Proceedings
//doi.org/10.1007/978-3-319-10584-0_33. doi:10. of the 2019 Conference of the North
Ameri1007/978-3-319-10584-0\_33. can Chapter of the Association for
Computa[14] K. Davila, R. Zanibbi, Whiteboard video summa- tional Linguistics: Human Language
Technolorization via spatio-temporal conflict minimiza- gies, NAACL-HLT 2019, Minneapolis, MN, USA,
tion, in: 14th IAPR International Conference June 2-7, 2019, Volume 1 (Long and Short Papers),
on Document Analysis and Recognition, ICDAR Association for Computational Linguistics, 2019,
2017, Kyoto, Japan, November 9-15, 2017, IEEE, pp. 4171–4186. URL: https://doi.org/10.18653/v1/
2017, pp. 355–362. URL: https://doi.org/10.1109/ n19-1423. doi:10.18653/v1/n19-1423.</p>
        <p>ICDAR.2017.66. doi:10.1109/ICDAR.2017.66. [22] T. Giannakopoulos, pyaudioanalysis: An
open[15] F. Xu, K. Davila, S. Setlur, V. Govindaraju, Con- source python library for audio signal analysis,
tent extraction from lecture video via speaker PloS one 10 (2015).
action classification based on pose information, [23] F. Chollet, Xception: Deep learning with
depthin: 2019 International Conference on Document wise separable convolutions, in: 2017 IEEE
ConAnalysis and Recognition, ICDAR 2019, Sydney, ference on Computer Vision and Pattern
RecogAustralia, September 20-25, 2019, IEEE, 2019, pp. nition, CVPR 2017, Honolulu, HI, USA, July
1047–1054. URL: https://doi.org/10.1109/ICDAR. 21-26, 2017, IEEE Computer Society, 2017, pp.
2019.00171. doi:10.1109/ICDAR.2019.00171. 1800–1807. URL: https://doi.org/10.1109/CVPR.
[16] J. Shi, C. Otto, A. Hoppe, P. Holtz, R. Ewerth, 2017.195. doi:10.1109/CVPR.2017.195.</p>
        <p>Investigating correlations of automatically ex- [24] K. He, X. Zhang, S. Ren, J. Sun, Deep
residtracted multimodal features and lecture video ual learning for image recognition, in: 2016
quality, in: Proceedings of the 1st Interna- IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2016, Las Vegas, NV,
USA, June 27-30, 2016, IEEE Computer Society,
2016, pp. 770–778. URL: https://doi.org/10.1109/</p>
        <p>CVPR.2016.90. doi:10.1109/CVPR.2016.90.
[25] K. Simonyan, A. Zisserman, Very deep
convolutional networks for large-scale image
recognition, in: 3rd International Conference on
Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track
Proceedings, 2015. URL: http://arxiv.org/abs/1409.1556.
[26] C. Szegedy, V. Vanhoucke, S. Iofe, J. Shlens,</p>
        <p>Z. Wojna, Rethinking the inception
architecture for computer vision, in: 2016 IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2016, Las Vegas, NV, USA, June
27-30, 2016, IEEE Computer Society, 2016, pp.
2818–2826. URL: https://doi.org/10.1109/CVPR.
2016.308. doi:10.1109/CVPR.2016.308.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pardi</surname>
          </string-name>
          , J. von Hoyer,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kammerer</surname>
          </string-name>
          ,
          <article-title>The role of cognitive abilities and time spent on texts and videos in a multimodal searching as learning task</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval</source>
          , CHIIR '20,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>378</fpage>
          -
          <lpage>382</lpage>
          . URL: https://doi.org/10.1145/3343413.3378001. doi:10.1145/3343413.3378001.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kammerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ewerth</surname>
          </string-name>
          ,
          <article-title>Current challenges for studying search as learning processes</article-title>
          ,
          <source>Proceedings of Learning and Education with Web Data</source>
          , Amsterdam, Netherlands (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Moraes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Putra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauf</surname>
          </string-name>
          ,
          <article-title>Contrasting search as a learning activity with instructordesigned learning</article-title>
          ,
          <source>in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '18,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>167</fpage>
          -
          <lpage>176</lpage>
          . URL: https://doi.org/10.1145/3269206.3271676. doi:10.1145/3269206.3271676.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <article-title>How video production afects student engagement: An empirical study of mooc videos</article-title>
          ,
          <source>in: Proceedings of the first ACM conference on Learning@ scale conference</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zolotykhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mashkina</surname>
          </string-name>
          ,
          <article-title>Models of educational video implementation in massive open online courses</article-title>
          ,
          <source>in: Proceedings of the 1st International Scientific Practical Conference "The Individual and Society in the Modern Geopolitical Environment" (ISMGE</source>
          <year>2019</year>
          ), Atlantis Press,
          <year>2019</year>
          , pp.
          <fpage>567</fpage>
          -
          <lpage>571</lpage>
          . URL: https://doi.org/10.2991/ismge-19.2019.107. doi:10.2991/ismge-19.2019.107.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vallmitjana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          , TVSum:
          <article-title>Summarizing web videos using titles</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>5179</fpage>
          -
          <lpage>5187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          , et al.,
          <article-title>Predictive learning analytics using deep learning model in moocs' courses videos</article-title>
          , Springer, Educ Inf Technol (
          <year>2020</year>
          ). URL: https://doi.org/10.1007/s10639-020-10273-6. doi:10.1007/s10639-020-10273-6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Shukor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <article-title>Using learning analytics to improve MOOC instructional design</article-title>
          ,
          <source>iJET</source>
          <volume>14</volume>
          (
          <year>2019</year>
          )
          <fpage>6</fpage>
          -
          <lpage>17</lpage>
          . URL: https://www.online-journals.org/index.php/ i-jet/article/view/12185.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Supporting online video learning with concept map-based recommendation of learning path</article-title>
          ,
          <source>in: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, CHI 2020, Honolulu, HI, USA, April 25-30, 2020</source>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: https://doi.org/10.1145/3334480.3382943. doi:10.1145/3334480.3382943.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Zhang, W. Chao, F. Sha, K. Grauman, Video summarization with long short-term memory, in: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, volume 9911 of Lecture Notes in Computer Science, Springer, 2016, pp. 766-782. doi:10.1007/978-3-319-46478-7_47.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] H. Yang, C. Meinel, Content based lecture video retrieval using speech and video text information, IEEE Trans. Learn. Technol. 7 (2014) 142-154. doi:10.1109/TLT.2014.2307305.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, T. Tan, Stacked memory network for video summarization, in: Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019, ACM, 2019, pp. 836-844. doi:10.1145/3343031.3350992.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Gygli, H. Grabner, H. Riemenschneider, L. V. Gool, Creating summaries from user videos, in: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII, volume 8695 of Lecture Notes in Computer Science, Springer, 2014, pp. 505-520. doi:10.1007/978-3-319-10584-0_33.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] K. Davila, R. Zanibbi, Whiteboard video summarization via spatio-temporal conflict minimization, in: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017, IEEE, 2017, pp. 355-362. doi:10.1109/ICDAR.2017.66.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] F. Xu, K. Davila, S. Setlur, V. Govindaraju, Content extraction from lecture video via speaker action classification based on pose information, in: 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, IEEE, 2019, pp. 1047-1054. doi:10.1109/ICDAR.2019.00171.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Shi, C. Otto, A. Hoppe, P. Holtz, R. Ewerth, Investigating correlations of automatically extracted multimodal features and lecture video quality, in: Proceedings of the 1st International Workshop on Search as Learning with Multimedia Information, SALMM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 11-19. doi:10.1145/3347451.3356731.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Y. Ichimura, K. Noda, et al., Prescriptive analysis on instructional structure of MOOCs: toward attaining learning objectives for diverse learners, The Journal of Information and Systems in Education 19, No. 1 (2019) 32-37. doi:10.12937/ejsise.19.32.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] W. Wang, D. Tran, M. Feiszli, What makes training multi-modal networks hard?, CoRR abs/1905.12681 (2019).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] N. Majumder, D. Hazarika, A. F. Gelbukh, E. Cambria, S. Poria, Multimodal sentiment analysis using hierarchical fusion with context modeling, Knowl. Based Syst. 161 (2018) 124-133. doi:10.1016/j.knosys.2018.07.041.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] K. Zhang, K. Grauman, F. Sha, Retrospective encoders for video summarization, in: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, volume 11212 of Lecture Notes in Computer Science, Springer, 2018, pp. 391-408. doi:10.1007/978-3-030-01237-3_24.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171-4186. doi:10.18653/v1/n19-1423.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] T. Giannakopoulos, pyAudioAnalysis: An open-source Python library for audio signal analysis, PLoS ONE 10 (2015).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, 2017, pp. 1800-1807. doi:10.1109/CVPR.2017.195.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1409.1556.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 2818-2826. doi:10.1109/CVPR.2016.308.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>