Classification of Important Segments in Educational
Videos using Multimodal Features
Junaid Ahmed Ghauri (a), Sherzod Hakimov (a) and Ralph Ewerth (a, b)
(a) TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany
(b) L3S Research Center, Leibniz University Hannover, Germany



Abstract
Videos are a commonly used type of content for learning during Web search. Many e-learning platforms provide quality content, but sometimes educational videos are long and cover many topics. Humans are good at extracting important sections from videos, but it remains a significant challenge for computers. In this paper, we address the problem of assigning importance scores to video segments, that is, how much information they contain with respect to the overall topic of an educational video. We present an annotation tool and a new dataset of annotated educational videos collected from popular online learning platforms. Moreover, we propose a multimodal neural architecture that utilizes state-of-the-art audio, visual, and textual features. Our experiments investigate the impact of visual and temporal information, as well as the combination of multimodal features, on importance prediction.

                                    Keywords
                                    educational videos, importance prediction, video analysis, video summarization, MOOC, deep learning, e-learning


1. Introduction
In the era of e-learning, videos are one of the most important media for conveying information to learners, and they are also used intensively during informal learning on the Web [1, 2]. Many academic institutions have started to host their educational content as recordings, while various platforms like Massive Open Online Courses (MOOC) have emerged where a large part of the available educational content consists of videos. Such educational videos on MOOC platforms are also exploited in search as learning scenarios; their potential advantages compared with informal Web search have been investigated by Moraes et al. [3]. Although many platforms pay a lot of attention to the quality of the video content, the length of videos is not always considered as a major factor. Many academic institutions provide content where the whole lecture is recorded without any breaks. Such lengthy content can be difficult for learners to follow in distance learning. As mentioned by Guo et al. [4], shorter videos are more engaging than pre-recorded classroom lectures split into smaller pieces for MOOCs. Moreover, pre-planned educational videos, talking heads, illustrations using hand drawings on a board or table, and speech tempo are other key factors for engagement in a video lecture, as described by Zolotykhin and Mashkina [5].

Figure 1: Sample video with annotations of importance scores for each segment

In this paper, we introduce computational models that predict the importance of segments in (lengthy) videos. Our model architectures incorporate visual, audio, and text (transcription of audio) information to predict importance scores for each segment of an educational video. A sample video and its importance scores are shown in Figure 1. A value between 1 and 10 is assigned to each segment, indicating whether the segment refers to important information regarding the overall topic of the video. We refer to it as the importance score of video segments in the educational domain, similar to the annotations provided by the TVSum dataset [6] for various Web videos. We have developed an annotation tool that allows annotators to assign importance scores to video segments, and we have created a new dataset for this task (see Section 4). The contributions of this paper are summarized as follows:
    • Video annotation tool and an annotated dataset
    • Analysis of the influence of multimodal features and parameters (history window) for educational video summarization
    • Multimodal neural architectures for the prediction of importance scores for video segments
    • The source code of the described deep learning models, the annotation tool, and the newly created dataset are shared publicly with the research community (https://github.com/VideoAnalysis/EDUVSUM).

The remaining sections of the paper are organized as follows. Section 2 presents an overview of related work on video-based e-learning and computational architectures covering multiple modalities in the educational domain. In Section 3, we provide a detailed description of the model architectures. Section 4 presents the described annotation tool and the created dataset. Section 5 covers the experimental results and discusses the findings of the paper, and Section 6 concludes the paper.

2. Related work

Various studies have been conducted that address the quality of online education, create personalized recommendations for learners, or focus on highlighting the most important parts of lecture videos. Student interaction with lecture videos offers new opportunities to understand the performance of students or to analyze their learning progress. Recently, Mubarak et al. [7] proposed an architecture that uses features from e-learning platforms such as watch time, plays, pauses, and forward and backward jumps to train deep learning models for predictive learning analytics. In a similar way, Shukor and Abdullah [8] used watch time, clicks, and the number of completed assignments for the same purpose. Another method by Tang et al. [9] is a concept-map based approach that analyzes the transcripts of videos collected from YouTube and uses visual recommendations to improve the learning path and provide personalized content. In order to improve student performance and enhance the learning paradigm, high-tech devices are recommended for the classroom setting and content presentation. For instance, instructors or presenters can highlight important sections, which can be saved along with the video data and later be used by students when they are going through the video lectures.

Research in the field of video summarization addresses a similar problem, where important and relevant content from videos is classified to generate summaries (for instance, [10, 11] and [12]). All of these methods are based on the TVSum [6] and SumMe [13] datasets, which consist of Web videos. The nature of these datasets is very different from videos in the educational domain. These datasets can be a good source of visual features, but spoken words or textual content are relatively rare or not present at all. Inspired by video summarization work, Davila and Zanibbi [14] presented a method to detect written content in videos, e.g. on whiteboards. This research focuses on a sub-task which only takes into account lectures in which written content is available, and it also addresses only the topic of mathematics. Xu et al. [15] focused on another kind of technique where speaker pose information can help in action classification such as writing, explaining, or erasing. Here, the most important action is explaining, which could be an indication of an important segment in educational videos.

Another important aspect of e-learning is student engagement with different types of online resources. Guo et al. [4] analyzed various aspects of MOOC videos and provided a number of related recommendations. Shi et al. [16] analyzed the correlation of features and lecture quality by considering visual features from slides, linguistic elements, and audio features like energy, frequency, pitch, etc. to highlight important and emphasized statements in a lecture video. As suggested by Ichimura et al. [17], one of the best practices in MOOCs is to offer information on which parts of a lecture video are difficult or need more attention, which could potentially lead to a more flexible and personalized learning experience. In order to perform such tasks by machines, they need to incorporate multimodal information from educational content. Dealing with multimodal data is not easy, and this is also true for multimodal learning, as explained by Wang et al. [18]. If user interaction data are available for videos along with visual and textual information, then the task can be solved by multimodal deep learning models.
3. Multimodal Architecture

In this section, we describe the proposed model architecture that predicts importance scores for each video segment by fusing audio, visual, and textual features. Each video contains audio, visual, and textual (subtitles) content in three different modalities. To join the different modalities, we adapt and extend ideas from Majumder et al. [19], who apply fusion to three kinds of modalities available in videos: visual, audio, and text. The overall architecture is depicted in Fig. 2. In order to deal with the temporal aspect of videos, we use Bidirectional Long Short-Term Memory (BiLSTM) layers to incorporate information from each modality [7, 12, 20]. We use state-of-the-art pre-trained models to encode each modality in order to extract features. After the extraction of feature embeddings for each modality, they are fed into separate BiLSTM layers. The outputs of these layers are then concatenated in a time-oriented way and fed into another BiLSTM layer, which has 64 units. The output is fed into two dense layers with sizes of 32 and 16, respectively. Lastly, the output of the last dense layer is fed into a softmax layer that outputs a 10-dimensional vector indicating the importance score of a given input video frame belonging to a certain segment. In addition to the current frame, the model also includes history information that consists of the n previous frames according to the setting of the history window size parameter. Our experimental results show different configurations and the corresponding results, where we evaluate different history window sizes. Next, we describe the feature embeddings for each modality and the corresponding models used to extract them.

Figure 2: Multimodal architecture for classification of important segments in educational videos
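To make the fusion step concrete, the following is a minimal sketch of such an architecture in Keras. The layer sizes follow the description above (64 BiLSTM units per modality, a 64-unit fusion BiLSTM, dense layers of sizes 32 and 16, and a 10-way softmax); the feature dimensions, the reading of "64 units" as 64 cells per direction, and the function and parameter names are illustrative assumptions rather than the authors' released implementation (see the repository linked in Section 1 for the actual code).

```python
# A minimal sketch, assuming per-modality feature sequences of length h+1.
# Layer sizes follow the text; feature dimensions are assumptions.
from tensorflow.keras import layers, Model

def build_fusion_model(history: int = 2,
                       visual_dim: int = 4096,   # e.g. VGG-16 fc features (assumption)
                       audio_dim: int = 68,      # 34 short-term features plus deltas
                       text_dim: int = 768):     # BERT embedding size
    steps = history + 1                          # current frame plus h preceding frames
    inputs, encoded = [], []
    for name, dim in [("visual", visual_dim), ("audio", audio_dim), ("text", text_dim)]:
        x_in = layers.Input(shape=(steps, dim), name=f"{name}_features")
        # One BiLSTM per modality; return_sequences=True keeps the time axis so the
        # three streams can be concatenated in a time-oriented way.
        x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.2))(x_in)
        inputs.append(x_in)
        encoded.append(x)

    fused = layers.Concatenate(axis=-1)(encoded)                        # per time step
    fused = layers.Bidirectional(layers.LSTM(64, dropout=0.2))(fused)   # fusion BiLSTM
    fused = layers.Dense(32, activation="relu")(fused)
    fused = layers.Dense(16, activation="relu")(fused)
    out = layers.Dense(10, activation="softmax", name="importance_score")(fused)
    return Model(inputs, out)

model = build_fusion_model(history=2)
model.summary()
```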
Lastly, the output from the last dense layer is fed into window size and step size which are 0.05 and 0.025 %
a softmax layer that outputs a 10-dimensional vector of the audio track length in a video. The combination
indicating the importance score of a given input video of the rate of change of all these features yields a total
frame belonging to a certain segment. In addition to number of 68 features. We use pyAudioAnalysis [22]
the current frame, the model also includes history in- toolkit (denoted as 𝜃𝐴 ) to extract these features. These
formation that consists of 𝑛 previous frames according features are fed into a layer with 64 BiLSTM units. We
to the setting of history window size parameter. Our keep the same number of units in the BiLSTM layer of
experimental results show different configurations and all modalities.
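A possible way to obtain such contextual word vectors is sketched below with the Hugging Face transformers library; the checkpoint (bert-base-uncased), the maximum sequence length, and the handling of word pieces are assumptions, since the text only states that BERT [21] is used as θ_T.

```python
# A sketch of extracting 768-dimensional BERT embeddings (theta_T) for subtitle text
# with the Hugging Face "transformers" library (recent versions); model choice and
# token handling are assumptions, not the authors' exact pipeline.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def subtitle_embeddings(subtitle_text: str) -> torch.Tensor:
    """Return one contextual 768-dim vector per word-piece token of a subtitle line."""
    encoded = tokenizer(subtitle_text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        output = bert(**encoded)
    return output.last_hidden_state.squeeze(0)   # shape: (num_tokens, 768)

vectors = subtitle_embeddings("Gradient descent minimizes the loss function.")
print(vectors.shape)
```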
Audio Features: The audio content is utilized by means of various features that represent the zero crossing rate, energy, entropy of energy, spectral features (centroid, spread, flux, roll-off), and others. In total, there are 34 × n_a features, where n_a depends on the window size and step size, which are 0.05 and 0.025 % of the audio track length in a video, respectively. The combination of the rate of change of all these features yields a total number of 68 features. We use the pyAudioAnalysis [22] toolkit (denoted as θ_A) to extract these features. These features are fed into a layer with 64 BiLSTM units. We keep the same number of units in the BiLSTM layer of all modalities.
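The snippet below sketches this extraction step with pyAudioAnalysis; the exact window and step fractions, and the mapping of the stated percentages to seconds, are assumptions about the authors' settings.

```python
# A sketch of short-term audio feature extraction with pyAudioAnalysis. Recent
# versions expose ShortTermFeatures.feature_extraction, which returns 34 base
# features plus their deltas (68 in total) per analysis window; the way the window
# and step are derived from the track length here is an assumption.
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

def extract_audio_features(wav_path: str):
    sampling_rate, signal = audioBasicIO.read_audio_file(wav_path)
    signal = audioBasicIO.stereo_to_mono(signal)
    duration = len(signal) / float(sampling_rate)

    # Window and step sizes tied to the audio track length (see text); the exact
    # fractions used in the paper's pipeline are an assumption here.
    window = 0.0005 * duration   # 0.05 % of the track length, in seconds
    step = 0.00025 * duration    # 0.025 % of the track length, in seconds

    features, feature_names = ShortTermFeatures.feature_extraction(
        signal, sampling_rate,
        int(window * sampling_rate), int(step * sampling_rate),
        deltas=True)             # 34 features + 34 deltas = 68 rows
    return features, feature_names  # features has shape (68, n_a)
```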
Visual Features: We explored different visual models such as Xception [23], ResNet-50 [24], VGG-16 [25], and Inception-v3 [26], pre-trained on the ImageNet dataset. The visual content of the videos is encoded using one of the visual descriptors mentioned above, denoted as θ_V. Our ablation study in Section 5 provides further details on the importance of the choice of visual descriptor. Once the features are extracted, they are fed into a BiLSTM layer with a size of 64.
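As an illustration, a sampled frame could be encoded with one of these backbones as follows; the 224x224 resize, the pooled output layer, and the resulting 512-dimensional VGG-16 descriptor are assumptions, as the text does not state which layer's activations serve as θ_V.

```python
# A sketch of frame-level visual feature extraction with an ImageNet-pre-trained
# backbone (theta_V). VGG-16 with global average pooling is shown; the choice of
# output layer and input size are assumptions, not the paper's stated setting.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import vgg16

backbone = vgg16.VGG16(weights="imagenet", include_top=False, pooling="avg")

def frame_descriptor(frame_rgb: np.ndarray) -> np.ndarray:
    """Encode a single RGB video frame (H x W x 3, values 0-255) into a feature vector."""
    frame = tf.image.resize(frame_rgb.astype("float32"), (224, 224)).numpy()
    batch = vgg16.preprocess_input(frame[np.newaxis, ...])
    return backbone.predict(batch, verbose=0)[0]   # 512-dim with pooling="avg"
```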
Consider a video input of T sampled frames, i.e., V = (f_t), t = 1, ..., T, where f_t is the visual frame at point in time t. The variable T depends on the number of selected frames per second in a video. The original frame rate is 30 frames per second (fps) for a video. The input video is split into uniform segments of 5 seconds, from which we select 3 frames per second as a sampling rate. The input of the model is the current frame f_t at time step t and the preceding frames f_{t-1}, f_{t-2}, ..., f_{t-h}, according to the selected history window size h. The features for each modality are extracted as defined above and passed to the respective layers. The model outputs an importance score for the given input frame f_t.
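A hedged sketch of this sampling and history-window scheme is given below using OpenCV; the grouping of frames into 5-second segments and the pairing of each frame with its h predecessors follow the description above, while the concrete data structures are illustrative.

```python
# Frames are drawn at 3 fps from a 30 fps video, grouped into 5-second segments,
# and each training example pairs a frame f_t with its h preceding sampled frames.
# OpenCV is used only for illustration; the authors' preprocessing may differ.
import cv2

def sample_frames(video_path: str, target_fps: int = 3, segment_seconds: int = 5):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    keep_every = max(int(round(native_fps / target_fps)), 1)  # ~every 10th frame

    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % keep_every == 0:
            segment_id = int((index / native_fps) // segment_seconds)
            frames.append((segment_id, frame))
        index += 1
    cap.release()
    return frames  # list of (segment_id, frame) pairs

def history_windows(frames, h: int = 2):
    """Yield (f_t, [f_{t-1}, ..., f_{t-h}], segment_id) tuples for model input."""
    for t in range(h, len(frames)):
        segment_id, current = frames[t]
        history = [frames[t - k][1] for k in range(1, h + 1)]
        yield current, history, segment_id
```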

4. Dataset and Annotation Tool
We present a Web-based tool to annotate video data for various tasks. Each annotator is required to provide a value between 1 and 10 for every 5-second segment of a video. A sample screenshot of the annotation tool is shown in Figure 3. Higher values indicate a higher importance of that specific segment in terms of the information it includes related to the topic of the video.

Figure 3: Screenshot of the Web-based annotation tool for labeling video segments

We present a new dataset called EDUVSUM (Educational Video Summarization) to train video summarization methods for the educational domain. We have collected educational videos with subtitles from three popular e-learning platforms: edX, YouTube, and the TIB AV-Portal (https://av.tib.eu/), covering the following topics with their corresponding number of videos: computer science and software engineering (18), Python and Web programming (18), machine learning and computer vision (18), crash course on the history of science and engineering (23), and Internet of Things (IoT) (21). In total, the current version of the dataset contains 98 videos with ground truth values annotated by the main author, who has an academic background in computer science. In the future, we plan to provide annotation instructions and guidance via tutorials on how to use the software for human annotators.
5. Experimental Results

In this section, we describe the experimental configurations and the obtained results. We use our newly created dataset consisting of 98 videos for the experimental evaluation of the model architectures. The dataset is randomly shuffled before dividing it into disjoint train and test splits of 84.7% (83 videos) and 15.3% (15 videos), respectively. The videos are equally distributed among the topics of the dataset. The dataset splits and frame sampling strategy are compliant with previous work in the field of video summarization (Zhang et al. [10], Gygli et al. [13], and Song et al. [6]).

We evaluated different configurations of the model architectures as classification and regression tasks. The experimental configurations include varying visual feature extractors, history window sizes, audio features, and textual features. In our experiments, we sampled 3 frames per second in order not to include too much redundant information, where the variation between consecutive frames is low. This sampling rate corresponds to 10% of the original frame rate of the videos, which is 30 frames per second. Additionally, we analyzed the effects of multimodal information by including or excluding one of the modalities. The results are given in Table 1. All models are trained for 50 epochs over the training split of the dataset using the Adam optimizer. To avoid over-fitting, we applied dropout of 0.2 on the BiLSTM layers. Due to the many configurations of experimental variables, we listed the best performing four models for each visual descriptor along with the respective history window sizes and the input features from specific modalities or all of them.
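A hedged sketch of this training setup is shown below, reusing the build_fusion_model helper from the sketch in Section 3; the split ratio, optimizer, epoch count, and dropout follow the text, whereas the feature loader, batch size, and label encoding (scores 1-10 mapped to classes 0-9) are placeholders.

```python
# Illustrative training flow for the setup above (83/15 video split, Adam, 50 epochs,
# dropout 0.2 inside the BiLSTM layers). load_video_features is a stub standing in
# for the real multimodal feature extraction; see the public repository for the
# authors' actual training script.
import numpy as np

def load_video_features(video_id, history=2, frames_per_video=60):
    """Stub: random tensors with the shapes assumed by the fusion sketch in Section 3."""
    v = np.random.rand(frames_per_video, history + 1, 4096).astype("float32")   # visual
    a = np.random.rand(frames_per_video, history + 1, 68).astype("float32")     # audio
    t = np.random.rand(frames_per_video, history + 1, 768).astype("float32")    # text
    y = np.random.randint(0, 10, size=frames_per_video)                         # classes 0..9
    return [v, a, t], y

rng = np.random.default_rng(seed=42)
video_ids = rng.permutation(98)                       # shuffle the 98 videos
train_ids, test_ids = video_ids[:83], video_ids[83:]  # 84.7% / 15.3% disjoint splits

def stack(ids):
    feats, labels = zip(*(load_video_features(i) for i in ids))
    return [np.concatenate(m) for m in zip(*feats)], np.concatenate(labels)

x_train, y_train = stack(train_ids)
x_test, y_test = stack(test_ids)

model = build_fusion_model(history=2)                 # from the sketch in Section 3
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=50, batch_size=32)
```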
Each trained model outputs an importance score for every frame in a video. We computed Top-1, Top-2, and Top-3 accuracy on the predicted importance scores of each frame by treating the problem as a classification task. The best performing model for Top-1 accuracy is VGG-16 with a history window size of 2, achieving an accuracy of 26.3, where only visual and textual features are used for training. The model with the best Top-2 accuracy is ResNet-50 with a history window of 3, which is trained on visual, audio, and textual features and achieves an accuracy of 47.3. The best performing Top-3 model is again VGG-16, with a history window of 3 and visual and audio features, and it achieves an accuracy of 67.9.
In addition, we compute the Mean Absolute Error (MAE) values for each trained model by treating the problem as a regression task. Each model listed in Table 1 includes an average MAE value based on either each frame (avg_fra) or each segment (avg_seg). We performed the following post-processing in order to compare the values against the ground truth, where every segment (5-second window) of a video has an importance score between 1 and 10. As explained above, the trained models output an importance score for each frame in a video. For the calculation of avg_fra, every frame that belongs to the same segment is assigned the same value in the ground truth videos. For the calculation of avg_seg, the predicted importance scores of all frames belonging to the same segment are averaged; this average value is then assigned as the predicted value of the segment. The avg_seg value is thus the average MAE between the predicted importance score of a segment and the ground truth. Based on the results presented in Table 1, the model that uses VGG-16 for visual features together with audio features and a history window of 3 performs with the least error for both the frame-based and the segment-based calculation of the average MAE.

Table 1
Average accuracy and Mean Absolute Error (MAE) values for different visual descriptors and history window (h) sizes. Modalities: Visual (V), Audio (A), Textual (T); an × marks a modality that is excluded for that configuration. avg_fra stands for the average MAE based on all frames in a video, avg_seg stands for the average MAE for each segment in a video.

Visual Features | h | Top-1 (%) | Top-2 (%) | Top-3 (%) | avg_fra (MAE) | avg_seg (MAE) | V | A | T
Inception-v3    | 3 | 22.34     | 32.01     | 55.94     | 1.93          | 1.84          |   |   |
Inception-v3    | 2 | 22.34     | 30.98     | 55.94     | 1.93          | 1.84          |   |   |
Inception-v3    | 3 | 22.34     | 30.98     | 55.94     | 1.93          | 1.84          |   |   | ×
Inception-v3    | 2 | 22.34     | 47.3      | 55.94     | 1.93          | 1.84          |   |   | ×
Inception-v3    | 2 | 23.95     | 43.48     | 60.2      | 1.82          | 1.74          |   | × |
Inception-v3    | 3 | 23.48     | 44.07     | 64.29     | 1.73          | 1.66          |   | × |
VGG-16          | 1 | 22.43     | 47.29     | 66.33     | 1.92          | 1.84          |   |   |
VGG-16          | 2 | 22.37     | 37.47     | 57.92     | 1.87          | 1.81          |   |   |
VGG-16          | 3 | 25.55     | 46.19     | 67.92     | 1.51          | 1.49          |   |   | ×
VGG-16          | 2 | 22.91     | 45.08     | 58.93     | 1.83          | 1.79          |   |   | ×
VGG-16          | 2 | 26.26     | 41.92     | 63.09     | 1.6           | 1.57          |   | × |
VGG-16          | 3 | 25.65     | 41.28     | 63.21     | 1.65          | 1.62          |   | × |
Xception        | 1 | 23.1      | 39.13     | 57.33     | 1.88          | 1.8           |   |   |
Xception        | 3 | 22.34     | 30.98     | 55.94     | 1.93          | 1.84          |   |   |
Xception        | 2 | 22.72     | 47.17     | 59.74     | 1.88          | 1.8           |   |   | ×
Xception        | 1 | 22.42     | 47.2      | 67.12     | 1.86          | 1.78          |   |   | ×
Xception        | 3 | 24.04     | 37.99     | 59.76     | 1.82          | 1.74          |   | × |
Xception        | 2 | 22.65     | 44.45     | 62.39     | 1.86          | 1.78          |   | × |
ResNet-50       | 3 | 22.6      | 47.31     | 67.11     | 1.9           | 1.82          |   |   |
ResNet-50       | 2 | 22.39     | 37.03     | 57.53     | 1.92          | 1.84          |   |   |
ResNet-50       | 3 | 24.27     | 37.66     | 59.74     | 1.76          | 1.71          |   |   | ×
ResNet-50       | 2 | 22.75     | 37.25     | 57.34     | 1.85          | 1.81          |   |   | ×
ResNet-50       | 2 | 22.69     | 31.59     | 56.66     | 1.85          | 1.8           |   | × |
ResNet-50       | 1 | 22.67     | 31.61     | 57.39     | 1.81          | 1.78          |   | × |
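The evaluation just described can be summarized in the following sketch, which computes frame-level Top-k accuracy over the ten score classes as well as avg_fra and avg_seg; the array layout is an assumption, while the aggregation follows the text.

```python
# Frame-level Top-k accuracy and the two MAE aggregations described above.
# Inputs: per-frame class probabilities (num_frames, 10), per-frame ground-truth
# scores in {1, ..., 10}, and a segment id per frame (5-second windows).
import numpy as np

def top_k_accuracy(probs, gt_scores, k=1):
    top_k = np.argsort(probs, axis=1)[:, -k:] + 1        # classes 0..9 -> scores 1..10
    return float(np.mean([gt in row for gt, row in zip(gt_scores, top_k)]))

def mae_frame(pred_scores, gt_scores):
    """avg_fra: every frame carries its segment's ground-truth score."""
    return float(np.mean(np.abs(pred_scores - gt_scores)))

def mae_segment(pred_scores, gt_scores, segment_ids):
    """avg_seg: average the predicted scores of the frames in each segment first."""
    errors = []
    for seg in np.unique(segment_ids):
        mask = segment_ids == seg
        seg_pred = pred_scores[mask].mean()              # averaged frame predictions
        seg_gt = gt_scores[mask][0]                      # one label per 5-second segment
        errors.append(abs(seg_pred - seg_gt))
    return float(np.mean(errors))

# Example with random placeholders:
probs = np.random.dirichlet(np.ones(10), size=300)
pred = probs.argmax(axis=1) + 1
gt = np.random.randint(1, 11, size=300)
segments = np.arange(300) // 15                          # 15 sampled frames per segment
print(top_k_accuracy(probs, gt, k=3), mae_frame(pred, gt), mae_segment(pred, gt, segments))
```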
Figure 4: Predictions of the VGG-16 model for two videos. Left: model prediction with low accuracy (18%); right: model prediction with high accuracy (34%).

5.1. Discussion

For a deeper analysis of the errors made by the trained models, we plot ground truth labels along with predictions and select two videos with relatively low (left video) and high (right video) accuracy. These plots are shown in Figure 4. The video on the left side has low accuracy (18%) because the predicted values are far off from the ground truth. The reason could be the fact that the frames in the video have little visual variation and the model predicts the same or similar values for those frames. Another reason could be that the visual features are not well suited for the educational domain, since we use models pre-trained on the ImageNet dataset, where the task is to recognize 1000 distinct objects. On the other hand, the video on the right side has relatively high accuracy (34%). Even though the importance scores for the frames are not exact, we can observe that the model predicts lower importance scores when the ground truth values are also lower, and the same pattern is observed when the importance scores increase. As shown in Table 1, the best model obtains an error of 1.49 (MAE) on average, but it is observable that most of the important segments (regardless of the predicted values) are detected by the trained model.
6. Conclusion

In this paper, we have presented an approach to predict the importance of segments in educational videos by fusing multimodal information. This study presents and validates a working pipeline that consists of lecture video annotation and, based on that, a supervised (machine) learning task to predict importance scores for the content throughout the video. The results show the importance of each individual modality and the limitations of each model configuration. They also highlight that it is not straightforward to exploit the full potential of heterogeneous sources of features, i.e., using all modalities does not guarantee a better result.

One further direction of research is to enhance the architecture for binary and ternary fusion, where modalities are fused on different levels. As a second future direction, we will focus on the release of another version of the dataset that covers more topics and videos. Finally, we will investigate other types of visual descriptors that better fit the educational domain.

Acknowledgments

Part of this work is financially supported by the Leibniz Association, Germany (Leibniz Competition 2018, funding line "Collaborative Excellence", project SALIENT [K68/2017]).
References

 [1] G. Pardi, J. von Hoyer, P. Holtz, Y. Kammerer, The role of cognitive abilities and time spent on texts and videos in a multimodal searching as learning task, in: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, CHIIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 378–382. doi:10.1145/3343413.3378001.
 [2] A. Hoppe, P. Holtz, Y. Kammerer, R. Yu, S. Dietze, R. Ewerth, Current challenges for studying search as learning processes, in: Proceedings of Learning and Education with Web Data, Amsterdam, Netherlands, 2018.
 [3] F. Moraes, S. R. Putra, C. Hauff, Contrasting search as a learning activity with instructor-designed learning, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 167–176. doi:10.1145/3269206.3271676.
 [4] P. J. Guo, J. Kim, R. Rubin, How video production affects student engagement: An empirical study of MOOC videos, in: Proceedings of the First ACM Conference on Learning@Scale, 2014, pp. 41–50.
 [5] S. Zolotykhin, N. Mashkina, Models of educational video implementation in massive open online courses, in: Proceedings of the 1st International Scientific Practical Conference "The Individual and Society in the Modern Geopolitical Environment" (ISMGE 2019), Atlantis Press, 2019, pp. 567–571. doi:10.2991/ismge-19.2019.107.
 [6] Y. Song, J. Vallmitjana, A. Stent, A. Jaimes, TVSum: Summarizing web videos using titles, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5179–5187.
 [7] A. A. Mubarak, H. Cao, et al., Predictive learning analytics using deep learning model in MOOCs' courses videos, Education and Information Technologies, Springer, 2020. doi:10.1007/s10639-020-10273-6.
 [8] N. A. Shukor, Z. Abdullah, Using learning analytics to improve MOOC instructional design, iJET 14 (2019) 6–17. URL: https://www.online-journals.org/index.php/i-jet/article/view/12185.
 [9] C. Tang, J. Liao, H. Wang, C. Sung, Y. Cao, W. Lin, Supporting online video learning with concept map-based recommendation of learning path, in: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, CHI 2020, Honolulu, HI, USA, April 25-30, 2020, ACM, 2020, pp. 1–8. doi:10.1145/3334480.3382943.
[10] K. Zhang, W. Chao, F. Sha, K. Grauman, Video summarization with long short-term memory, in: Computer Vision – ECCV 2016 – 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, volume 9911 of Lecture Notes in Computer Science, Springer, 2016, pp. 766–782. doi:10.1007/978-3-319-46478-7_47.
[11] H. Yang, C. Meinel, Content based lecture video retrieval using speech and video text information, IEEE Transactions on Learning Technologies 7 (2014) 142–154. doi:10.1109/TLT.2014.2307305.
[12] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, T. Tan, Stacked memory network for video summarization, in: Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019, ACM, 2019, pp. 836–844. doi:10.1145/3343031.3350992.
[13] M. Gygli, H. Grabner, H. Riemenschneider, L. V. Gool, Creating summaries from user videos, in: Computer Vision – ECCV 2014 – 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII, volume 8695 of Lecture Notes in Computer Science, Springer, 2014, pp. 505–520. doi:10.1007/978-3-319-10584-0_33.
[14] K. Davila, R. Zanibbi, Whiteboard video summarization via spatio-temporal conflict minimization, in: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017, IEEE, 2017, pp. 355–362. doi:10.1109/ICDAR.2017.66.
[15] F. Xu, K. Davila, S. Setlur, V. Govindaraju, Content extraction from lecture video via speaker action classification based on pose information, in: 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, IEEE, 2019, pp. 1047–1054. doi:10.1109/ICDAR.2019.00171.
[16] J. Shi, C. Otto, A. Hoppe, P. Holtz, R. Ewerth, Investigating correlations of automatically extracted multimodal features and lecture video quality, in: Proceedings of the 1st International Workshop on Search as Learning with Multimedia Information, SALMM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 11–19. doi:10.1145/3347451.3356731.
[17] Y. Ichimura, K. Noda, et al., Prescriptive analysis on instructional structure of MOOCs: Toward attaining learning objectives for diverse learners, The Journal of Information and Systems in Education 19, No. 1 (2019) 32–37. doi:10.12937/ejsise.19.32.
[18] W. Wang, D. Tran, M. Feiszli, What makes training multi-modal networks hard?, CoRR abs/1905.12681 (2019).
[19] N. Majumder, D. Hazarika, A. F. Gelbukh, E. Cambria, S. Poria, Multimodal sentiment analysis using hierarchical fusion with context modeling, Knowledge-Based Systems 161 (2018) 124–133. doi:10.1016/j.knosys.2018.07.041.
[20] K. Zhang, K. Grauman, F. Sha, Retrospective encoders for video summarization, in: Computer Vision – ECCV 2018 – 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, volume 11212 of Lecture Notes in Computer Science, Springer, 2018, pp. 391–408. doi:10.1007/978-3-030-01237-3_24.
[21] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
[22] T. Giannakopoulos, pyAudioAnalysis: An open-source Python library for audio signal analysis, PLoS ONE 10 (2015).
[23] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, 2017, pp. 1800–1807. doi:10.1109/CVPR.2017.195.
[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1409.1556.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 2818–2826. doi:10.1109/CVPR.2016.308.