<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classification of Important Segments in Educational Videos using Multimodal Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Junaid Ahmed Ghauri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sherzod Hakimov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Ewerth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center, Leibniz University Hannover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TIB - Leibniz Information Centre for Science and Technology</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Videos are a commonly used type of content for learning during Web search. Many e-learning platforms provide quality content, but educational videos are sometimes long and cover many topics. Humans are good at extracting important sections from videos, but this remains a significant challenge for computers. In this paper, we address the problem of assigning importance scores to video segments, that is, how much information they contain with respect to the overall topic of an educational video. We present an annotation tool and a new dataset of annotated educational videos collected from popular online learning platforms. Moreover, we propose a multimodal neural architecture that utilizes state-of-the-art audio, visual, and textual features. Our experiments investigate the impact of visual and temporal information, as well as the combination of multimodal features, on importance prediction.</p>
      </abstract>
      <kwd-group>
        <kwd>educational videos</kwd>
        <kwd>importance prediction</kwd>
        <kwd>video analysis</kwd>
        <kwd>video summarization</kwd>
        <kwd>MOOC</kwd>
        <kwd>deep learning</kwd>
        <kwd>e-learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the era of e-learning, videos are one of the most important media for conveying information to learners, and they are also used intensively during informal learning on the Web [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Many academic institutions have started to host recordings of their educational content, while various platforms such as Massive Open Online Courses (MOOCs) have emerged where a large part of the available educational content consists of videos.
      </p>
      <p>
        Such educational videos on MOOC platforms are also exploited in search as learning scenarios; their potential advantages compared with informal Web search have been investigated by Moraes et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Although many platforms pay a lot of attention to the quality of the video content, the length of videos is not always considered a major factor. Many academic institutions provide content where the whole lecture is recorded without any breaks. Such lengthy content can be difficult for learners to follow in distance learning. As mentioned by Guo et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], shorter videos are more engaging than pre-recorded classroom lectures split into smaller pieces for MOOCs. Moreover, pre-planned educational videos, a talking head, illustrations using hand drawings on a board or table, and speech tempo are other key factors for engagement in a video lecture, as described by Zolotykhin and Mashkina [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In this paper, we introduce computational models that predict the importance of segments in (lengthy) videos. Our model architectures incorporate visual, audio, and text (transcription of audio) information to predict importance scores for each segment of an educational video. A sample video and its importance scores for each segment are shown in Figure 1. A value between 1 and 10 is assigned to each segment, indicating whether it contains important information regarding the overall topic of a video. We refer to this value as the importance score of a video segment in the educational domain, similar to the annotations provided by the TVSum dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] on various Web videos. We have developed an annotation tool that allows annotators to assign importance scores to video segments, and we created a new dataset for this task (see Section 4).
      </p>
      <p>The contributions of this paper are summarized as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>a video annotation tool and an annotated dataset,</p>
        </list-item>
        <list-item>
          <p>an analysis of the influence of multimodal features and parameters (history window size) on educational video summarization,</p>
        </list-item>
        <list-item>
          <p>multimodal neural architectures for the prediction of importance scores for video segments,</p>
        </list-item>
        <list-item>
          <p>the source code of the defined deep learning models, the annotation tool, and the newly created dataset, which are shared publicly with the research community (https://github.com/VideoAnalysis/EDUVSUM).</p>
        </list-item>
      </list>
      <p>The remaining sections of the paper are organized as follows. Section 2 presents an overview of related work on video-based e-learning and on computational architectures covering multiple modalities in the educational domain. In Section 3, we provide a detailed description of the model architectures. Section 4 presents the annotation tool and the created dataset. Section 5 covers the experimental results and discusses the findings, and Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related Work</title>
      <p>
        Various studies have been conducted that address the quality of online education, create personalized recommendations for learners, or focus on highlighting the most important parts of lecture videos. Student interaction with lecture videos offers new opportunities to understand the performance of students or to analyze their learning progress. Recently, Mubarak et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed an architecture that uses features from e-learning platforms, such as watch time, plays, pauses, and forward and backward jumps, to train deep learning models for predictive learning analytics. In a similar way, Shukor and Abdullah [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used watch time, clicks, and the number of completed assignments for the same purpose. Another method, by Tang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], is a concept-map based approach that analyzes the transcripts of videos collected from YouTube and provides visual recommendations to improve the learning path and offer personalized content. In order to improve student performance and enhance the learning paradigm, high-tech devices are recommended for the classroom setting and content presentation. For instance, instructors or presenters can highlight important sections, which can be saved along with the video data and later be used by students when they are going through the video lectures.
      </p>
      <p>
        Research in the field of video summarization addresses a similar problem, where important and relevant content from videos is classified to generate summaries (for instance, [10, 11] and [12]). All of these methods are based on the TVSum [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and SumMe [13] datasets, which consist of Web videos. The nature of these datasets is very different from videos in the educational domain: they can be a good source of visual features, but spoken words or textual content are relatively rare or not present at all. Inspired by video summarization work, Davila and Zanibbi [14] presented a method to detect written content in videos, e.g. on whiteboards. This research focuses on a sub-task that only takes into account lectures in which written content is available, and it addresses only the topic of mathematics. Xu et al. [15] focused on another kind of technique, where speaker pose information helps in action classification, e.g. writing, explaining, or erasing. Here, the explaining actions are the most relevant ones, which could be an indication of important segments in educational videos.
      </p>
      <p>
        Another important aspect of e-learning is student engagement with different types of online resources. Guo et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] analyzed various aspects of MOOC videos and provided a number of related recommendations. Shi et al. [16] analyzed the correlation of features and lecture quality by considering visual features from slides, linguistic elements, and audio features such as energy, frequency, and pitch to highlight important and emphasized statements in a lecture video. As suggested by Ichimura et al. [17], one of the best practices in MOOCs is to offer information on which parts of a lecture video are difficult or need more attention, which could potentially lead to a more flexible and personalized learning experience. In order to perform such tasks, machines need to incorporate multimodal information from educational content. Dealing with multimodal data is not easy, and this is also true for multimodal learning, as explained by Wang et al. [18]. If user interaction data are available for videos along with visual and textual information, then the task can be solved by multimodal deep learning models.
      </p>
    </sec>
      <sec id="sec-1-1">
        <title>3. Multimodal Architecture</title>
        <p>In this section, we describe the proposed model architecture that predicts importance scores for each video segment by fusing audio, visual, and textual features.</p>
        <p>
          Each video contains audio, visual, and textual (subtitle) content, i.e., three different modalities. To join the different modalities, we adapt and extend ideas from Majumder et al. [19], who apply fusion to the three kinds of modalities available in videos: visual, audio, and text. The overall architecture is depicted in Fig. 2. In order to deal with the temporal aspect of videos, we use Bidirectional Long Short-Term Memory (BiLSTM) layers to incorporate information from each modality [
          <xref ref-type="bibr" rid="ref7">7, 12, 20</xref>
          ]. We use state-of-the-art pre-trained models to encode each modality in order to extract features. After the extraction of the feature embeddings for each modality, they are fed into separate BiLSTM layers. The outputs of these layers are concatenated in a time-oriented way and then fed into another BiLSTM layer, which has 64 units. The output is fed into two dense layers of sizes 32 and 16, respectively. Lastly, the output of the last dense layer is fed into a softmax layer that outputs a 10-dimensional vector indicating the importance score of a given input video frame belonging to a certain segment. In addition to the current frame, the model also includes history information consisting of the h preceding frames, according to the setting of the history window size parameter. Our experiments evaluate different history window sizes and report the corresponding results. Next, we describe the feature embeddings for each modality and the corresponding models used to extract them.
        </p>
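        <p>As an illustration of this fusion scheme, the sketch below builds such a model with TensorFlow/Keras. The layer sizes (64 BiLSTM units per modality, a 64-unit fusion BiLSTM, dense layers of 32 and 16, and a 10-way softmax) follow the description above; the feature dimensions, the ReLU activations, the feature-wise concatenation at each time step, and the loss function are illustrative assumptions.</p>
        <preformat>
# Minimal sketch of the described fusion architecture (TensorFlow/Keras).
# Feature dimensions and activations other than the final softmax are assumptions.
from tensorflow.keras import layers, Model

HISTORY = 3        # history window size h
TEXT_DIM = 768     # BERT embedding size (see Textual Features)
AUDIO_DIM = 68     # pyAudioAnalysis features plus deltas (see Audio Features)
VISUAL_DIM = 512   # e.g. globally pooled VGG-16 features (assumption)

def build_model(history=HISTORY):
    steps = history + 1  # current frame plus h preceding frames
    text_in = layers.Input(shape=(steps, TEXT_DIM), name="text")
    audio_in = layers.Input(shape=(steps, AUDIO_DIM), name="audio")
    visual_in = layers.Input(shape=(steps, VISUAL_DIM), name="visual")

    # One 64-unit BiLSTM per modality, returning sequences so that the
    # outputs can be concatenated in a time-oriented way.
    branches = []
    for branch_input in (text_in, audio_in, visual_in):
        branches.append(layers.Bidirectional(
            layers.LSTM(64, return_sequences=True, dropout=0.2))(branch_input))

    fused = layers.Concatenate(axis=-1)(branches)
    fused = layers.Bidirectional(layers.LSTM(64, dropout=0.2))(fused)
    fused = layers.Dense(32, activation="relu")(fused)
    fused = layers.Dense(16, activation="relu")(fused)
    out = layers.Dense(10, activation="softmax", name="importance")(fused)

    model = Model(inputs=[text_in, audio_in, visual_in], outputs=out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
        </preformat>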
        <p>Textual Features: The textual content is based on the subtitles provided for each video. The text features are extracted by encoding the words in the subtitles using BERT (Bidirectional Encoder Representations from Transformers) [21] embeddings. BERT is a pre-trained transformer that takes the sentence context into account in order to assign a dense vector representation to each word in a sentence. The textual features are 768-dimensional vectors extracted by encoding the subtitles of the videos. Later, these features are passed to a layer with 64 BiLSTM cells.</p>
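        <p>A minimal sketch of this encoding step is given below, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint (the exact checkpoint is an assumption); it yields one 768-dimensional contextual vector per token of a subtitle line.</p>
        <preformat>
# Minimal sketch: 768-dimensional contextual BERT embeddings for a subtitle line.
# The checkpoint and the example sentence are assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

subtitle = "Gradient descent updates the model weights iteratively."
inputs = tokenizer(subtitle, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = bert(**inputs)

# One contextualized 768-dimensional vector per (sub-)word token.
token_embeddings = outputs.last_hidden_state.squeeze(0)
print(token_embeddings.shape)  # (num_tokens, 768)
        </preformat>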
        <p>Audio Features: The audio content is represented by various features such as the zero crossing rate, energy, entropy of energy, spectral features (centroid, spread, flux, roll-off), and others. In total, 34 such features are computed per analysis window, where the number of windows depends on the window and step sizes, which are 0.05 and 0.025% of the audio track length of a video. Adding the rate of change of all these features yields a total of 68 features. We use the pyAudioAnalysis [22] toolkit to extract these features. They are fed into a layer with 64 BiLSTM units; we keep the same number of units in the BiLSTM layers of all modalities.</p>
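        <p>The sketch below shows how such features can be obtained with pyAudioAnalysis, assuming the audio track has already been exported to a WAV file; the 50 ms window and 25 ms step used here are common settings for the toolkit and are assumptions rather than the exact values stated above.</p>
        <preformat>
# Minimal sketch of short-term audio feature extraction with pyAudioAnalysis.
# The file name and the window/step settings are assumptions.
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

sampling_rate, signal = audioBasicIO.read_audio_file("lecture_audio.wav")
signal = audioBasicIO.stereo_to_mono(signal)

# 34 short-term features per window; deltas=True adds their rate of change,
# which yields the 68 features mentioned above.
features, feature_names = ShortTermFeatures.feature_extraction(
    signal, sampling_rate,
    window=0.050 * sampling_rate,
    step=0.025 * sampling_rate,
    deltas=True)

print(len(feature_names), features.shape)  # 68, (68, num_windows)
        </preformat>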
        <p>Visual Features: We explored different visual models, namely Xception [23], ResNet-50 [24], VGG-16 [25], and Inception-v3 [26], all pre-trained on the ImageNet dataset. The visual content of the videos is encoded using one of these visual descriptors. Our ablation study in Section 5 provides further details on the importance of the choice of visual descriptor. Once the features are extracted, they are fed into a BiLSTM layer with a size of 64.</p>
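        <p>A minimal sketch of encoding a single sampled frame with an ImageNet-pretrained backbone (VGG-16 as an example) is shown below; which network layer the descriptor is taken from is not specified above, so globally averaged convolutional features and the frame file name are assumptions.</p>
        <preformat>
# Minimal sketch: encode one sampled frame with an ImageNet-pretrained CNN.
# Global average pooling over the last conv block is an assumption.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")

frame = image.load_img("frame_0001.jpg", target_size=(224, 224))
x = image.img_to_array(frame)
x = preprocess_input(np.expand_dims(x, axis=0))

visual_descriptor = backbone.predict(x)[0]  # 512-dimensional for VGG-16
print(visual_descriptor.shape)
        </preformat>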
        <p>Consider a video input of T sampled frames, i.e., V = (v<sub>t</sub>)<sub>t=1,...,T</sub>, where v<sub>t</sub> is the visual frame at point in time t. The variable T depends on the number of selected frames per second of a video. The original frame rate of a video is 30 frames per second (fps). The input video is split into uniform segments of 5 seconds, from which we select 3 frames per second as the sampling rate. The inputs of the model are the current frame v<sub>t</sub> at time step t and the preceding frames v<sub>t-1</sub>, v<sub>t-2</sub>, ..., v<sub>t-h</sub>, according to the selected history window size h. The features of each modality are extracted as defined above and passed to the respective layers. The model outputs an importance score for the given input frame v<sub>t</sub>.</p>
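        <p>The sketch below illustrates this sampling and history-window scheme; the use of OpenCV and the file name are assumptions made for illustration.</p>
        <preformat>
# Minimal sketch of the frame sampling described above: keep 3 frames per
# second of a 30 fps video and group them into 5-second segments, then build
# inputs consisting of the current frame and its h preceding frames.
import cv2

def sample_frames(path, keep_per_second=3, segment_seconds=5):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    stride = int(round(fps / keep_per_second))   # every 10th frame at 30 fps
    frames, segment_ids = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
            segment_ids.append(int(idx / (fps * segment_seconds)))
        idx += 1
    cap.release()
    return frames, segment_ids

def history_window(frames, t, h):
    # Current frame plus the h preceding sampled frames (shorter at the start).
    start = max(0, t - h)
    return frames[start:t + 1]

frames, segment_ids = sample_frames("lecture.mp4")
window = history_window(frames, t=10, h=3)
        </preformat>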
      </sec>
    <sec id="sec-2">
      <title>4. Dataset and Annotation Tool</title>
      <p>We present a Web-based tool to annotate video data for various tasks. Each annotator is required to provide a value between 1 and 10 for every 5-second segment of a video. A sample screenshot of the annotation tool is shown in Figure 3. Higher values indicate a higher importance of the specific segment in terms of the information it contains related to the topic of the video.</p>
      <p>We present a new dataset called EDUVSUM
(Educational Video Summarization) to train video
summarization methods for the educational domain. We
have collected educational videos with subtitles from three popular e-learning platforms: edX, YouTube, and the TIB AV-Portal (https://av.tib.eu/). They cover the following topics, with the corresponding number of videos in parentheses: computer science and software engineering (18), Python and Web
programming (18), machine learning and computer
vision (18), crash course on history of science and
engineering (23), and Internet of things (IoT) (21). In total,
the current version of the dataset contains 98 videos
with ground truth values annotated by the main
author who has an academic background in computer
science. In the future, we plan to provide annotation
instructions and guidance via tutorials on how to use
the software for human annotators.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Experimental Results</title>
      <sec id="sec-3-1">
        <title>In this section, we describe the experimental config</title>
        <p>urations and the obtained results. We use our newly</p>
      </sec>
      <sec id="sec-3-2">
        <title>2https://av.tib.eu/</title>
        <p>
          created dataset consisting of 98 videos for the
experimental evaluation of model architectures. The dataset
is randomly shuffled before dividing it into disjoint
train and test splits using 84.7% (83 videos) and 15.3%
(15 videos), respectively. The videos are equally
distributed among the topics of the dataset. The dataset
splits and frame sampling strategy are compliant with
previous work in the field of video summarization (Zhang
et al. [10], Gygli et al. [13] and Song et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]).
        </p>
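      <p>A minimal sketch of such a split is shown below; the fixed random seed and the placeholder video identifiers are assumptions.</p>
      <preformat>
# Minimal sketch of the random train/test split over the 98 annotated videos.
import random

videos = [f"video_{i:03d}" for i in range(98)]  # placeholder identifiers
random.Random(42).shuffle(videos)               # fixed seed is an assumption

test_videos = videos[:15]    # 15 videos, about 15.3%
train_videos = videos[15:]   # 83 videos, about 84.7%
      </preformat>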
      <p>We evaluated different configurations of model architectures as classification and regression tasks. The experimental configurations include varying visual feature extractors, history window sizes, audio features, and textual features. In our experiments, we sampled 3 frames per second in order not to include too much redundant information, since the variation between consecutive frames is low. This sampling rate corresponds to 10% of the original frame rate of the videos, which is 30 frames per second. Additionally, we analyzed the effects of multimodal information by including or excluding one of the modalities. The results are
given in Table 1. All models are trained for 50 epochs over the training split of the dataset using the Adam optimizer. To avoid over-fitting, we applied dropout of 0.2 on the BiLSTM layers. Due to the many configurations of experimental variables, we list the four best-performing models for each visual descriptor along with the respective history window sizes and the input features from specific modalities or from all of them.</p>
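      <p>The sketch below illustrates this training setup, reusing the build_model function from the architecture sketch in Section 3; the batch size and the randomly generated placeholder feature arrays are assumptions.</p>
      <preformat>
# Minimal sketch of the training configuration: 50 epochs, Adam optimizer,
# dropout of 0.2 already set inside the BiLSTM layers of build_model.
import numpy as np

model = build_model(history=3)  # fusion model sketched in Section 3

# Placeholder arrays of shape (num_frames, h + 1, feature_dim) per modality;
# labels are importance scores 1..10 shifted to class indices 0..9.
n = 256
x_text = np.random.rand(n, 4, 768).astype("float32")
x_audio = np.random.rand(n, 4, 68).astype("float32")
x_visual = np.random.rand(n, 4, 512).astype("float32")
y = np.random.randint(0, 10, size=n)

model.fit([x_text, x_audio, x_visual], y, epochs=50, batch_size=32, shuffle=True)
      </preformat>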
        <p>Each trained model outputs an importance score for
every frame in a video. We computed Top-1, Top-2 and
Top-3 accuracy on the predicted importance scores of
each frame by treating it as a classification task. The
best performing model in terms of Top-1 accuracy is VGG-16 with a history window size of 2, achieving an accuracy of 26.3, where only visual and textual features are used for training. The best model in terms of Top-2 accuracy is ResNet-50 with a history window of 3, trained on visual, audio, and textual features; it achieves an accuracy of 47.3. The best performing Top-3 model is again VGG-16, with a history window of 3 and visual and audio features, and it achieves an accuracy of 67.9.</p>
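      <p>A small sketch of the frame-level Top-k accuracy computation is given below; the randomly generated predictions merely illustrate the interface.</p>
      <preformat>
# Minimal sketch of Top-k accuracy over the 10-dimensional softmax outputs:
# a frame counts as correct if its ground-truth score is among the k classes
# with the highest predicted probability.
import numpy as np

def top_k_accuracy(probabilities, labels, k):
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = [int(labels[i] in top_k[i]) for i in range(len(labels))]
    return float(np.mean(hits))

probs = np.random.rand(1000, 10)
probs = probs / probs.sum(axis=1, keepdims=True)
labels = np.random.randint(0, 10, size=1000)
for k in (1, 2, 3):
    print(k, top_k_accuracy(probs, labels, k))
      </preformat>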
      <p>In addition, we compute the Mean Absolute Error (MAE) for each trained model by treating the problem as a regression task. Each model listed in Table 1 includes an average MAE value computed either per frame or per segment. We performed the following post-processing in order to compare the predictions against the ground truth, where every segment (5-second window) of a video has an importance score between 1 and 10. As explained above, the trained models output an importance score for each frame of a video. For the calculation of the frame-based MAE, every frame that belongs to the same segment is assigned that segment's score in the ground truth. For the calculation of the segment-based MAE, the predicted importance scores of all frames belonging to the same segment are averaged, and this average is assigned as the predicted value of the segment; the segment-based MAE is then the average absolute difference between the predicted and the ground-truth score of a segment. Based on the results presented in Table 1, the model that uses VGG-16 for visual features together with audio features and a history window of 3 yields the lowest error for both the frame-based and the segment-based calculation of the average MAE.</p>
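      <p>The following sketch illustrates the two post-processing variants on toy data; the variable names are illustrative only.</p>
      <preformat>
# Minimal sketch of the frame-based and segment-based MAE described above.
import numpy as np

def frame_mae(frame_preds, frame_segments, segment_gt):
    # Every frame inherits the ground-truth score of its segment.
    gt_per_frame = np.array([segment_gt[s] for s in frame_segments])
    return float(np.mean(np.abs(frame_preds - gt_per_frame)))

def segment_mae(frame_preds, frame_segments, segment_gt):
    # Average the frame predictions within each segment before comparing.
    errors = []
    segments = np.array(frame_segments)
    for seg, gt in enumerate(segment_gt):
        mask = segments == seg
        if mask.any():
            errors.append(abs(frame_preds[mask].mean() - gt))
    return float(np.mean(errors))

# Example: 3 segments with ground-truth scores, 15 sampled frames.
segment_gt = [3, 8, 5]
frame_segments = [0] * 5 + [1] * 5 + [2] * 5
frame_preds = np.random.uniform(1, 10, size=15)
print(frame_mae(frame_preds, frame_segments, segment_gt))
print(segment_mae(frame_preds, frame_segments, segment_gt))
      </preformat>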
      <sec id="sec-3-1">
        <title>5.1. Discussion</title>
        <p>
For a deeper analysis of errors made by the trained
models, we plot ground truth labels along with
predictions and select two videos with relatively low (left
video) and high (right video) accuracy. These plots are
shown in Figure 4. The video on the left side has low
accuracy (18%) because the predicted values are far off from the ground truth. One reason could be that the frames of this video show little visual variation, so the model predicts the same or similar values for those frames. Another reason could be that the visual features are not well suited for the educational domain, since we use models pre-trained on the ImageNet dataset, where the task is to recognize 1,000 distinct object classes.
On the other hand, the video on the right side has
relatively high accuracy (34%). Even though the
importance scores for frames are not exact, we can observe
that the model predicts lower importance scores when
ground truth values are lower, and the same pattern is observed when the importance scores increase. As shown in Table 1, the best model obtains
an error of 1.49 (MAE) on average, but it is observable
that most of the important segments (regardless of the
predicted values) are detected by the trained model.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>In this paper, we have presented an approach to
predict the importance of segments in educational videos
by fusing multimodal information. This study presents
and validates a working pipeline that consists of
lecture video annotation and, based on that, a supervised
(machine) learning task to predict importance scores
for the content throughout the video. The results show
the importance of each individual modality and
the limitations of each model configuration. It also highlights that it is not straightforward to exploit the full potential of heterogeneous sources of features, i.e., using all modalities does not guarantee a better result.</p>
      <p>One further direction of research is to enhance the
architecture for binary and ternary fusion where
modalities are fused at different levels. As a second future
direction, we will focus on the release of another
version of the dataset that covers more topics and videos.
Finally, we will investigate other types of visual
descriptors that better fit the educational domain.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>Part of this work is financially supported by the Leibniz Association, Germany (Leibniz Competition 2018, funding line "Collaborative Excellence", project SALIENT [K68/2017]).</title>
        <p>CHI Conference on Human Factors in Comput- tional Workshop on Search as Learning with
ing Systems, CHI 2020, Honolulu, HI, USA, April Multimedia Information, SALMM ’19,
Associa25-30, 2020, ACM, 2020, pp. 1–8. URL: https:// tion for Computing Machinery, New York, NY,
doi.org/10.1145/3334480.3382943. doi:10.1145/ USA, 2019, p. 11–19. URL: https://doi.org/10.
3334480.3382943. 1145/3347451.3356731. doi:10.1145/3347451.
[10] K. Zhang, W. Chao, F. Sha, K. Grauman, Video 3356731.</p>
        <p>summarization with long short-term memory, [17] H. N. K. S. YukiIchimura, Keiko Noda,
Prein: Computer Vision - ECCV 2016 - 14th Eu- scriptive analysis on instructional structure of
ropean Conference, Amsterdam, The Nether- moocs:toward attaining learning objectives for
lands, October 11-14, 2016, Proceedings, Part VII, diverse learners, The Journal of Information
volume 9911 of Lecture Notes in Computer Sci- and Systems in Education 19 N0. 1 (2019) 32–37.
ence, Springer, 2016, pp. 766–782. URL: https: doi:10.12937/ejsise.19.32.
//doi.org/10.1007/978-3-319-46478-7_47. doi:10. [18] W. Wang, D. Tran, M. Feiszli, What makes
1007/978-3-319-46478-7\_47. training multi-modal networks hard?, CoRR
[11] H. Yang, C. Meinel, Content based lecture abs/1905.12681 (2019).</p>
        <p>video retrieval using speech and video text in- [19] N. Majumder, D. Hazarika, A. F. Gelbukh,
formation, IEEE Trans. Learn. Technol. 7 (2014) E. Cambria, S. Poria, Multimodal
senti142–154. URL: https://doi.org/10.1109/TLT.2014. ment analysis using hierarchical fusion with
2307305. doi:10.1109/TLT.2014.2307305. context modeling, Knowl. Based Syst. 161
[12] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, (2018) 124–133. URL: https://doi.org/10.1016/
T. Tan, Stacked memory network for video sum- j.knosys.2018.07.041. doi:10.1016/j.knosys.
marization, in: Proceedings of the 27th ACM In- 2018.07.041.
ternational Conference on Multimedia, MM 2019, [20] K. Zhang, K. Grauman, F. Sha, Retrospective
enNice, France, October 21-25, 2019, ACM, 2019, pp. coders for video summarization, in: Computer
836–844. doi:10.1145/3343031.3350992. Vision - ECCV 2018 - 15th European Conference,
[13] M. Gygli, H. Grabner, H. Riemenschneider, Munich, Germany, September 8-14, 2018,
ProL. V. Gool, Creating summaries from user ceedings, Part VIII, volume 11212 of Lecture Notes
videos, in: Computer Vision - ECCV 2014 in Computer Science, Springer, 2018, pp. 391–408.
- 13th European Conference, Zurich, Switzer- doi:10.1007/978-3-030-01237-3\_24.
land, September 6-12, 2014, Proceedings, Part [21] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT:
VII, volume 8695 of Lecture Notes in Computer pre-training of deep bidirectional transformers
Science, Springer, 2014, pp. 505–520. URL: https: for language understanding, in: Proceedings
//doi.org/10.1007/978-3-319-10584-0_33. doi:10. of the 2019 Conference of the North
Ameri1007/978-3-319-10584-0\_33. can Chapter of the Association for
Computa[14] K. Davila, R. Zanibbi, Whiteboard video summa- tional Linguistics: Human Language
Technolorization via spatio-temporal conflict minimiza- gies, NAACL-HLT 2019, Minneapolis, MN, USA,
tion, in: 14th IAPR International Conference June 2-7, 2019, Volume 1 (Long and Short Papers),
on Document Analysis and Recognition, ICDAR Association for Computational Linguistics, 2019,
2017, Kyoto, Japan, November 9-15, 2017, IEEE, pp. 4171–4186. URL: https://doi.org/10.18653/v1/
2017, pp. 355–362. URL: https://doi.org/10.1109/ n19-1423. doi:10.18653/v1/n19-1423.</p>
        <p>ICDAR.2017.66. doi:10.1109/ICDAR.2017.66. [22] T. Giannakopoulos, pyaudioanalysis: An
open[15] F. Xu, K. Davila, S. Setlur, V. Govindaraju, Con- source python library for audio signal analysis,
tent extraction from lecture video via speaker PloS one 10 (2015).
action classification based on pose information, [23] F. Chollet, Xception: Deep learning with
depthin: 2019 International Conference on Document wise separable convolutions, in: 2017 IEEE
ConAnalysis and Recognition, ICDAR 2019, Sydney, ference on Computer Vision and Pattern
RecogAustralia, September 20-25, 2019, IEEE, 2019, pp. nition, CVPR 2017, Honolulu, HI, USA, July
1047–1054. URL: https://doi.org/10.1109/ICDAR. 21-26, 2017, IEEE Computer Society, 2017, pp.
2019.00171. doi:10.1109/ICDAR.2019.00171. 1800–1807. URL: https://doi.org/10.1109/CVPR.
[16] J. Shi, C. Otto, A. Hoppe, P. Holtz, R. Ewerth, 2017.195. doi:10.1109/CVPR.2017.195.</p>
        <p>Investigating correlations of automatically ex- [24] K. He, X. Zhang, S. Ren, J. Sun, Deep
residtracted multimodal features and lecture video ual learning for image recognition, in: 2016
quality, in: Proceedings of the 1st Interna- IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2016, Las Vegas, NV,
USA, June 27-30, 2016, IEEE Computer Society,
2016, pp. 770–778. URL: https://doi.org/10.1109/</p>
        <p>CVPR.2016.90. doi:10.1109/CVPR.2016.90.
[25] K. Simonyan, A. Zisserman, Very deep
convolutional networks for large-scale image
recognition, in: 3rd International Conference on
Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track
Proceedings, 2015. URL: http://arxiv.org/abs/1409.1556.
[26] C. Szegedy, V. Vanhoucke, S. Iofe, J. Shlens,</p>
        <p>Z. Wojna, Rethinking the inception
architecture for computer vision, in: 2016 IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2016, Las Vegas, NV, USA, June
27-30, 2016, IEEE Computer Society, 2016, pp.
2818–2826. URL: https://doi.org/10.1109/CVPR.
2016.308. doi:10.1109/CVPR.2016.308.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pardi</surname>
          </string-name>
          , J. von Hoyer,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kammerer</surname>
          </string-name>
          ,
          <article-title>The role of cognitive abilities and time spent on texts and videos in a multimodal searching as learning task</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval</source>
          , CHIIR '20,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>378</fpage>
          -
          <lpage>382</lpage>
          . URL: https://doi.org/10.1145/3343413.3378001. doi:10.1145/3343413.3378001.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kammerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ewerth</surname>
          </string-name>
          ,
          <article-title>Current challenges for studying search as learning processes</article-title>
          ,
          <source>Proceedings of Learning and Education with Web Data</source>
          , Amsterdam, Netherlands (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Moraes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Putra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauf</surname>
          </string-name>
          ,
          <article-title>Contrasting search as a learning activity with instructordesigned learning</article-title>
          ,
          <source>in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '18,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>167</fpage>
          -
          <lpage>176</lpage>
          . URL: https://doi.org/10.1145/3269206.3271676. doi:10.1145/3269206.3271676.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <article-title>How video production afects student engagement: An empirical study of mooc videos</article-title>
          ,
          <source>in: Proceedings of the first ACM conference on Learning@ scale conference</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zolotykhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mashkina</surname>
          </string-name>
          ,
          <article-title>Models of educational video implementation in massive open online courses</article-title>
          ,
          <source>in: Proceedings of the 1st International Scientific Practical Conference "The Individual and Society in the Modern Geopolitical Environment" (ISMGE</source>
          <year>2019</year>
          ), Atlantis Press,
          <year>2019</year>
          , pp.
          <fpage>567</fpage>
          -
          <lpage>571</lpage>
          . URL: https://doi.org/10.2991/ismge-19.2019.107. doi:10.2991/ismge-19.2019.107.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vallmitjana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          , TVSum:
          <article-title>Summarizing web videos using titles</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>5179</fpage>
          -
          <lpage>5187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          , et al.,
          <article-title>Predictive learning analytics using deep learning model in moocs' courses videos</article-title>
          , Springer, Educ Inf Technol (
          <year>2020</year>
          ). URL: https://doi.org/10.1007/s10639-020-10273-6. doi:10.1007/s10639-020-10273-6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Shukor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <article-title>Using learning analytics to improve MOOC instructional design</article-title>
          ,
          <source>iJET</source>
          <volume>14</volume>
          (
          <year>2019</year>
          )
          <fpage>6</fpage>
          -
          <lpage>17</lpage>
          . URL: https://www.online-journals.org/index.php/ i-jet/article/view/12185.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Supporting online video learning with concept map-based recommendation of learning path</article-title>
          ,
          <source>in: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, CHI 2020, Honolulu, HI, USA, April 25-30, 2020</source>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: https://doi.org/10.1145/3334480.3382943. doi:10.1145/3334480.3382943.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Zhang, W. Chao, F. Sha, K. Grauman, Video summarization with long short-term memory, in: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, volume 9911 of Lecture Notes in Computer Science, Springer, 2016, pp. 766-782. doi:10.1007/978-3-319-46478-7_47.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] H. Yang, C. Meinel, Content based lecture video retrieval using speech and video text information, IEEE Trans. Learn. Technol. 7 (2014) 142-154. doi:10.1109/TLT.2014.2307305.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, T. Tan, Stacked memory network for video summarization, in: Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019, ACM, 2019, pp. 836-844. doi:10.1145/3343031.3350992.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Gygli, H. Grabner, H. Riemenschneider, L. V. Gool, Creating summaries from user videos, in: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII, volume 8695 of Lecture Notes in Computer Science, Springer, 2014, pp. 505-520. doi:10.1007/978-3-319-10584-0_33.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] K. Davila, R. Zanibbi, Whiteboard video summarization via spatio-temporal conflict minimization, in: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017, IEEE, 2017, pp. 355-362. doi:10.1109/ICDAR.2017.66.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] F. Xu, K. Davila, S. Setlur, V. Govindaraju, Content extraction from lecture video via speaker action classification based on pose information, in: 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, IEEE, 2019, pp. 1047-1054. doi:10.1109/ICDAR.2019.00171.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Shi, C. Otto, A. Hoppe, P. Holtz, R. Ewerth, Investigating correlations of automatically extracted multimodal features and lecture video quality, in: Proceedings of the 1st International Workshop on Search as Learning with Multimedia Information, SALMM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 11-19. doi:10.1145/3347451.3356731.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Y. Ichimura, K. Noda, et al., Prescriptive analysis on instructional structure of MOOCs: toward attaining learning objectives for diverse learners, The Journal of Information and Systems in Education 19, No. 1 (2019) 32-37. doi:10.12937/ejsise.19.32.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] W. Wang, D. Tran, M. Feiszli, What makes training multi-modal networks hard?, CoRR abs/1905.12681 (2019).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] N. Majumder, D. Hazarika, A. F. Gelbukh, E. Cambria, S. Poria, Multimodal sentiment analysis using hierarchical fusion with context modeling, Knowl. Based Syst. 161 (2018) 124-133. doi:10.1016/j.knosys.2018.07.041.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] K. Zhang, K. Grauman, F. Sha, Retrospective encoders for video summarization, in: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, volume 11212 of Lecture Notes in Computer Science, Springer, 2018, pp. 391-408. doi:10.1007/978-3-030-01237-3_24.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171-4186. doi:10.18653/v1/n19-1423.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] T. Giannakopoulos, pyAudioAnalysis: An open-source Python library for audio signal analysis, PLoS ONE 10 (2015).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, 2017, pp. 1800-1807. doi:10.1109/CVPR.2017.195.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1409.1556.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 2818-2826. doi:10.1109/CVPR.2016.308.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>