=Paper=
{{Paper
|id=Vol-2699/paper15
|storemode=property
|title=Classification of Important Segments in Educational Videos using Multimodal Features
|pdfUrl=https://ceur-ws.org/Vol-2699/paper15.pdf
|volume=Vol-2699
|authors=Junaid Ahmed Ghauri,Sherzod Hakimov,Ralph Ewerth
|dblpUrl=https://dblp.org/rec/conf/cikm/GhauriHE20
}}
==Classification of Important Segments in Educational Videos using Multimodal Features==
Classification of Important Segments in Educational Videos using Multimodal Features Junaid Ahmed Ghauria , Sherzod Hakimova and Ralph Ewertha,b a TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany b L3S Research Center, Leibniz University Hannover, Germany Abstract Videos are a commonly-used type of content in learning during Web search. Many e-learning platforms provide quality content, but sometimes educational videos are long and cover many topics. Humans are good in extracting important sec- tions from videos, but it remains a significant challenge for computers. In this paper, we address the problem of assigning importance scores to video segments, that is how much information they contain with respect to the overall topic of an educational video. We present an annotation tool and a new dataset of annotated educational videos collected from popular online learning platforms. Moreover, we propose a multimodal neural architecture that utilizes state-of-the-art audio, visual and textual features. Our experiments investigate the impact of visual and temporal information, as well as the combination of multimodal features on importance prediction. Keywords educational videos, importance prediction, video analysis, video summarization, MOOC, deep learning, e-learning 1. Introduction In the era of e-learning, videos are one of the most important medium to convey information for learn- ers, being also intensively used during informal learn- ing on the Web [1, 2]. Many academic institutions started to host their educational content with record- ings while various platforms like Massive Open On- line Courses (MOOC) have emerged where a large part of the available educational content consists of videos. Such educational videos on MOOC platforms are also Figure 1: Sample video with annotations of importance exploited in search as learning scenarios, their poten- scores for each segment tial advantages compared with informal Web search have been investigated by Moraes et al. [3]. Although many platforms pay a lot of attention to the quality of the video content, the length of videos is not al- tempo are other key factors for engagement in a video ways considered as a major factor. Many academic lecture as described by Zolotykhin and Mashkina [5]. institutions provide content where the whole lecture In this paper, we introduce computational models is recorded without any breaks. Such lengthy content that predict the importance of segments in (lengthy) can be difficult for learners to follow in distant learn- videos. Our model architectures incorporate visual, ing. As mentioned by Guo et al. [4] shorter videos are audio, and text (transcription of audio) information to more engaging in contrast to pre-recorded classroom predict importance scores for each segment of an ed- lectures split into smaller pieces for MOOC. Moreover, ucational video. A sample video and its importance pre-planned educational videos, talking head, illustra- scores are shown in Figure 1. A value between 1 and tions using hand drawings on board or table, and speech 10 is assigned to each segment indicating the score of a specific segment whether it refers to an impor- Proceedings of the CIKM 2020 Workshops, October 19–20, Galway, tant information regarding the overall topic of a video. Ireland email: junaid.ghauri@tib.eu (J.A. Ghauri); We refer to it as the importance score of video seg- sherzod.hakimov@tib.eu (S. Hakimov); ralph.ewerth@tib.eu (R. ments in educational domain, similar to the annota- Ewerth) tions provided by TVSum dataset [6] on various Web orcid: 0000-0001-9248-5444 (J.A. Ghauri); 0000-0002-7421-6213 (S. videos. We have developed an annotation tool that al- Hakimov); 0000-0003-0918-6297 (R. Ewerth) © 2020 Copyright for this paper by its authors. Use permitted under Creative lows annotators to assign importance scores to video Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop CEUR Workshop Proceedings (CEUR-WS.org) Proceedings http://ceur-ws.org ISSN 1613-0073 segments and created a new dataset for this task (see Section 4). The contributions of this paper are summa- students when they are going through the video lec- rized as follows: tures. Research in the field of video summarization addresses • Video annotation tool and an annotated dataset a similar problem, where important and relevant con- • Analysis of influence of multimodal features and tent from videos is classified to generate summaries parameters (history window) for educational video (for instance, [10, 11] and [12]). All of these methods summarization are based on TVSum [6] and SumMe [13] datasets that consist of Web videos. The nature of these datasets is • Multimodal neural architectures for the predic- very different to videos from the educational domain. tion of importance scores for video segments These datasets can be a good source of visual features but spoken words or textual content are relatively rare • The source code for defined the deep learning or not present at all. Inspired from video summariza- models, the annotation tool and the newly cre- tion work, Davila and Zanibbi [14] presented a method ated dataset are shared publicly1 with the re- to detect written content in videos, e.g. on whiteboards. search community. This research focuses on a sub-task which only takes The remaining sections of the paper are organized into account the lectures in which the written con- as follows. Section 2 presents an overview of related tent is available, and also addresses only the topic of work in video-based e-learning and computational ar- mathematics. Xu et al. [15] focused on another kind of chitectures covering multiple modalities in educational technique where speaker pose information can help in domain. In Section 3, we provide detailed description action classification like writing, explaining, or eras- of model architectures. Section 4 presents the described ing. Here, the most important segments are explaining, annotation tool and the created dataset. Section 5 cov- which could be an indication of an important segment ers the experimental results and discussions on the find- in educational videos. ings of the paper and Section 6 concludes the papers. Another important aspect of e-learning is student engagement for different types of online resources. Guo et al. [4] analyzed various aspects for MOOC videos 2. Related work and provided a number of related recommendations. Shi et al. [16] analyzed the correlation of features and Various studies have been conducted that address the lecture quality by considering visual features from slides, quality of online education, create personalized rec- linguistic elements and audio features like energy, fre- ommendations for learners, or focus on highlighting quency, pitch, etc. to highlight important and empha- the most important parts in lecture videos. Student in- sized statements in a lecture video. As suggested by teraction with lecture videos offers new opportunities YukiIchimura [17], one of the best practices in MOOCs to understand the performance of students or for the is to offer information on which parts of a lecture video analysis of their learning progress. Recently, Mubarak are difficult or need more attention, which could po- et al. [7] proposed an architecture that uses features tentially lead to a more flexible and personalized learn- from e-learning platforms such as watch time, plays, ing experience. In order to perform such tasks by ma- pauses, forward and backward to train deep learning chines, they need to incorporate multimodal informa- models for predictive learning analytics. In a similar tion from educational content. To deal with multi- way, Shukor and Abdullah [8] used watch time, clicks, modal data is not easy and this is also true for mul- completed number of assignments for the same pur- timodal learning, as explained by Wang et al. [18]. If pose. Another method by Tang et al. [9] is a concept- user interaction data are available for videos along with map based approach that analyzes the transcripts of visual, textual information, then the task can be solved videos collected from YouTube and visual recommen- by multimodal deep learning models. dations to improve learning path and provide person- alized content. In order to improve student perfor- mance and enhance the learning paradigm, high-tech 3. Multimodal Architecture devices are recommended for the classroom setting and content presentation. For instance, instructors or pre- In this section, we describe the proposed model archi- senters can highlight important sections which can be tecture that predicts importance scores for each video saved along with the video data and later be used by segment by fusing audio, visual and textual features. Each video contains audio, visual and textual (subti- 1 tles) content in the three different modalities. To join https://github.com/VideoAnalysis/EDUVSUM Figure 2: Multimodal architecture for classification of important segments in educational videos different modalities we adapt and extend ideas from subtitles provided for each video. The text features are Majumder et al. [19], who apply fusion to three kinds extracted by encoding words in subtitles using BERT of modalities available in videos: visual, audio, and (Bidirectional Encoder Representations from Transform- text. The overall architecture is depicted in Fig. 2. In ers) [21] embeddings. BERT is a pre-trained trans- order to deal with the temporal aspect of videos, we former (denoted as 𝜃𝑇 ) that takes the sentence context use Bidirectional Long Short-term Memory (BiLSTM) into account in order to assign a dense vector repre- layers to incorporate information from each modality sentation to each word in a sentence. The textual fea- [7, 12, 20]. We use state-of-the-art pre-trained mod- tures are 768-dimensional vectors that are extracted by els to encode each modality in order to extract fea- encoding subtitles of videos. Later, these features are tures. After the extraction of feature embeddings for passed to a layer with 64 BiLSTM cells. each modality, they are fed into separate BiLSTM lay- Audio Features: The audio content is utilized by ers. The outputs of these layers are then concatenated means of various features that represent the zero cross- in a time-oriented way and then fed into another BiL- ing rate, energy, entropy or energy, spectral features STM layer, which has 64 units. The output is fed into (centroid, spread, flux, roll-off) and others. In total, two dense layers with size of 32 and 16, respectively. there are 34× 𝑛𝑎 features, where 𝑛𝑎 depends on the Lastly, the output from the last dense layer is fed into window size and step size which are 0.05 and 0.025 % a softmax layer that outputs a 10-dimensional vector of the audio track length in a video. The combination indicating the importance score of a given input video of the rate of change of all these features yields a total frame belonging to a certain segment. In addition to number of 68 features. We use pyAudioAnalysis [22] the current frame, the model also includes history in- toolkit (denoted as 𝜃𝐴 ) to extract these features. These formation that consists of 𝑛 previous frames according features are fed into a layer with 64 BiLSTM units. We to the setting of history window size parameter. Our keep the same number of units in the BiLSTM layer of experimental results show different configurations and all modalities. corresponding results, where we evaluate different his- Visual Features: We explored different visual mod- tory windows sizes. Next, we describe the feature em- els like Xception [23], ResNet-50 [24], VGG-16 [25] beddings for each modality and the corresponding mod- and Inception-v3 [26] pre-trained on ImageNet dataset. els to extract them. Visual content of the videos is encoded using one of Textual Features: The textual content is based on the visual descriptors mentioned above, denoted as 𝜃𝑉 . Our ablation study in Section 5 provides further de- tails on the importance of choice of visual descriptors. Once the features are extracted, they are fed into a BiL- STM layer with a size of 64. Consider a video input of 𝑇 sampled frames, i.e., 𝑉 = (𝑓𝑡 )t=1,. . . ,T , 𝑓𝑡 is the visual frame at point in time t. The variable 𝑇 depends on the number of selected frames per second in a video. The original frame rate is 30 per second (fps) for a video. The input video is split into uniform segments of 5 seconds from which we select 3 frames per second as a sampling rate. The input of the model are the current frame (𝑓𝑡 ) at time step t and the preceding frames (𝑓𝑡−1 , 𝑓𝑡−2 , . . . , 𝑓𝑡−ℎ ) ac- cording to the selected history window size h. The fea- tures from a modality are extracted as defined above and passed to the respective layers. The model outputs an importance score for the given input frame (𝑓𝑡 ). Figure 3: Screenshot of the Web-based annotation tool for labeling video segments 4. Dataset and Annotation Tool We present a Web-based tool to annotate video data forcreated dataset consisting of 98 videos for the experi- various tasks. Each annotator is required to provide amental evaluation of model architectures. The dataset value between 1 and 10 for every 5 second segment of is randomly shuffled before dividing it into disjoint a video. A sample screenshot of the annotation tool train and test splits using 84.7% (83 videos) and 15.3% is shown in Figure 3. The higher values indicate the (15 videos), respectively. The videos are equally dis- higher importance of that specific segment in terms oftributed among the topics of the dataset. The dataset information it includes related to a topic of a video.splits and frame sampling strategy are compliant with We present a new dataset called EDUVSUM (Ed- previous work in the field of video summarization (Zhang ucational Video Summarization) to train video sum- et al. [10], Gygli et al. [13] and Song et al. [6]). marization methods for the educational domain. We We evaluated different configurations of model ar- have collected educational videos with subtitles from chitectures as classification and regression tasks. The three popular e-learning platforms: Edx, YouTube, and experimental configurations include varying visual fea- TIB AV-Portal2 that cover the following topics with ture extractors, history window sizes, audio features, their corresponding number of videos: computer sci- and textual features. In our experiments, we sampled ence and software engineering (18), python and Web 3 frames per second in order to not include too much programming (18), machine learning and computer vi- redundant information where variation the between sion (18), crash course on history of science and engi- consecutive frames is low. This sampling rate corre- neering (23), and Internet of things (IoT) (21). In total, sponds to 10% of the original frame rate of the video the current version of the dataset contains 98 videos which has 30 frames per second. Additionally, we ana- with ground truth values annotated by the main au- lyzed the effects of multimodal information by includ- thor who has an academic background in computer ing or excluding one of the modalities. The results are science. In the future, we plan to provide annotation given in Table 1. All models are trained for 50 epochs instructions and guidance via tutorials on how to use over the training split of the dataset using Adam opti- the software for human annotators. mizer. To avoid over-fitting we applied dropout with 0.2 on BiLSTM layers. Due to many configurations of experimental variables, we listed the best perform- 5. Experimental Results ing four models for each visual descriptor along with In this section, we describe the experimental config- the respective history window sizes and input features urations and the obtained results. We use our newly from specific modalities or all. Each trained model outputs an importance score for every frame in a video. We computed Top-1, Top-2 and 2 https://av.tib.eu/ Top-3 accuracy on the predicted importance scores of Figure 4: Predictions of VGG-16 model for two videos. Left: model prediction with low accuracy (18%), Right: model prediction with high accuracy (34%) each frame by treating it as a classification task. The Table 1 best performing model for Top-1 accuracy is VGG-16 Average accuracy and Mean Absolute Error (MAE) values with a history window size of 2 achieving an accuracy for different visual descriptors and history window (h) sizes. of 26.3, where only visual and textual features are used Modalities: Visual (V), Audio (A), Textual (T). 𝑎𝑣𝑔𝑓 𝑟𝑎 stands for training. The model with Top-2 accuracy is ResNet- for average MAE value based on all frames in a video, 𝑎𝑣𝑔𝑠𝑒𝑔 50 with the history window of 3 that is trained on vi- stands for average MAE for each segment in a video. sual, audio, textual features and it achieves an accu- Visual Features h Accuracy % MAE V A T Top-1 Top-2 Top-3 𝑎𝑣𝑔𝑓 𝑟𝑎 𝑎𝑣𝑔𝑠𝑒𝑔 racy of 47.3. The best performing Top-3 model is again Inception-v3 3 22.34 32.01 55.94 1.93 1.84 VGG-16 with a history window 3, visual and audio fea- 2 3 22.34 22.34 30.98 30.98 55.94 55.94 1.93 1.93 1.84 1.84 × tures, and it achieves an accuracy of 67.9. 2 22.34 47.3 55.94 1.93 1.84 × In addition, we compute the Mean Absolute Error 2 3 23.95 23.48 43.48 44.07 60.2 64.29 1.82 1.73 1.74 1.66 × × (MAE) values for each trained model by treating the VGG-16 1 22.43 47.29 66.33 1.92 1.84 problem as a regression task. Each model listed in 2 22.37 37.47 57.92 1.87 1.81 3 25.55 46.19 67.92 1.51 1.49 × Table 1 includes an average MAE value based on ei- 2 22.91 45.08 58.93 1.83 1.79 × 2 26.26 41.92 63.09 1.6 1.57 × ther each frame (𝑎𝑣𝑔𝑓 𝑟𝑎 ) or segment (𝑎𝑣𝑔𝑠𝑒𝑔 ). We per- 3 25.65 41.28 63.21 1.65 1.62 × formed the following post-processing in order to com- Xception 1 3 23.1 22.34 39.13 30.98 57.33 55.94 1.88 1.93 1.8 1.84 pare the values against ground truth where every seg- 2 22.72 47.17 59.74 1.88 1.8 × ment (5 second window) of a video contains an im- 1 3 22.42 24.04 47.2 37.99 67.12 59.76 1.86 1.82 1.78 1.74 × × portance scores between 1 or 10. As explained above, 2 22.65 44.45 62.39 1.86 1.78 × trained models output an importance scores for each ResNet-50 3 2 22.6 22.39 47.31 37.03 67.11 57.53 1.9 1.92 1.82 1.84 frame in a video. For the calculation of 𝑎𝑣𝑔𝑓 𝑟𝑎 , every 3 24.27 37.66 59.74 1.76 1.71 × 2 22.75 37.25 57.34 1.85 1.81 × frame that belongs to the same segment is assigned the 2 22.69 31.59 56.66 1.85 1.8 × same value in the ground truth videos. For calculation 1 22.67 31.61 57.39 1.81 1.78 × of 𝑎𝑣𝑔𝑠𝑒𝑔 , predicted importance scores of each frame belonging to the same segment are averaged. This av- 5.1. Discussion erage value is then assigned as a predicted value to a segment. The 𝑎𝑣𝑔𝑠𝑒𝑔 is an average MAE between For a deeper analysis of errors made by the trained predicted importance score of a segment and ground models, we plot ground truth labels along with pre- truth. Based on the presented results in Table 1, the dictions and select two videos with relatively low (left model that uses VGG-16 for visual features together video) and high (right video) accuracy. These plots are with audio features and history window of 3 performs shown in Figure 4. The video on the left side has low with the least error for both frame and segment-based accuracy (18%) because the predicted values are far off calculation of the average MAE. from the ground truth. The reason could be the fact that frames in the video have less visual variation and the model predicts the same or similar values for those on Human Information Interaction and Retrieval, frames. Another reason could be that the visual fea- CHIIR ’20, Association for Computing Machin- tures are not well suited for the educational domain, ery, New York, NY, USA, 2020, p. 378–382. URL: since we use pre-trained models on ImageNet dataset https://doi.org/10.1145/3343413.3378001. doi:10. where the task is to recognize distinct 1000 objects. 1145/3343413.3378001. On the other hand, the video on the right side has rel- [2] A. Hoppe, P. Holtz, Y. Kammerer, R. Yu, S. Di- atively high accuracy (34%). Even though the impor- etze, R. Ewerth, Current challenges for study- tance scores for frames are not exact, we can observe ing search as learning processes, Proceedings of that the model predicts lower importance scores when Learning and Education with Web Data, Amster- ground truth values are also lower, and the same pat- dam, Netherlands (2018). tern is observed when importance scores are increased [3] F. Moraes, S. R. Putra, C. Hauff, Contrast- as well. As shown in Table 1, the best model obtains ing search as a learning activity with instructor- an error of 1.49 (MAE) on average, but it is observable designed learning, in: Proceedings of the 27th that most of the important segments (regardless of the ACM International Conference on Information predicted values) are detected by the trained model. and Knowledge Management, CIKM ’18, Associ- ation for Computing Machinery, New York, NY, USA, 2018, p. 167–176. URL: https://doi.org/10. 6. Conclusion 1145/3269206.3271676. doi:10.1145/3269206. 3271676. In this paper, we have presented an approach to pre- [4] P. J. Guo, J. Kim, R. Rubin, How video production dict the importance of segments in educational videos affects student engagement: An empirical study by fusing multimodal information. This study presents of mooc videos, in: Proceedings of the first ACM and validates a working pipeline that consists of lec- conference on Learning@ scale conference, 2014, ture video annotation and, based on that, a supervised pp. 41–50. (machine) learning task to predict importance scores [5] S. Zolotykhin, N. Mashkina, Models of educa- for the content throughout the video. The results show tional video implementation in massive open on- the importance of each individual modality and limi- line courses, in: Proceedings of the 1st Inter- tations of each model configuration. It also highlights national Scientific Practical Conference "The In- that it is not straight forward to exploit the full poten- dividual and Society in the Modern Geopoliti- tial from heterogeneous source of features, i.e., using cal Environment" (ISMGE 2019), Atlantis Press, all modalities does not guarantee a better result. 2019, pp. 567–571. URL: https://doi.org/10.2991/ One further direction of research is to enhance the ismge-19.2019.107. doi:https://doi.org/10. architecture for binary and ternary fusion where modal- 2991/ismge-19.2019.107. ities are fused on different levels. As a second future [6] Y. Song, J. Vallmitjana, A. Stent, A. Jaimes, Tv- direction, we will focus on the release of another ver- sum: Summarizing web videos using titles, in: sion of the dataset that covers more topics and videos. Proceedings of the IEEE conference on computer Finally, we will investigate other types of visual de- vision and pattern recognition, 2015, pp. 5179– scriptors that better fit to the educational domain. 5187. [7] C. H. . A. S. Mubarak, A.A., et al., Pre- Acknowledgments dictive learning analytics using deep learning model in moocs’ courses videos, Part of this work is financially supported by the Leib- Springer, Educ Inf Technol (2020). URL: niz Association, Germany (Leibniz Competition 2018, https://doi.org/10.1007/s10639-020-10273-6. funding line "Collaborative Excellence", project SALIENT doi:10.1007/s10639-020-10273-6. [K68/2017]). [8] N. A. Shukor, Z. Abdullah, Using learn- ing analytics to improve MOOC instruc- tional design, iJET 14 (2019) 6–17. URL: References https://www.online-journals.org/index.php/ i-jet/article/view/12185. [1] G. Pardi, J. von Hoyer, P. Holtz, Y. Kammerer, The [9] C. Tang, J. Liao, H. Wang, C. Sung, Y. Cao, role of cognitive abilities and time spent on texts W. Lin, Supporting online video learning with and videos in a multimodal searching as learn- concept map-based recommendation of learn- ing task, in: Proceedings of the 2020 Conference ing path, in: Extended Abstracts of the 2020 CHI Conference on Human Factors in Comput- tional Workshop on Search as Learning with ing Systems, CHI 2020, Honolulu, HI, USA, April Multimedia Information, SALMM ’19, Associa- 25-30, 2020, ACM, 2020, pp. 1–8. URL: https:// tion for Computing Machinery, New York, NY, doi.org/10.1145/3334480.3382943. doi:10.1145/ USA, 2019, p. 11–19. URL: https://doi.org/10. 3334480.3382943. 1145/3347451.3356731. doi:10.1145/3347451. [10] K. Zhang, W. Chao, F. Sha, K. Grauman, Video 3356731. summarization with long short-term memory, [17] H. N. K. S. YukiIchimura, Keiko Noda, Pre- in: Computer Vision - ECCV 2016 - 14th Eu- scriptive analysis on instructional structure of ropean Conference, Amsterdam, The Nether- moocs:toward attaining learning objectives for lands, October 11-14, 2016, Proceedings, Part VII, diverse learners, The Journal of Information volume 9911 of Lecture Notes in Computer Sci- and Systems in Education 19 N0. 1 (2019) 32–37. ence, Springer, 2016, pp. 766–782. URL: https: doi:10.12937/ejsise.19.32. //doi.org/10.1007/978-3-319-46478-7_47. doi:10. [18] W. Wang, D. Tran, M. Feiszli, What makes 1007/978-3-319-46478-7\_47. training multi-modal networks hard?, CoRR [11] H. Yang, C. Meinel, Content based lecture abs/1905.12681 (2019). video retrieval using speech and video text in- [19] N. Majumder, D. Hazarika, A. F. Gelbukh, formation, IEEE Trans. Learn. Technol. 7 (2014) E. Cambria, S. Poria, Multimodal senti- 142–154. URL: https://doi.org/10.1109/TLT.2014. ment analysis using hierarchical fusion with 2307305. doi:10.1109/TLT.2014.2307305. context modeling, Knowl. Based Syst. 161 [12] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, (2018) 124–133. URL: https://doi.org/10.1016/ T. Tan, Stacked memory network for video sum- j.knosys.2018.07.041. doi:10.1016/j.knosys. marization, in: Proceedings of the 27th ACM In- 2018.07.041. ternational Conference on Multimedia, MM 2019, [20] K. Zhang, K. Grauman, F. Sha, Retrospective en- Nice, France, October 21-25, 2019, ACM, 2019, pp. coders for video summarization, in: Computer 836–844. doi:10.1145/3343031.3350992. Vision - ECCV 2018 - 15th European Conference, [13] M. Gygli, H. Grabner, H. Riemenschneider, Munich, Germany, September 8-14, 2018, Pro- L. V. Gool, Creating summaries from user ceedings, Part VIII, volume 11212 of Lecture Notes videos, in: Computer Vision - ECCV 2014 in Computer Science, Springer, 2018, pp. 391–408. - 13th European Conference, Zurich, Switzer- doi:10.1007/978-3-030-01237-3\_24. land, September 6-12, 2014, Proceedings, Part [21] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: VII, volume 8695 of Lecture Notes in Computer pre-training of deep bidirectional transformers Science, Springer, 2014, pp. 505–520. URL: https: for language understanding, in: Proceedings //doi.org/10.1007/978-3-319-10584-0_33. doi:10. of the 2019 Conference of the North Ameri- 1007/978-3-319-10584-0\_33. can Chapter of the Association for Computa- [14] K. Davila, R. Zanibbi, Whiteboard video summa- tional Linguistics: Human Language Technolo- rization via spatio-temporal conflict minimiza- gies, NAACL-HLT 2019, Minneapolis, MN, USA, tion, in: 14th IAPR International Conference June 2-7, 2019, Volume 1 (Long and Short Papers), on Document Analysis and Recognition, ICDAR Association for Computational Linguistics, 2019, 2017, Kyoto, Japan, November 9-15, 2017, IEEE, pp. 4171–4186. URL: https://doi.org/10.18653/v1/ 2017, pp. 355–362. URL: https://doi.org/10.1109/ n19-1423. doi:10.18653/v1/n19-1423. ICDAR.2017.66. doi:10.1109/ICDAR.2017.66. [22] T. Giannakopoulos, pyaudioanalysis: An open- [15] F. Xu, K. Davila, S. Setlur, V. Govindaraju, Con- source python library for audio signal analysis, tent extraction from lecture video via speaker PloS one 10 (2015). action classification based on pose information, [23] F. Chollet, Xception: Deep learning with depth- in: 2019 International Conference on Document wise separable convolutions, in: 2017 IEEE Con- Analysis and Recognition, ICDAR 2019, Sydney, ference on Computer Vision and Pattern Recog- Australia, September 20-25, 2019, IEEE, 2019, pp. nition, CVPR 2017, Honolulu, HI, USA, July 1047–1054. URL: https://doi.org/10.1109/ICDAR. 21-26, 2017, IEEE Computer Society, 2017, pp. 2019.00171. doi:10.1109/ICDAR.2019.00171. 1800–1807. URL: https://doi.org/10.1109/CVPR. [16] J. Shi, C. Otto, A. Hoppe, P. Holtz, R. Ewerth, 2017.195. doi:10.1109/CVPR.2017.195. Investigating correlations of automatically ex- [24] K. He, X. Zhang, S. Ren, J. Sun, Deep resid- tracted multimodal features and lecture video ual learning for image recognition, in: 2016 quality, in: Proceedings of the 1st Interna- IEEE Conference on Computer Vision and Pat- tern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770–778. URL: https://doi.org/10.1109/ CVPR.2016.90. doi:10.1109/CVPR.2016.90. [25] K. Simonyan, A. Zisserman, Very deep convo- lutional networks for large-scale image recogni- tion, in: 3rd International Conference on Learn- ing Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceed- ings, 2015. URL: http://arxiv.org/abs/1409.1556. [26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architec- ture for computer vision, in: 2016 IEEE Con- ference on Computer Vision and Pattern Recog- nition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 2818–2826. URL: https://doi.org/10.1109/CVPR. 2016.308. doi:10.1109/CVPR.2016.308.