Analysis of Video Lessons: a Case for Smart Indexing and Topic Extraction

Marco Arazzi, Marco Ferretti and Antonino Nocera
Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 5, 27100 Pavia, Italy

Abstract
On-line teaching activity during the pandemic has generated huge amounts of data, mainly in the form of video. Leveraging this rich source of didactic material requires at least two somewhat complementary facilities: smart indexing into the corpus of lessons, and automatic extraction of a list of topics that best represent the main subject of a course, based on the set of its lessons. This paper shows a preliminary attempt to address both issues, using material produced at the University of Pavia in the year 2020, specifically a prototype bachelor course used to develop the tools and the methods to be later applied to the big data repository of the whole set of lessons available.

Keywords
Video indexing, topic modeling, online lessons, university repositories

ITADATA2022: The 1st Italian Conference on Big Data and Data Science, September 20-21, 2022, Milan, Italy
EMAIL: marco.arazzi01@universitadipavia.it (A. 1); marco.ferretti@unipv.it (A. 2); antonino.nocera@unipv.it (A. 3)
ORCID: 0000-0002-3371-307X (A. 1); 0000-0003-3543-2383 (A. 2); 0000-0003-2120-2341 (A. 3)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
One side effect of the pandemic storm that has recently raged all over the world, possibly the only positive one, is the strong thrust for public and private teaching institutions to deliver on their mission by resorting to online teaching in various forms, with different technologies and platforms. The most common means of keeping teaching going has been the provision of streaming video lessons and, in many cases, the setting up of huge archives of recorded video and audio lessons. As an instance of this pattern, at the University of Pavia online lesson delivery was carried out starting with the spring semester of 2020, which began in almost all tracks at the end of February, and lasted throughout the year, continuing into the winter semester 2020/2021, even if it was somewhat intermingled with on-premise activity. Indeed, even when the pandemic allowed for on-premise delivery of lessons, this was combined with reduced-size classes, so that students attended alternately from home and from regular university rooms; thus, video streaming continued along with lesson recording. This alternating mode of lesson attendance called for quick access to online video material, so that a student could watch a whole lesson again or, more profitably, just rehearse those sections that required further attention. Online access to the video, whether for sequential viewing or for inspection when looking for an a-priori unknown spot containing the interesting material, can be a frustrating experience; furthermore, a subject is often treated in many spots, and in many lessons, possibly under different perspectives. A second, correlated issue is obtaining a quick list of the most important topics treated in a lesson and in the set of lessons that make up a course. Obviously, in many cases such a "short list" is contained in the institutional syllabus of the course, but unfortunately there is often great diversity in the depth and accuracy of the content of official syllabi.
The combined availability of these two facilities, namely a "smart video index" and a "smart topics list", can really benefit students, offering them a tool for effectively rehearsing learning material attached to lessons that they attended possibly months earlier, or that they never attended at all. To appreciate the problem in terms of data to be processed, one has first to collect reliable statistics on the size of the archived videos, and secondly to figure out the processing required on some experimental course for which a-priori knowledge is easily available. It is however quite evident that this endeavor can face tight limitations in terms of "privacy" and "intellectual property" issues, so that effective collaboration, both at the personal and at the institutional level, is required. This is especially true if one wants to consistently apply this approach to the whole corpus of video lessons archived by the institution.

To set the ground, an inquiry on the institutional repository that the university uses to store video lessons is severely limited in the detail that can be extracted. The archive is allocated on a contracted Web service, and even system administrators have very limited grants on the metadata of the stored objects. For a limited time span (6 months backwards), it is possible to get a list of file names, filtered by file type (in our case, .mp4 only), but not their size, let alone ownership. Furthermore, the institutional repository can be queried for the overall size of the storage allocated to each account, but this is of no real use, because in this case no breakdown can be produced by file type. Notwithstanding these severe limitations, a fairly reliable estimate of the amount of data available can be obtained by considering that lessons recorded during the pandemic have not been deleted; on the contrary, the university governance has requested professors to keep this material available online for one more academic year. So, the system look-up carried out in May 2022 for this paper shows a total of 19,556 .mp4 files over six months, which is likely an under-estimate of the online material actually available.

The work we are reporting here, therefore, has been developed as a test-bed prototype, by choosing a bachelor-level course (taught by one of the authors), for which full access is available. The set of lessons has been used both for devising a video indexing procedure and for analyzing the feasibility of extracting "meaningful" topics for the whole course.

This paper is organized as follows. Section 2 briefly summarizes the most relevant scientific literature background. Section 3 covers the development of an indexing procedure which allows a student to look up, in the corpus of lessons, "elementary concepts" described by a few significant words, with a proper set of pointers into the videos where such significant words are spoken. The goal is a most practical one, in that the tool targets easy recollection of positions in the huge sequence of spoken utterances, with limited, if any, insight into the semantics of such utterances within the single lesson of the course. Some lexical analysis is however mandatory, but the parts-of-speech used for this task can possibly be tailored to the knowledge domain.
Section 4 instead sets the foundations for the automatic construction of a course syllabus, but predominantly addresses the goal of identifying the most relevant topics using NLP approaches. Section 5 discusses the dimension of the problem in light of the results obtained on the test-bed course and envisions a scenario for selecting a big data approach for managing the whole corpus of online material available in the university repository.

2. Related work
Video indexing is a well-established field and, combined with speech analysis and recognition, has been used in many environments. Initial manual annotations have long been abandoned. Many approaches have been proposed that are best suited for content based on slides and speech [1], and that also try to leverage existing material with an OCR processing phase carried out on slides extracted from the video track [2]. Indeed, OCR is very well developed and precise when applied to typed text; however, this technique is not available if lessons make relevant use of blackboards. This is why audio, though less accurate than OCR, has become a major means of tracking the most important and detailed information. On the negative side, there may still be difficulties in recognizing some voices, and some very specific terms can produce OOV (Out-Of-Vocabulary) word errors. The lack of a strong sentence structure, typical of written material, can also add to the difficulty. Nevertheless, mature applications exist for voice analysis, whose main goal is to minimize WER (Word Error Rate), especially to avoid OOV errors. A limited, non-exclusive list includes Gaudi [3], PodCastle [4], NTU Virtual Instructor [6] and MIT Lecture Browser [7]. A critical point for the purposes of this project is the availability of a good Italian language model, a feature not always embedded in these applications. The actual indexing of a corpus of video/audio recordings has been pursued in some special application domains, such as call centers [8] and broadcast news [9]. Many advanced features developed in these contexts, such as detection of a change in speaker, do not add to the purpose of this research.

As for the second objective of this paper, i.e., the implementation of an indexing strategy based on topics extracted from the video content, several approaches in the related literature focus on the problem of topic modeling from different textual sources. In particular, the approach of [10] identifies the research topics that best describe the scope of a scientific publication. An application called Smart Topic Miner incorporates the solution, enabling editors of Springer Nature journals to annotate publications according to a set of topics drawn from a large ontology of Computer Science-related fields. The authors of [11] try to answer the question "in what research topics was the academic community of Computers & Education interested?". Their approach consisted in bibliometrically analyzing a structural topic model (STM) to identify the topic hotspots of 3,963 articles published in Computers & Education between 1976 and 2018. In the context of education, the approach proposed in [12] focused on the difficulty students have in making informed decisions using the content available through online reviews of academic institutions. The authors of that paper present an ensemble of Latent Dirichlet Allocation (E-LDA) topic models for automatically identifying key features of student discussion and categorizing each review statement into the most relevant topic.
Similarly, the authors of [13] exploit a statistical algorithm (LDA), applied to the complete full-text corpus of one major journal of the field (Biology and Philosophy), to identify the key research topics that span 32 years (1986-2017). Finally, the authors of [14] developed an algorithm that takes the URL of a video as user input and implements a summarization process with the help of two algorithms. The model also gives the user the flexibility to decide what percentage of summary is needed compared to the original lecture. Summarization is a subjective process, and two prominent evaluation methods were incorporated into the model: one is cosine similarity and the other is the ROUGE score. Human-generated summaries are not needed for the former, whereas the latter requires them. Both TF-IDF and Gensim can obtain greater than 90% efficiency via cosine similarity, whereas with ROUGE scores an efficiency of 40-50% can be obtained.

3. Smart Video Indexing
The test bed used to carry out the project is a bachelor course on database technology (Introduction to Data Bases), consisting of 21 recorded lessons, each some 1 h 50 m in length. The course has been delivered in the Italian language, but this has little bearing on the developed methodology. The total space for the online material is about 2.19 GB (the lessons have been recorded in low resolution), with an average size of 104.4 MB per lesson.

As anticipated, the first goal is to build an index that allows searching within the lessons for a set of words, obtaining the time frames within each lesson where the words are spoken. As will be discussed shortly, the "set of words" that turns out to be really meaningful has extremely low cardinality, and eventually only binomials (to be defined later) have been retained. To locate binomials within any lesson, and build the overall index with the list of time frames at which each binomial is spoken, we devised a procedure that leverages existing Web services and applies specific transformations, including lexical analysis and merging, to get the desired final output. Our approach is deliberately simple, because it tries to get the useful information (timestamps) with the least possible effort and the simplest tools, to allow for a scale-out approach.

3.1. Indexing Procedure
The indexing procedure consists of the following steps, carried out on each lesson:
1. Speech-to-text conversion, delivering "tokens" and associated timestamp annotations
2. Token lexical analysis and POS (part-of-speech) tagging
3. Tagged-token grouping by time proximity, forming tentative "binomials" and "trinomials"
4. Binomial and trinomial filtering, by selecting couples (triples) of significant POS tags
5. Lemmatization of selected binomials (trinomials)

Once this procedure has been carried out on each lesson, it is possible to merge the outcomes at the lesson level into a single course-level list of tagged and timestamped binomials (trinomials), which is the basis of the "index". Each index entry actually contains a list of occurrences of the binomial (trinomial) within each lesson and at all timestamps in each lesson. In what follows we briefly give details of each step, keeping the description at a fairly high level of abstraction; namely, we disregard practical implementation details, such as the underlying DBMS and storage, which are really straightforward, at least when addressing a single course.
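As a minimal sketch (not the actual implementation, which depends on the underlying DBMS; all function and variable names are illustrative), the course-level index produced by the merge step can be pictured as a mapping from each binomial to the lessons and timestamps where it is spoken:

```python
from collections import defaultdict

def merge_lesson_indexes(lesson_binomials):
    """Merge per-lesson outputs of steps 1-5 into a course-level index.

    lesson_binomials: dict {lesson_id: [(binomial, timestamp_seconds), ...]}
    returns:          dict {binomial: {lesson_id: [timestamps]}}
    """
    index = defaultdict(lambda: defaultdict(list))
    for lesson_id, entries in lesson_binomials.items():
        for binomial, ts in entries:
            index[binomial][lesson_id].append(ts)
    return index

# Hypothetical per-lesson output of the indexing pipeline.
lessons = {
    1: [("chiave esterna", 732.4), ("modello relazionale", 1204.0)],
    2: [("chiave esterna", 95.1), ("chiave primaria", 2310.7)],
}

course_index = merge_lesson_indexes(lessons)
# e.g. course_index["chiave esterna"] -> {1: [732.4], 2: [95.1]}
```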
3.1.1. Speech-to-text and timestamping
Speech-to-text is a very well-established technology, and the choice is whether to deploy existing packages or to resort to online services, a classical "buy" or "pay per use" alternative. The main constraint in our case is that the timestamp of each token/word must be attached to the transcript. Punctuation is a second feature that may be a benefit if subsequent text analysis is carried out, as is our case for the "topic" identification, the second goal of this project. Finally, a reliable Italian model is also necessary. At the time of writing this paper, several research and industrial speech-to-text solutions exist. The specific strategy to adopt is orthogonal to our proposal; indeed, any existing approach could be applied in our case. To keep our solution scalable, we leveraged industrial strategies based on Cloud Computing. We exploited the solutions provided by YouTube, by the Google Cloud Platform (GCP) [15] and the one available on Amazon Web Services (AWS), namely AWS Transcribe [16]. Since this is a prototype work, we opted for these online "pay per use" services, also leveraging free offers for limited workloads. We performed a WER analysis on the transcripts generated for three lessons of the course, with all platforms delivering acceptable results, YouTube being slightly more precise, and the Google and AWS services providing punctuation.

3.1.2. Lexical analysis and POS tagging
To come up with a "meaningful" search, it is obviously necessary to prune the generated list of tokens/timestamps, and this calls for lexical analysis and POS (Part-of-Speech) tagging. This is where the availability of language-specific tools and services plays its relevant role. In this experiment, we considered the NLTK (Natural Language Toolkit) Python library [17] and, in particular, its pos_tag module. NLTK provides native support for the English language; however, other libraries could be exploited to support other languages, such as TINT [18] for the Italian language. On the outcome of the speech-to-text, we set up an "interesting words list" procedure, namely we selected sequences of time-contiguous words. Since the ultimate goal of this part of the project is to offer students a facility to look up spoken short sentences, we limited the list of time-adjacent words to two/three words. The knowledge domain associated with the test-bed course suggests pruning the "two/three word list" with POS filtering that privileges nouns (N) and adjectives (A); the inclusion of verbs seems to offer no actual advantage. Should this procedure be applied to courses of other knowledge domains (literature, philosophy, and so forth), a different choice could be meaningful. So, we chose to stay with pairs of items in the form {N,N}, {N,A} and {A,N}, called binomials in the following. Lemmatization is the final part of this procedure. The resulting binomials retain the timestamp of the first spoken part and are collected lesson by lesson, to be later merged for generating the actual course index. A minimal sketch of the binomial construction is given below. Section 5 reports quantitative outcomes.
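As a minimal sketch of steps 2-5, assuming an English transcript (the Italian case would require a tagger such as TINT) and a timestamped token list already produced by the speech-to-text step, the binomial construction can be outlined with NLTK as follows; tokens and timestamps are illustrative only:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Tagger and lemmatizer resources (the exact resource name depends on the NLTK version).
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

# Hypothetical output of the speech-to-text step: (word, timestamp in seconds).
tokens = [("primary", 12.3), ("key", 12.7), ("uniquely", 13.1),
          ("identifies", 13.4), ("tuples", 13.9)]

tagged = nltk.pos_tag([w for w, _ in tokens])     # step 2: POS tagging
lemmatizer = WordNetLemmatizer()                  # used in step 5

def is_noun(tag): return tag.startswith("NN")
def is_adj(tag): return tag.startswith("JJ")

binomials = []
for i in range(len(tokens) - 1):                  # step 3: time-adjacent pairs
    (w1, ts1), (w2, _) = tokens[i], tokens[i + 1]
    t1, t2 = tagged[i][1], tagged[i + 1][1]
    # step 4: keep only pairs tagged {N,N}, {N,A} or {A,N}
    if (is_noun(t1) and is_noun(t2)) or (is_noun(t1) and is_adj(t2)) or (is_adj(t1) and is_noun(t2)):
        pair = f"{lemmatizer.lemmatize(w1)} {lemmatizer.lemmatize(w2)}"   # step 5: lemmatization
        binomials.append((pair, ts1))             # binomial keeps the timestamp of its first word

print(binomials)  # e.g. [('primary key', 12.3)]
```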
4. Topics extraction
This section deals with the second objective of the research proposed in this paper, namely the identification of the topics discussed in the video material, so as to enable an advanced indexing mechanism that also considers the concepts included in a course. This research objective requires different advanced NLP tasks to be carried out. In the next subsection we provide a preliminary description of the different strategies adopted to build the complete solution.

4.1. Procedure
Our solution leverages Natural Language Processing strategies specifically designed for textual data. Therefore, starting from the whole repository of a course's video lessons, a preliminary step is the application of a speech-to-text activity, in such a way as to obtain a conversion of the audio content of the videos into text documents. To do so, we applied the same Cloud Computing solution described in Section 3. Once the transcribed text is available, our solution proceeds by performing a pre-processing step designed to prepare the input for the subsequent topic modeling task. We adopted a POS (Part-of-Speech) tagging approach to label each word contained in the textual input with the corresponding grammatical role (e.g., verb, noun, adjective, article, and so forth), by leveraging, once again, the same approach described in Section 3. Of course, in addition to concepts directly related to a specific topic, a lesson will typically include parts that are either derived from interactions with students (such as questions and answers) or related to examples useful to improve students' understanding of a concept. Although these parts could be interesting and deserve further investigation, in our preliminary solution we focus only on the general concepts, because our objective is to build a solution, based on topic modeling, for enhancing the indexing strategy for video lessons. To filter out such content from our analysis, we adopted a supervised approach based on the use of advanced word embeddings along with a dataset of real syllabi.

At this point, leveraging the pre-processed dataset described above, we trained a GloVe model [22] to obtain consistent embeddings of words according to the text in which they are involved. From the syllabi in the training dataset, we extracted the important terms, using the same POS-tagging strategy described above, and selected from our input textual data the k most similar words according to their representation obtained by the GloVe model (preliminarily, we set k=20). This step is required to filter out noisy words from the running text, possibly referring to the additional parts mentioned above (e.g., examples, interactions with students, and so forth), which are typically not included in course summaries and syllabi. Figure 1 shows the steps performed to filter out noisy words from the running text.

Figure 1: The strategy adopted to create a dictionary of relevant terms

In this way, we obtained a rich dictionary of terms that are inherent to what is "typically" included in a manually written syllabus. By leveraging the knowledge represented in this dictionary, it was possible to create a filter to remove all the portions of text from the lesson recordings that were not related to the concepts "described" in the dictionary (and, hence, in the reference training syllabi).
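A minimal sketch of the dictionary construction, assuming word vectors already available in word2vec text format and loaded with gensim (the actual solution trains a GloVe model on the lesson corpus; file names, terms and variable names are placeholders):

```python
from gensim.models import KeyedVectors

# Placeholder file: pre-computed word vectors for the lesson corpus in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("lesson_glove_vectors.txt")

def build_dictionary(syllabus_terms, k=20):
    """Expand the important syllabus terms with their k most similar words (k=20 in our setting)."""
    dictionary = set()
    for term in syllabus_terms:
        if term in vectors:
            dictionary.add(term)
            dictionary.update(w for w, _ in vectors.most_similar(term, topn=k))
    return dictionary

def filter_transcript(words, dictionary):
    """Keep only the words related to the concepts captured by the dictionary."""
    return [w for w in words if w in dictionary]

# Hypothetical usage with a few syllabus terms and a short transcript fragment.
syllabus_terms = ["relazione", "attributo", "chiave", "schema"]
relevant_terms = build_dictionary(syllabus_terms)
clean_words = filter_transcript(["oggi", "parliamo", "di", "chiave", "primaria"], relevant_terms)
```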
As a next step, we proceeded by performing a topic extraction task using the BERTopic approach [19]. This approach consists of the following steps:
1. Organize the text into paragraphs.
2. Create sentence embeddings using SBERT [5].
3. Reduce dimensionality using UMAP [21].
4. Cluster using HDBSCAN [20].
5. Extract keywords using a variation of TF-IDF.

As for the first point, we organized our input text corpus into paragraphs by adopting a window-based strategy: we imposed a limit on the number of words, say w, of each paragraph and, hence, organized our text into portions of w words (in our preliminary experiments we set w=25). After that, following the strategy of [19], we applied the SBERT algorithm to the paragraphs obtained in the previous step and obtained sentence-level embeddings. Due to the high dimensionality of these embeddings, to be able to cluster sentences together, thus obtaining clusters of concepts, we applied a dimensionality reduction strategy. In particular, in the scientific literature UMAP has proved particularly adequate in contexts similar to ours; therefore, again as done in [19], we selected it as the dimensionality reduction algorithm. At this point, any clustering strategy could be exploited to group sentences together using their reduced embeddings. Leveraging, once again, recent results reported in the scientific literature, HDBSCAN has proved to achieve very high performance when applied to cluster text embeddings after a UMAP-based dimensionality reduction. Finally, we used the modified version of the TF-IDF algorithm described in [19] to select the most relevant keywords for each cluster (i.e., the most relevant words in the sentences of a cluster). Figure 2 shows a graphical representation of the steps described above to obtain topics along with their keywords.

Figure 2: The adopted flow for topic modeling (BERTopic)
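A minimal sketch of this flow, assuming the open-source bertopic package (which chains SBERT embeddings, UMAP, HDBSCAN and a class-based TF-IDF) and a plain-text transcript file; the multilingual embedding model, file name and parameters are assumptions, not the exact configuration used:

```python
from bertopic import BERTopic

def to_paragraphs(words, w=25):
    """Window-based 'paragraphs' of w words (w=25 in our preliminary experiments)."""
    return [" ".join(words[i:i + w]) for i in range(0, len(words), w)]

# Placeholder transcript file produced by the speech-to-text step.
transcript_words = open("lesson_transcript.txt", encoding="utf-8").read().split()
paragraphs = to_paragraphs(transcript_words)

# Multilingual SBERT backend to cope with Italian text; a reasonably large number
# of paragraphs is needed for HDBSCAN to form meaningful clusters.
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(paragraphs)   # UMAP + HDBSCAN + c-TF-IDF

# Keywords of the first discovered topic, e.g. [('dipendenza', 0.12), ('funzionale', 0.10), ...]
print(topic_model.get_topic(0))
```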
5. The Big Picture
In what follows, we present the quantitative assessment resulting from processing the 21 "proof-of-concept" lessons of the course "Basi di Dati". Table 1 lists the main measures; other values, such as the length of pauses, the duration of the speech and other similar quantities, are not interesting for the purpose of this project, although they carry other interesting information.

Table 1
Main measures on the 21 testbed lessons

Lesson #   .mp4 (MB)   tokens   tokens (KB)   binomials   words
1          92.4        10491    229           2966        1117
2          108.5       11676    255           3150        991
3          102.9       9555     198           2538        803
4          105.4       9646     212           2624        840
5          114.3       8349     181           2210        703
6          128.7       9281     201           2559        824
7          123.2       9873     211           2613        874
8          72.2        7530     163           2191        828
9          94.2        9290     201           2435        901
10         100.0       10830    234           2956        1032
11         85.7        10656    233           2945        1063
12         89.5        9796     215           2722        931
13         73.1        6569     144           1641        591
14         124.2       11553    252           2994        1052
15         104.1       11043    241           2971        960
16         96.9        10251    224           2768        882
17         118.3       9541     207           2442        797
18         112.4       8143     179           2097        761
19         113.2       9520     206           2469        776
20         97.6        10927    238           2961        1050
21         135.9       10630    230           2860        902
totals     2192.7      205150   4454          55112       18678
avg        104.4       9769.0   212.1         2624.4      889.4

After processing each lesson, getting the tokens, and applying the pruning procedure, one obtains the required binomials. The following phase merges the binomials, effectively building the binomial index at the course level. The aggregated set of binomials accounts for some 18,920 entries. A snapshot is available in Table 2, which lists the first 10 entries (the English wording has been added for the general reader). The table is a logical representation of the index, as already anticipated, since actual DBMS entries depend on the logical model of the DBMS and might be mapped to quite different physical constructs.

Table 2
Binomial entries (course level), first 10 of 18,920. English translation added.

Binomial                                           In lessons                                          Tot count   # lessons
chiave esterna (foreign key)                       2 3 4 5 6 8 9 10 17 20                              99          10
dipendenze funzionali (functional dependencies)    17 18 19 21                                         97          4
chiave primaria (primary key)                      2 3 4 6 8 9 15 17 18 19 20 21                       92          12
basi dati (database)                               1 2 4 6 10 11 12 14 15 16 18 20                     63          12
modello relazionale (relational model)             1 2 3 4 6 7 8 12 14 15 16 17 18 21                  61          14
target list                                        3 5 6 7 8 9 10 20 21                                60          9
vincoli integrità (integrity constraints)          2 3 4 6 8 9 10 12 17 18 21                          59          11
punto vista (point of view)                        1 2 3 5 6 7 8 9 10 11 12 13 14 15 17 18 19 20 21    51          19
legame associativo (associative link)              12 13 14 15 16 17                                   42          6
codice fiscale (personal ID)                       2 4 6 12 13 15 17 18 19 21                          41          10

While the topmost entries show a good representation of the most common bi-words spoken, which match very well the knowledge domain of the testbed course, some spurious entries also appear (such as "punto vista", "codice fiscale"). The wording attitudes of the speaker, as well as recurrent concepts (the personal ID used as an instance of primary key in many examples during the lessons), make the index somewhat cluttered.

At this stage it is possible to outline the scenario, should the proof-of-concept be brought into production at the institutional level, that is, should all the online material be analyzed and transformed accordingly. While it is not possible to have a precise set of figures, since the distribution of file sizes is not known, one can assume that each of the 19,566 files available at the institutional level represents a "lesson" and brings about a number of "tokens" close to the average extracted from the proof-of-concept testbed. This amounts to producing some 2x10^8 tokens in about 2x10^4 x 0.2 MB = 4x10^3 MB of files to be processed. These figures, while not in the order of terabytes, start requiring a big-data approach, if one wants to prepare an architecture that is capable of ingesting online material on a semester basis and producing reliable indexing in almost "real time", that is, while the lessons are being delivered by professors.

This consideration becomes even more relevant if the topic extraction strategy to improve smart indexing of video lessons is also considered. In fact, the extraction of topics requires the elaboration of tokens at the sentence level and the execution of the whole flows depicted in Figure 1 and Figure 2 and described in Section 4. Therefore, every time a new lesson is added to the repository, the approach will rebuild the dictionary of "relevant" terms by re-training the GloVe model. An example of the dictionary obtained for the 21 "proof-of-concept" lessons of the course "Basi di Dati" is reported in Table 3, which lists the 10 most frequent terms of the whole dictionary of 700 terms.

Table 3
Top frequent terms of the dictionary obtained for the proof-of-concept

Term                        Frequency
relazione (relationship)    879
attributo (attribute)       866
chiave (key)                689
schema                      649
entità (entity)             583
concetto (concept)          489
proiezione (projection)     375
vincolo (constraint)        359
esterno (foreign)           343
dbms                        307

After that, the approach exploits the flow of Figure 2 to extract topics and to identify the keywords that best represent the concept encoded by each of them. As an example, the list of topics along with their keywords for the "proof-of-concept" is reported in Table 4.
Table 4
The topics and keywords extracted by our approach for the proof-of-concept

Topic ID   Keywords
Topic 0    {dipendenza (dependency), funzionale (functional), dipendenza funzionale (functional dependency), forma (form), determinante (determinant), relazione (relation), forma normale (normal form)}
Topic 1    {proiezione (projection), espressione (expression), attributo (attribute), sigma, restrizione (constraint), predicare (predicate), relazione (relationship)}
Topic 2    {query, tabella (table), operatore (operator), query query, cartesiano (cartesian), prodotto (product), algebra}
Topic 3    {dbms, applicazione (application), ambiente (environment), cloud, connessione (connection), sistema (system), base dato (data base), rete (network)}
Topic 4    {chiave (key), primario (primary), chiave primario (primary key), chiave esterno (foreign key), chiave chiave (key key), vincolo vincolo (constraint constraint), relazione chiave (key relationship)}
Topic 5    {tabella (table), dominio (domain), lista (list), table, tabella tabella (table table), esempio (example), dato (datum)}
Topic 6    {entità (entity), concetto (concept), associazione (association), identificatore (identifier), associativo (associative), associazione logica (logic association), logica (logic)}
Topic 7    {modello (model), concettuale (conceptual), progettazione (design), modello relazionare (relational model), schema, fase (phase), logico (logic)}

By inspecting this table, we can, for instance, see that the first identified topic, namely Topic 0, contains the keywords {dipendenza (dependency), funzionale (functional), dipendenza funzionale (functional dependency), forma (form), determinante (determinant), relazione (relation), forma normale (normal form)}. By inspecting this set of keywords, for which we also included the English translation in parentheses, it is possible to associate with this topic the concept of functional dependency. Our strategy can, hence, associate each paragraph included in the content of the video lesson with one of the obtained topics, thus providing an enhanced semantic indexing.

6. Discussion and Conclusion
This paper described a preliminary attempt to build an intelligent system to support the fruition of the huge amount of video data produced during the Covid-19 pandemic. In particular, we focused both on the definition of a smart indexing mechanism and on an approach to extract the main subjects discussed during the lessons of a course. The experimental campaign has been carried out by leveraging the material produced during the "Basi di Dati" course taught by one of the authors. Although the initial objective was to develop the solution to provide immediate support to students, due to technical and bureaucratic aspects we were not yet able to deploy such a support system and make it available to our students at the time of writing this manuscript. As a future research direction, we are hence planning to complete the deployment of our strategy and to analyze usage data to study the impact of such a support system on the fruition quality and usability of online material. Also, because we could not deploy our solution in a real live system, the study presented in this paper refers to a limited case of just 21 videos. Of course, due to this limitation, we could not investigate the performance of our solution when applied to a real big data scenario.
However, we argue that the application of such a solution to support all the courses taught in a single university would alone require an ad-hoc big-data strategy to allow for scalability. On another note, our approach has been designed so that, once trained on an initial dataset, it can be used to also support any new course, provided it refers to concepts included in the initial training. This would allow the easy extension of our solution to improve the fruition of video material also in disparate contexts. Indeed, the knowledge base obtained by training it on data derived from the university domain could be used to build enhanced smart indexing facilities also for streaming contexts such as online conferences, webinars, video streaming, and so forth. The application of our approach to these additional scenarios would require access to its functionalities in pseudo real-time during the live events. Therefore, we are planning to study suitable big data architectures to complete the preliminary solution described in this paper and to further develop and refine it according to the research directions described above.

7. Acknowledgements
The authors wish to thank Marco Prina for contributing to some preliminary explorative analyses related to the study described in this paper.

8. References
[1] H. Yang and C. Meinel, "Content based lecture video retrieval using speech and video text information," IEEE Transactions on Learning Technologies, vol. 7, no. 2, pp. 142-154, 2014.
[2] N. Van Nguyen, M. Coustaty and J. M. Ogier, "Multi-modal and cross-modal for lecture videos retrieval," in 2014 22nd International Conference on Pattern Recognition, 2014.
[3] C. Alberti, M. Bacchiani, A. Bezman, C. Chelba, A. Drofa and H. Liao, "An audio indexing system for election video material," in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.
[4] J. Ogata and M. Goto, "PodCastle: A spoken document retrieval system for podcasts and its performance improvement by anonymous user contributions," in Proceedings of the Third Workshop on Searching Spontaneous Conversational Speech, 2009.
[5] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," arXiv preprint arXiv:1908.10084, 2019.
[6] L. L. S. Kong, M. Wu, C. Lin, Y. Fu, Y. Chung, Y. Huang and Y. Chen, "NTU Virtual Instructor: A spoken language system offering services of learning on demand using video/audio/slides of course lectures," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.
[7] C. Chelba, T. J. Hazen and M. Saraclar, "Retrieval and browsing of spoken content," IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 39-49, 2008.
[8] M. Garnier-Rizet, G. Adda, F. Cailliau, S. Guillemin-Lanne, C. Waast-Richard, L. Lamel and S. Vanni, "CallSurf: Automatic transcription, indexing and structuration of call center conversational speech for knowledge extraction and query by content," in Proceedings of the International Conference on Language Resources and Evaluation, Marrakech, Morocco, 2008.
[9] J. Makhoul, F. Kubala, T. Leek, D. Liu, L. Nguyen, R. Schwartz and A. Srivastava, "Speech and language technologies for audio indexing and retrieval," Proceedings of the IEEE, vol. 88, no. 8, pp. 1338-1353, 2000.
[10] A. Salatino, F. Osborne, A. Birukou and E. Motta, "Improving editorial workflow and metadata quality at Springer Nature," in International Semantic Web Conference, 2019.
Xie, "Detecting latent topics and trends in educational technologies over four decades using structural topic modeling: A retrospective of all volumes of Computers & Education," Computers & Education, vol. 151, p. 103855, 2020. [12] S. Srinivas and S. Rajendran, "Topic-based knowledge mining of online student reviews for strategic planning in universities," Computers \& Industrial Engineering, vol. 128, pp. 974- -984, 2019. [13] C. Malaterre, D. Pulizzotto and F. Lareau, "Revisiting three decades of Biology and Philosophy: A computational topic-modeling perspective," Biology \& Philosophy, vol. 35, no. 1, pp. 1--25, 2020. [14] K. Kulkarni and R. Padaki, "Video Based Transcript Summarizer for Online Courses using Natural Language Processing," in 2021 IEEE International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS), 2021. [15] Google Cloud Platform, "Cloud Speech-to-Text," 13 June 2022. [Online]. Available: https://cloud.google.com/speech-to-text. [16] Amazon Web Services, "Amazon Transcribe," 13 June 2022. [Online]. Available: https://aws.amazon.com/it/transcribe/. [17] N. Team, "Natural Language Toolkit," 13 June 2022. [Online]. Available: https://www.nltk.org/. [18] Fondazione Bruno Kessler, "TINT – THE ITALIAN NLP TOOL," 13 June 2022. [Online]. Available: https://dh.fbk.eu/research/tint/. [19] G. Maarten, "BERTopic: Neural topic modeling with a class-based TF-IDF procedure.," arXiv preprint arXiv:2203.05794, 2022. [20] J. H. a. S. A. L. McInnes, "hdbscan: Hierarchical density based clustering.," J. Open Source Softw., vol. 2(11), 2017. [21] L. McInnes, J. Healy and J. Melville, "Umap: Uniform manifold approximation and projection for dimension reduction.," arXiv preprint arXiv:1802.03426, 2018. [22] J. Pennington, R. Socher and C. D. Manning, "Glove: Global vectors for word representation.," in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014.