=Paper=
{{Paper
|id=Vol-1983/paper_02
|storemode=property
|title=Applying Natural Language Processing to Speech Transcriptions for Automated Analysis
of Educational Video Broadcasts
|pdfUrl=https://ceur-ws.org/Vol-1983/paper_02.pdf
|volume=Vol-1983
|authors=Maurizio Montagnuolo,Roberto Borgotallo,Silvia Proscia,Laurent Boch
|dblpUrl=https://dblp.org/rec/conf/aiia/MontagnuoloBPB17
}}
==Applying Natural Language Processing to Speech Transcriptions for Automated Analysis of Educational Video Broadcasts==
Maurizio Montagnuolo, Roberto Borgotallo, Silvia Proscia, Laurent Boch

RAI Radiotelevisione Italiana, Centro Ricerche e Innovazione Tecnologica, Torino, Italy
{maurizio.montagnuolo, roberto.borgotallo, silvia.proscia, laurent.boch}@rai.it

Abstract. This paper describes the results of work carried out by RAI within the framework of the project “La Città Educante”, aimed at creating statistical models for automatic document categorization and named entity recognition, both operating in the educational field and in the Italian language. The taxonomy used for document categorization is the Scientific Disciplinary Sector (SSD) taxonomy used in Italy to organize the disciplines and thematic areas of higher education. The actual statistical models were created with the Apache OpenNLP libraries. The obtained results showed fairly good accuracy for SSD document classification and a significant improvement in categorization precision for named entities compared to the state of the art.

Keywords: video categorization, ontology, named entity recognition

1 Introduction and Related Work

With the increase in the number of multimedia contents produced every day by a wide range of applications and devices, users can now access huge amounts of data in archives and repositories located all over the world. As an example, in the educational domain, teachers can now circulate their lectures through online video repositories such as VideoLectures.NET (http://videolectures.net/, last accessed September 2017). Effective and comprehensive annotations are needed in order to quickly and easily access those contents. In the literature, many approaches can be found that address the problem of video analysis and retrieval.
Though, almost all of the proposed research targets specific and restricted domains, such as sports and news broadcasting [8, 9, 13, 14, 18], and very little specifically targets the educational domain. In [7], a visual content classification system (VCCS) combining a support vector machine (SVM) and optical character recognition (OCR) to classify visual content into figures, text and equations was proposed. Based on the assumption that the text in a lecture video is directly related to the lecture content, in [17] an approach for automated lecture video indexing based on video OCR technology was presented. Speech is the most natural way of communication and carries most of the information in nearly all lectures. Therefore, it is of clear advantage if the speech information can be used for automated analysis and annotation of lecture videos. In [6], a proposal to transform the video subject classification problem into a text categorization problem by exploiting the extracted transcripts of videos was presented. In [16], both video OCR and automatic speech recognition (ASR) were applied to develop a framework for automated analysis and indexing of German video lectures. When a large number of videos is collected, filtering them by subject is especially important because it allows users (e.g. students) to find specific information on desired themes. However, subjects may often overlap with each other, as for example medicine and biology, or telecommunications and computer science. To address this problem, fuzzy clustering for video lecture classification was proposed in [4]. The “La Città Educante” project (http://www.cittaeducante.it, last accessed September 2017) aims to develop new models of learning and teaching by exploring, developing and evaluating innovative technologies for knowledge extraction, management and sharing.
These include content-based video analysis, segmentation and summarization [1, 2, 10], text analysis and semantic categorization, data mashups and visualization. This paper describes the use of natural language processing for the development of a framework for the automatic annotation of educational video broadcasts. The problem is shifted from the video domain to the textual domain by performing ASR on the input video and then applying document categorization and named entity recognition to the transcribed texts. The remainder of the paper is organized as follows. Section 2 describes the theoretical background on which this work is grounded. Section 3 describes the experimental study we conducted. Section 4 concludes the paper with some summary remarks.

2 Video Annotation as Text Analysis Process

This section describes the theoretical background on which automatic video annotation was performed. The assumption is that video annotation can be shifted to a language processing problem by analyzing the text automatically derived from the speech content of broadcast material. Two problems are approached, i.e. document categorization and named entity recognition from spoken documents. To address these problems, the Apache OpenNLP library (https://opennlp.apache.org/, last accessed September 2017) was used.

2.1 Categorization Criteria

RAI has multi-year experience in automatic metadata extraction in the news domain and has developed technologies that are also used internally for automatic news categorization. In this context, a proprietary classification schema already adopted in manual documentation is used [11]. The educational context of “La Città Educante”, despite some similarities, is clearly different. Therefore, we decided to adopt a formal and authoritative classification schema conveying the subjects of study of secondary (lower and/or higher) and university schools.
The following subsections describe the classification system used and its implementation in accordance with Semantic Web techniques.

Definition of the Categorization System. In order to define the most appropriate classification schema, various options were considered, including the names of the subjects of study at the various levels of education. As an alternative, the organization of the contents of Skuola.net (http://www.skuola.net/, last accessed September 2017), a portal that keeps and shares teaching materials, notes and news addressed to both middle/high school and university students, was considered. From this analysis, it can be observed that topics become more and more specific as the level of education increases. However, the grouping into areas does not follow a coherent set of criteria, as it may be tailored to reflect the goals of the different institutes. In the case of Skuola.net, topics are mapped to the specific faculties and examinations that attract the greatest interest among the portal users. This latter approach is certainly interesting, as it represents a categorization that takes into account the point of view of the final user. However, in the absence of a specific model, it is preferable to adopt a set of categories that are neutral with respect to the goals of particular communication operators. Thus, we decided to adopt the Scientific Disciplinary Sectors (SSD) classification system (http://attiministeriali.miur.it/anno-2015/ottobre/dm-30102015.aspx, last accessed September 2017), which is used in Italy for the organization of higher education, as the reference for subject classification. The SSD, defined by the Italian Ministry of Education, University and Research (MIUR), is organized hierarchically on four levels. There are 14 areas at the top level, 88 macro-sectors at the second level, 188 competition sectors at the third, and 367 scientific-disciplinary sectors at the last level. Hierarchical classification allows the attribution of one (or more) categories at an appropriate degree of specificity/generality.
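The hierarchy lends itself to a very simple representation. The following Python sketch resolves a macro-sector code such as 12/F to its top-level area; the area labels are abbreviated English renderings for illustration only, not the project's actual data model:

```python
# Sketch of the first two SSD levels: top-level areas keyed by number,
# macro-sectors identified by codes such as "12/F" (area 12, macro-sector F).
# Labels are abbreviated and illustrative.
SSD_AREAS = {
    "1": "Mathematics and informatics",
    "5": "Biological sciences",
    "6": "Medical sciences",
    "12": "Juridical sciences",
}

def area_of(macro_sector_code: str) -> str:
    """Map a macro-sector code like '12/F' to its top-level SSD area label."""
    area_id = macro_sector_code.split("/")[0]
    return SSD_AREAS[area_id]

print(area_of("12/F"))  # Juridical sciences
```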
We limited the work to the first two levels of the hierarchy, as the number of resulting classes is comparable to what was already done for journalism at RAI [12]. Nothing prevents, however, subsequently applying a dynamic criterion so that particularly populous classes can be further subdivided into sub-categories. The OpenNLP Document Categorizer tool was used to perform automatic categorization of input texts according to the defined taxonomy. This tool uses the maximum entropy framework to train a categorization model based on pre-annotated corpora.

Categorization System Implementation. The Semantic Web aims to foster knowledge sharing through interoperability and data exchange. For this purpose, the World Wide Web Consortium (W3C) defined a number of formal languages that allow the development of systems for the integration of heterogeneous data sources, such as those from multimedia archives, social networks and open data. Among them, the Simple Knowledge Organization System (SKOS, https://www.w3.org/2004/02/skos/, last accessed September 2017) represents the state of the art for describing structured vocabularies, such as glossaries, classification systems, and taxonomies. SKOS was designed to be easily extensible, in order to allow the connection and sharing of knowledge between different databases. The main SKOS elements are:

– Concepts, which are elementary information units representing the ideas, meanings, objects, or events at the basis of a knowledge organization system.
Concepts can be grouped into collections and/or organized according to a particular scheme (ConceptScheme);

– Labels and Notations, which are sets of words or phrases, possibly in different languages, that describe a concept;

– Relationships, such as hierarchies and equivalences, between concepts within the same vocabulary (Semantic Relations) and between concepts defined in different vocabularies (Mapping Relations).

Some examples of taxonomies implemented in SKOS include the multilingual thesaurus of the European Union (Eurovoc, http://eurovoc.europa.eu/, last accessed September 2017) and the thesaurus of the Wikipedia categories. As part of “La Città Educante”, the Scientific Disciplinary Sectors (SSD) taxonomy has been implemented in SKOS through an iterative process defined by the following steps:

– Definition of concepts. A SKOS concept was created for each SSD area and macro-sector, identified by the corresponding alphanumeric code defined by the ministerial ordinance. For example, the code 12/F indicates macro-sector F in area 12 of the SSD;

– Definition of the main scheme. In order to group all the concepts of the ontology, a top-level SKOS ConceptScheme was defined, to which all the SSD areas are related by the hasTopConcept/topConceptOf properties of the SKOS standard;

– Concept description. Each concept was described with a label (prefLabel) based on the name (Italian and English) of the corresponding denomination defined by the ministerial decree. For example, the 'Diritto processuale civile' and 'Civil procedural law' labels were assigned to the macro-sector 12/F, for the Italian and English languages respectively;

– Definition of intra-schema hierarchy. Hierarchical relationships between areas and macro-sectors were defined using the broader and narrower properties of the SKOS standard;

– Definition of inter-schema relationships.
In order to promote interoperability with other classification schemes, each SSD concept was mapped to at least one concept of the Eurovoc thesaurus and of the DBpedia categories. Eurovoc was selected because it allows a unique classification of documents in the European Union's institutional databases, irrespective of the language used in the documents themselves. DBpedia was chosen as the central hub of the Linked Open Data (LOD) network. Furthermore, in order to ensure interoperability with the RAI archives, each SSD concept was mapped to the classification schema used for the documentation of RAI's archives.

Inter-schema relationships were defined according to the following criteria:

– The SKOS exactMatch property relates concepts identified by the same label in the considered ontologies. For example, the SSD macro-sector 'Civil procedural law' was mapped exactly to the 'Civil procedure' DBpedia category;

– The SKOS narrowMatch property (respectively, broadMatch) relates more specific (more generic) concepts in the considered ontologies. For example, the SSD macro-sector 'International Law, European Union, Comparison, Economy, Markets and Navigation' has narrowMatch relations to 'Business law', 'Comparative law', 'European Union law', and 'International law' in DBpedia;

– The SKOS relatedMatch property relates concepts that share one or more features but whose mapping does not fall into any of the previous cases. For example, the SSD macro-sector 'Astronomy, Astrophysics, Physics of Earth and Planets' has a relatedMatch relation to 'Planetary science' in DBpedia.

An excerpt from the ontology is shown in Fig. 1.

Fig. 1. Excerpt from the SSD ontology.
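As a concrete illustration, the concept for macro-sector 12/F and its mappings could be rendered in SKOS/Turtle roughly as follows. The ssd: namespace, the resource names, and the Italian area label are invented for the example; only the SKOS vocabulary itself is standard:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ssd:  <http://example.org/ssd/> .   # illustrative namespace

ssd:scheme a skos:ConceptScheme ;
    skos:hasTopConcept ssd:area12 .

ssd:area12 a skos:Concept ;
    skos:prefLabel "Scienze giuridiche"@it ;
    skos:topConceptOf ssd:scheme ;
    skos:narrower ssd:ms12F .

ssd:ms12F a skos:Concept ;
    skos:notation "12/F" ;
    skos:prefLabel "Diritto processuale civile"@it , "Civil procedural law"@en ;
    skos:broader ssd:area12 ;
    skos:exactMatch <http://dbpedia.org/resource/Category:Civil_procedure> .
```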
2.2 Named Entity Recognition

For the creation of the named entity recognition (NER) models we adopted a semi-supervised approach, a technique commonly used in machine learning when, given a large dataset, only a subset has annotations. The complete manual annotation of an entire dataset is a long and expensive process, not exempt from human mistakes. For example, Xue et al. [15] reported that approximately two years of work were needed for the manual annotation of about 4,000 sentences for natural language analysis applications. The actual creation and assessment of the NER models was done in three steps. First, a set of web articles falling within the project's educational scope (covering economics, science, technology, health, etc.) was selected from a larger and more generic set of news articles (also including off-topic material such as sport and politics) pre-annotated with named entity information (persons, locations and organizations) by an automatic system already available at RAI. These selected texts were adapted to make them similar to those coming out of a basic ASR process, by removing punctuation and capital letters; the final training corpus consists of around 47,000 sentences. This number is three times larger than the minimum size recommended by the Apache OpenNLP documentation. In the second step, we used the TokenNameFinderTrainer tool in the Apache OpenNLP library to generate the new NER models. Similarly to document categorization, the TokenNameFinderTrainer tool creates a maximum-entropy-based name finder, hence the output of this step consists of three binary files representing the OpenNLP models for the corresponding categories of entities considered. Finally, we ran the new models on a set of automatic speech transcriptions from RAI's broadcasts and manually validated them by assessing the entities found within the input material.
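The ASR-style adaptation of the training texts (removing capital letters and punctuation) can be sketched as follows; this is a minimal illustration of the idea, not the exact preprocessing pipeline used in the project:

```python
import re
import string

def asr_normalize(text: str) -> str:
    """Make clean text resemble basic ASR output: no capitals, no punctuation."""
    text = text.lower()
    # Strip ASCII punctuation, then collapse any leftover runs of whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(asr_normalize("Il Ministro, a Roma, ha parlato."))
# il ministro a roma ha parlato
```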
3 Experimental Results

This section describes the experiments that we undertook to demonstrate the effectiveness of the presented methods. The reference dataset includes 2,020 episodes of TGR Leonardo, a science newscast produced by RAI and broadcast daily from Monday to Friday on the RAI3 channel. Each episode, lasting approximately 10 minutes, features news about technology, health, economy, environment and society. The format is that of a traditional newscast. Each episode was automatically segmented and transcribed into elementary news stories using the RAI ANTS system [11], resulting in about 6,600 news items.

3.1 Ground Truth Generation for Document Categorization

This section describes the guidelines adopted at the annotation stage as well as the composition and organization of the reference dataset. Assigning SSD areas and macro-sectors requires a good understanding of the hierarchy of categories and careful listening and evaluation of the content. The hierarchy of categories needs to be well known because some words can have a wider or narrower meaning with respect to common usage. For example, the term pedagogy does not simply indicate the education of children/adolescents (as often intended) but of people of all ages. To create the ground-truth categories we assigned one or at most two areas/macro-sectors to each news item, according to their importance with respect to the news item topic. For example, if a news item talks about the use of information technology in secondary schools, the news item would be annotated with 'History, philosophy, pedagogy and psychology' and 'Mathematics and informatics' as primary and secondary SSD areas, respectively. All detected news items were manually annotated by a group of 10 people. The total number of annotated news items is 6,608, corresponding to approximately 243 hours of audio-visual material. The size of this dataset is in line with those adopted in similar works [5].
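The annotation guideline above (one, or at most two, areas per news item, ordered by importance) can be expressed as a small validating helper; the field names and record layout here are illustrative, not the project's actual annotation schema:

```python
def annotate(item_id: str, areas: list) -> dict:
    """Build an annotation record with a primary and optional secondary SSD area."""
    if not 1 <= len(areas) <= 2:
        raise ValueError("each news item gets one or at most two SSD areas")
    record = {"id": item_id, "area1": areas[0]}
    if len(areas) == 2:
        record["area2"] = areas[1]
    return record

# IT in secondary schools: pedagogy-related area primary, informatics secondary.
print(annotate("news-0042",
               ["History, philosophy, pedagogy and psychology",
                "Mathematics and informatics"]))
```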
The distribution of the annotated SSD areas is shown in Fig. 2. The number of news items annotated with two areas is 1,593, which corresponds to around a quarter of the total (i.e. about 6,600). Considering only the main category (Area1), there is fairly good coverage of the annotations. For the purpose of training an automatic classification system, the use of a second category does not add significant information. However, it is interesting to analyze the relationships between the areas in multiple-annotation cases, as shown in Fig. 3. From these data, the following cases of interest can be identified:

– Mono-area and multi-sector news items. A news item that may be classified according to more than one macro-sector within the same area;

– Multidisciplinary news items. A news item that may be classified in two or more different areas. For example, a news item on legal regulations on the use of drugs might be annotated primarily as Juridical Sciences (Area 12) and secondarily as Biological Sciences (Area 5);

– Doubtful news items. A news item that, due to the nature of its topic, may be misclassified. For example, the area of Biological Sciences (Area 5) might be confused with the area of Medical Sciences (Area 6) and vice versa.

Similarly, we evaluated the distribution of SSD macro-sectors across the annotated news items. A uniform coverage of all the macro-sectors would correspond to roughly 100 annotations each. All macro-sectors were annotated at least once; 25% of them were annotated a sufficiently representative number of times, while 50% of them received less than half of the optimal number of annotations.

3.2 Document Categorization Performance Evaluation

A 10-fold cross validation was used to assess the performance of the categorization task. Fig. 4 shows the average precision, recall, and F-measure for classification according to the SSD areas. The categorization model achieves good performance in terms of precision.
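The per-area scores behind these results follow the standard definitions of precision, recall and F-measure; a self-contained sketch on toy labels (not the actual evaluation code):

```python
def prf(gold, pred, label):
    """Per-class precision, recall and F-measure from parallel label lists."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = ["Area 5", "Area 5", "Area 6", "Area 6", "Area 6"]
pred = ["Area 5", "Area 6", "Area 6", "Area 6", "Area 5"]
print(prf(gold, pred, "Area 6"))  # each score is 2/3 here
```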
Analyzing the average recall values, a more varied situation emerges. In particular, for the areas most represented in the reference dataset, a good balance between precision and recall is obtained. Conversely, for the areas least represented in the dataset, low recall values are obtained. This may depend on the training dataset not being large enough (e.g. insufficient annotations for a given area), or on the interchangeability of the area with other areas covering similar topics.

Fig. 2. Distribution of SSD areas with respect to annotated audiovisual clips.

This phenomenon is confirmed by the analysis of the confusion matrix shown in Fig. 5. The matrix rows show the expected areas; the matrix columns count the deduced areas (i.e. those generated automatically by the categorizer). The element in position (i,j) indicates the percentage of news items belonging to area i that were categorized as area j. The matrix diagonal shows the percentage of news items classified correctly for each area, while the off-diagonal elements contain misclassification errors. It can be noted that the matrix has its maximum on the diagonal for most of the considered areas (highlighted in green). Areas subject to many categorization errors are highlighted in red. For these areas, errors are mostly distributed along the matrix columns, confirming good precision together with lower recall. Some areas tend to be confused more than others (e.g., Medical Sciences with Biological Sciences and vice versa), in accordance with ambiguity that is either intrinsic (subject matter) or derived (annotation errors) for the same topics. Similarly, we evaluated the performance of the categorization of the SSD macro-sectors. Table 1 reports the top-5 classification scores. As expected, global performance is worse than that of the SSD area categorizer.
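A row-normalised confusion matrix of this kind, with rows as expected areas and columns as deduced areas, can be computed as follows (a sketch on toy labels, not the project's evaluation code):

```python
from collections import Counter

def confusion(gold, pred):
    """Return {(expected, deduced): fraction}; each expected row sums to 1."""
    pairs = Counter(zip(gold, pred))
    row_totals = Counter(gold)
    return {(g, p): n / row_totals[g] for (g, p), n in pairs.items()}

gold = ["Bio", "Bio", "Bio", "Med", "Med"]
pred = ["Bio", "Bio", "Med", "Med", "Bio"]
m = confusion(gold, pred)
# A third of the Bio items were deduced as Med; half of the Med items as Bio.
print(m[("Bio", "Med")], m[("Med", "Bio")])
```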
This might be attributable to different causes, including the small number of training examples for some macro-sectors, the higher number of categories (i.e. macro-sectors) to be recognized, and the higher degree of interchangeability among them.

Fig. 3. Co-occurrence (Area1, Area2) in the annotation dataset. In yellow, the most frequent combinations are highlighted.

Table 1. Top-5 Precision, Recall and F-measure for macro-sector categorization.

  Macro-sector                                                Precision  Recall  F-measure
  Astronomia, astrofisica, fisica della terra e dei pianeti   0.55       0.67    0.61
  Ingegneria energetica, termo-meccanica e nucleare           0.44       0.58    0.50
  Scienze archeologiche                                       0.49       0.47    0.48
  Ingegneria meccanica, aerospaziale e navale                 0.39       0.61    0.47
  Geoscienze                                                  0.37       0.59    0.46

3.3 Named Entity Recognition Performance Evaluation

The NER process was evaluated in terms of precision for each of the entity categories considered. Recall was omitted from the analysis because of the huge amount of time required to assess it (since it requires a fully annotated dataset). This omission is considered acceptable because, in our application scenario, recall is less significant. When searching large amounts of data, users typically obtain many matching documents and expect them not to be false positives, so as not to waste their time; this means that high precision is desirable. Low recall certainly excludes potentially interesting documents from the search, but this is less important for non-specific searches made on large indexes. The dataset of 6,600 TGR Leonardo news items was randomly split into four sets of equal size, each of them assigned to one annotator for evaluation. Precision was assessed manually, counting the occurrences of the following cases:

– Correct identification and classification. Entity correctly identified in the text and attributed to the right category.

– Correct identification but incorrect classification.
Entity correctly identified in the text but attributed to a wrong category.

– Wrong identification. Entity mistakenly identified.

The results obtained are shown in Table 2. For calculation purposes, the named entities found were considered statistically independent even when present with multiple occurrences in the same document; e.g. the word "Italia" found twice as a location in the same text counted as two correct occurrences and not just one. This method is justified by the fact that identification and classification are carried out on the basis of a context analysis limited to the surrounding words, and not by a simple match against a vocabulary. As a first consideration of the data analysis, it can be noted that the obtained precision values are aligned across the four datasets, confirming the substantial validity and concordance of the assessments of the four annotators. The categorization precision values are over 70% (i.e. more than double a random grading process characterized by a uniform probability distribution) for each of the test datasets. These are therefore good results, even more significant when compared with the state of the art [3], where a maximum precision of 65% is reported.

Fig. 4. SSD area classification model performance measurement.

Fig. 5. Confusion matrix for classifying SSD areas.

Table 2. Named Entity Recognition performance. LOC (Locations), PER (Persons), ORG (Organizations). For each of the four subsets of documents and for each named entity category, the table reports the precision of the detection process, the precision of the classification process (Category Precision), and the global precision.

            Detection Precision  Category Precision  Global Precision
  Dataset 1
    LOC     91.54                98.36               89.9
    PER     77.08                98.58               75.66
    ORG     72.25                98.20               70.45
  Dataset 2
    LOC     91.07                99.75               90.82
    PER     71.58                100                 71.58
    ORG     75.90                97.95               73.85
  Dataset 3
    LOC     88.90                98.89               87.78
    PER     70.98                98.85               69.83
    ORG     60.22                98.90               59.12
  Dataset 4
    LOC     91.44                99.38               90.82
    PER     73.10                99.63               72.73
    ORG     71.61                96.45               68.06

4 Conclusions

This paper described the results of work carried out by RAI, aimed at creating statistical models for automatic document categorization and named entity recognition, both operating in the educational field and in the Italian language. The taxonomy used for document categorization is the Scientific Disciplinary Sector taxonomy which, as part of the work, has been coded using SKOS to obtain a formal XML/OWL ontology. The reference ground truth for document categorization was created by annotating a dataset of automatic speech transcriptions derived from RAI's scientific newscasts. Named entity recognition models were generated to be suitable for automatically transcribed text without punctuation and capital letters. Document categorization performance was calculated in terms of precision, recall and F-measure. Named entity recognition was evaluated in terms of precision of both entity detection and entity classification, separately for locations, persons and organizations. The obtained results showed fairly good accuracy for SSD document classification and a significant improvement in categorization precision for named entities compared to the state of the art. The next step for RAI is to test the trained OpenNLP models both in the context of “La Città Educante” and for an internal archive search and retrieval experimental service. As an example, locations automatically extracted from speech transcriptions could be used to enrich archive metadata and provide auxiliary information for applications such as georeferencing, data mashups and map visualization.

Acknowledgments. This work was carried out within the project “La Città Educante” (CTN01 00034 393801) of the National Technological Cluster on Smart Communities cofunded by the Italian Ministry of Education, University and Research - MIUR.

References
1. Baraldi, L., Grana, C., Cucchiara, R.: Recognizing and presenting the storytelling video structure with deep multimodal networks. IEEE Trans. Multimedia 19(5), 955–968 (2017)
2. Baraldi, L., Grana, C., Messina, A., Cucchiara, R.: A browsing and retrieval system for broadcast videos using scene detection and automatic annotation. In: Proc. of the 2016 ACM on Multimedia Conference. pp. 733–734. MM '16 (2016)
3. Bartalesi Lenzi, V., Speranza, M., Sprugnoli, R.: Named entity recognition on transcribed broadcast news at EVALITA 2011. pp. 86–97 (2013)
4. Basu, S., Yu, Y., Zimmermann, R.: Fuzzy clustering of lecture videos based on topic modeling. In: Proc. of the 14th Intl. Workshop on Content-Based Multimedia Indexing. pp. 1–6 (2016)
5. Cardoso-Cachopo, A.: Improving methods for single-label text categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa (2007)
6. Chang, H., Kim, H., Li, S., Lee, J., Lee, D.: Comparative study on subject classification of academic videos using noisy transcripts. In: Proc. of the 4th IEEE Intl. Conf. on Semantic Computing. pp. 67–72 (2010)
7. Imran, A.S., Cheikh, F.A.: Blackboard content classification for lecture videos. In: 18th IEEE Intl. Conf. on Image Processing. pp. 2989–2992 (2011)
8. Jiang, H., Lu, Y., Xue, J.: Automatic soccer video event detection based on a deep neural network combined CNN and RNN. In: 28th IEEE Intl. Conf. on Tools with Artificial Intelligence. pp. 490–494 (2016)
9. Kapela, R., Swietlicka, A., Rybarczyk, A., Kolanowski, K., O'Connor, N.E.: Real-time event classification in field sport videos. Image Commun. 35(C), 35–45 (2015)
10. Baraldi, L., Grana, C., Cucchiara, R.: NeuralStory: an interactive multimedia system for video indexing and re-use. In: Proc. of the 15th Intl. Workshop on Content-Based Multimedia Indexing (2017)
11. Messina, A., Borgotallo, R., Dimino, G., Airola Gnota, D., Boch, L.: ANTS: A complete system for automatic news programme annotation based on multimodal analysis. In: 9th Intl. Workshop on Image Analysis for Multimedia Interactive Services. pp. 219–222 (2008)
12. Messina, A., Montagnuolo, M., Di Massa, R., Borgotallo, R.: Hyper media news: a fully automated platform for large scale analysis, production and distribution of multimodal news content. Multimedia Tools Appl. 63(2), 427–460 (2013)
13. Mocanu, B.C., Tapu, R., Zaharia, T.B.: Automatic segmentation of TV news into stories using visual and temporal information. In: Proc. of the 17th Intl. Conf. on Advanced Concepts for Intelligent Vision Systems. pp. 648–660 (2016)
14. Shih, H.: A survey on content-aware video analysis for sports. CoRR abs/1703.01170 (2017)
15. Xue, N., Xia, F., Chiou, F.D., Palmer, M.: The Penn Chinese Treebank: phrase structure annotation of a large corpus. Nat. Lang. Eng. 11(2), 207–238 (2005)
16. Yang, H., Oehlke, C., Meinel, C.: An automated analysis and indexing framework for lecture video portals. pp. 285–294 (2012)
17. Yang, H., Siebert, M., Lühne, P., Sack, H., Meinel, C.: Lecture video indexing and analysis using video OCR technology. In: 7th Intl. Conf. on Signal Image Technology and Internet Based Systems (SITIS 2011), Track Internet Based Computing and Systems (2011)
18. Zlitni, T., Bouaziz, B., Mahdi, W.: Automatic topics segmentation for TV news video using prior knowledge. Multimedia Tools Appl. 75(10), 5645–5672 (2016)