Genre tagging of videos based on information retrieval and semantic similarity using WordNet*

José M. Perea-Ortega, Arturo Montejo-Ráez, Manuel C. Díaz-Galiano and M. Teresa Martín-Valdivia
SINAI Research Group, Computer Science Department
University of Jaén
23071 - Jaén, Spain
{jmperea,amontejo,mcdiaz,maite}@ujaen.es

* This work has been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER), project TEXT-COOL 2.0 (TIN2009-13391-C04-02) from the Spanish Government, a grant from the Andalusian Government, project GeOasis (P08-TIC-41999) and the Geocaching Urbano research project (RFC/IEG2010).

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy

ABSTRACT
In this paper we propose a new approach to the genre tagging task for videos, using only their ASR transcripts and associated metadata. The approach is based on calculating the semantic similarity between the nouns detected in the video transcripts and a bag of nouns generated from WordNet for each category used to classify the videos. Specifically, we have used the Lin measure based on WordNet, which calculates the semantic distance between two synsets. Since WordNet is an English lexical resource, this approach has only been applied to the English test videos. As a base case, we have applied an information retrieval system as a classifier, using the generated bag of nouns for each category as index data and the ASR transcript of each test video as the query. Several experiments were submitted, one of them combining both approaches (information retrieval and semantic similarity). Our main conclusion is that this combination of semantic similarity and information retrieval improves on the results obtained with the information retrieval approach alone.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Indexing methods

Keywords
Genre video tagging, Video categorization, Automatic Speech Recognition

1. MOTIVATION AND RELATED WORK
Multimedia data are usually tagged with some relevant information in order to make retrieval easier. In fact, the efficient use of textual data associated with other types of information, such as images, can improve multimedia IR systems [1, 4]. However, the labelling provided with multimedia videos may not contain sufficient context for locating data of interest in a large database. Detailed annotation is required so that users can quickly locate clips of interest without having to go through entire databases.

The Genre Tagging task at MediaEval 2011 attempts to automatically generate genre labels to organize videos [2]. In this paper we present some experiments on automatic genre tagging of videos, making use of their Automatic Speech Recognition (ASR) transcripts and associated metadata. In recent years we have worked in the field of video categorization, participating in VideoCLEF [6, 7] and MediaEval 2010 [5].

2. DESCRIPTION OF THE TASK
In the Genre Tagging task of MediaEval 2011, participants are required to automatically assign thematic subject labels to videos using features derived from speech, metadata, audio or visual content. It is important to note that this is not a multilabel tagging task, so a given video can only be assigned one label. The data sets provided are the same as those used in the MediaEval 2010 Wild Wild Web Tagging Task (ME10WWW) [2]. The development and test data sets consisted of 247 and 1,727 videos, respectively. Of the test videos, 1,673 are in English, 16 in French, 25 in Spanish and 13 in Dutch. We have only worked with the English videos. The list of genre classes consisted of 25 tags, including a "default" category for those videos that do not fit any other class.

3. SYSTEM OVERVIEW
Our main approach is based on using an Information Retrieval (IR) system as a classifier. On the one hand, we have generated an XML document, or bag of words, for each proposed category, making use of an external lexical resource, WordNet (http://wordnet.princeton.edu). Specifically, we have included synonyms, hyponyms and domain terms related to the category. For example, for the "educational" category we have generated an XML document including terms such as instruction, teaching, pedagogy, didactics, training, etc. On the other hand, the preprocessed ASR transcripts (stemming and stop word removal) from the test videos have been used as queries, without any expansion. Finally, the Terrier IR system (http://terrier.org) has been used to obtain a measure of relatedness (RSV, Retrieval Status Value) between each video and the generated bags of words.
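The paper does not detail how these bags of words were extracted from WordNet. Below is a minimal sketch of one way to collect them, assuming Python with NLTK's WordNet interface (the authors do not name their toolkit); it covers only synonyms and hyponyms, and the function name and expansion depth are illustrative.

```python
# Sketch: build a bag of nouns for a genre category from WordNet,
# collecting the lemmas of the category's noun synsets plus their
# hyponyms. Requires a one-time nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def bag_of_nouns(category, max_depth=2):
    """Collect noun lemmas related to a category term: synonyms of
    its noun synsets plus hyponyms down to max_depth levels."""
    bag = set()

    def expand(synset, depth):
        for lemma in synset.lemmas():
            bag.add(lemma.name().replace('_', ' '))
        if depth < max_depth:
            for hypo in synset.hyponyms():
                expand(hypo, depth + 1)

    for synset in wn.synsets(category, pos=wn.NOUN):
        expand(synset, 0)
    return bag

# Example: the bag for the "educational" genre would gather terms
# such as "instruction", "teaching", "pedagogy", "didactics", ...
print(sorted(bag_of_nouns('education'))[:10])
```

The domain terms the authors also include could be gathered similarly, e.g. via the topic-domain pointers that recent NLTK versions expose for some synsets.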
As a second approach, we have used the formula proposed by Lin [3], which is based on WordNet, to measure the semantic similarity between the nouns detected in each test video and the bags of words generated for each category. Firstly, to calculate the semantic similarity between a video and the XML document generated for a category, we obtained the Lin semantic similarity between each pair of nouns from the two sets, accumulating the similarity scores that exceeded a threshold set at 0.75. With this threshold we tried to minimize the effect of the size of the ASR transcripts, since some videos contain more words than others. Secondly, the accumulated similarity score was divided by the number of words detected in the video, obtaining the final semantic similarity score. Videos with a final semantic similarity score of less than 0.25 were assigned to the default category.
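As a rough sketch of this scoring scheme, the fragment below uses NLTK's Lin similarity over noun synsets with the Brown information content file; both the choice of IC corpus and the word-to-synset strategy (taking the best-scoring synset pair) are assumptions, since the paper does not specify them.

```python
# Sketch of the per-category scoring described above (assumed NLTK
# setup; requires nltk.download('wordnet') and
# nltk.download('wordnet_ic')).
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # assumed IC corpus

def lin_sim(noun_a, noun_b):
    """Best Lin similarity over the noun synsets of two words."""
    best = 0.0
    for s1 in wn.synsets(noun_a, pos=wn.NOUN):
        for s2 in wn.synsets(noun_b, pos=wn.NOUN):
            try:
                best = max(best, s1.lin_similarity(s2, brown_ic))
            except Exception:
                pass  # some synset pairs have no usable IC value
    return best

def category_score(video_nouns, category_bag, pair_threshold=0.75):
    """Accumulate pairwise similarities above the threshold, then
    normalize by the number of nouns detected in the video."""
    total = 0.0
    for vn in video_nouns:
        for cn in category_bag:
            sim = lin_sim(vn, cn)
            if sim >= pair_threshold:
                total += sim
    return total / len(video_nouns) if video_nouns else 0.0

# A video whose best category_score stays below 0.25 would fall
# back to the "default" category.
```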
4. EVALUATION OF RESULTS
Several experiments were carried out under both approaches. As a baseline, we considered the use of the preprocessed ASR transcripts from the test videos as queries (experiment IR ASR). Then, we tried to evaluate the addition of the provided metadata, carrying out an expansion of the ASR transcripts using such metadata (experiments IR ASR+MD and IR ASR+MD+TAGS). Regarding the second approach, we submitted the experiment SIMSEM-ASR, in which we only calculate the semantic similarity between each video and each category (its bag of words), without using the IR approach. Finally, we combined the IR and semantic similarity approaches (experiment SIMSEM+IR-ASR), merging both lists of results: first, we normalized the RSV scores from the baseline; then, for each test video, we added its normalized RSV and semantic similarity scores. A sketch of this merging step is shown below.
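The paper does not specify the normalization scheme, so the sketch below assumes min-max normalization of the RSV scores and a plain sum with the semantic similarity scores; the dictionary-based data layout is also hypothetical.

```python
# Sketch of the run-merging step (assumed data layout: dicts keyed
# by (video_id, category); min-max normalization is an assumption,
# the paper only says the RSV scores were "normalized").

def combine_runs(rsv_scores, sem_scores):
    """Return normalized RSV + semantic similarity per (video, category)."""
    lo, hi = min(rsv_scores.values()), max(rsv_scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero
    combined = {}
    for key in set(rsv_scores) | set(sem_scores):
        norm_rsv = (rsv_scores.get(key, lo) - lo) / span
        combined[key] = norm_rsv + sem_scores.get(key, 0.0)
    return combined

# Since the task is single-label, each video is finally tagged with
# the category that maximizes the combined score.
```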
The results obtained are shown in Table 1, using the Mean Average Precision (MAP) measure. We also show the MAP obtained considering only the English test videos.

  Run name          MAP (official)   MAP (English)
  IR ASR            0.1031           0.1044
  IR ASR+MD         0.1073           0.1088
  IR ASR+MD+TAGS    0.1115           0.1129
  SIMSEM-ASR        0.0547           0.0559
  SIMSEM+IR-ASR     0.1266           0.1288

Table 1: Experiments and results obtained by SINAI in the MediaEval 2011 Genre Tagging task.

Analyzing the official results, we can observe that expanding the ASR transcripts with the provided metadata improves the results obtained without metadata (+4% and +8.15% for the experiments IR ASR+MD and IR ASR+MD+TAGS, respectively), as expected. On the other hand, the combination of the semantic similarity and IR approaches is promising, since it improves the MAP obtained by the baseline using the IR approach alone (+22.79%). According to the test groundtruth file provided by the MediaEval organizers, 185 of the 1,673 English videos (11.06%) belong to the default category, while our best experiment assigned only 18 videos (1.08%) to that category. This was caused by the low threshold used to assign a video to the default category (0.25), which allowed videos to be classified into categories to which they did not really belong, despite their low similarity scores. Nevertheless, for some categories (art, politics, religion and sports) we obtained good results, achieving high individual MAP scores (e.g. 0.6176 for the politics category). This is because such categories are more general concepts or genres than others (business, comedy, documentary, etc.), so it was easier to find more semantically related nouns, increasing the size of the XML document generated for those categories and, therefore, the probability of success.

5. CONCLUSIONS
In this paper we propose the use of WordNet-based semantic similarity combined with an IR approach to solve the genre tagging task for videos. Because our research field of interest is Natural Language Processing (NLP), we have only worked with the ASR transcripts of the videos and their metadata. We have shown that combining the semantic similarity score with the RSV score obtained from the IR approach yields a significant improvement. Nevertheless, it seems clear that working only with the ASR transcripts generally gives poor results. For future work, we will study other resources in order to increase the size of the bag of words generated for each category, adding more terms semantically related to those categories.

6. REFERENCES
[1] Bozzon, A., and Fraternali, P. Multimedia and multimodal information retrieval. In SeCO Workshop (2009), S. Ceri and M. Brambilla, Eds., vol. 5950 of Lecture Notes in Computer Science, Springer, pp. 135–155.
[2] Larson, M., Eskevich, M., Ordelman, R., Kofler, C., Schmiedeke, S., and Jones, G. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop (Pisa, Italy, September 1-2, 2011).
[3] Lin, D. An information-theoretic definition of similarity. In Proc. of the 15th Int'l Conf. on Machine Learning (1998), pp. 296–304.
[4] Martín-Valdivia, M. T., Díaz-Galiano, M. C., Montejo-Ráez, A., and Ureña-López, L. A. Using Information Gain to Improve Multimodal Information Retrieval Systems. Information Processing & Management 44 (2008), 1146–1158.
[5] Perea-Ortega, J. M., Montejo-Ráez, A., Díaz-Galiano, M. C., and Martín-Valdivia, M. T. SINAI at Tagging Task Professional in MediaEval 2010. In Working Notes Proceedings of the MediaEval 2010 Workshop, Pisa, Italy, October 24, 2010 (2010).
[6] Perea-Ortega, J. M., Montejo-Ráez, A., Díaz-Galiano, M. C., Martín-Valdivia, M. T., and Ureña-López, L. A. Using an information retrieval system for video classification. In Evaluating Systems for Multilingual and Multimodal Information Access (2009), vol. 5706 of Lecture Notes in Computer Science, Springer, pp. 927–930.
[7] Perea-Ortega, J. M., Montejo-Ráez, A., Martín-Valdivia, M. T., and Ureña-López, L. A. Using support vector machines as learning algorithm for video categorization. In CLEF, Part II (2010), vol. 6242 of Lecture Notes in Computer Science, Springer, in press.