=Paper=
{{Paper
|id=None
|storemode=property
|title=LIA @ MediaEval 2011: Compact representation of heterogeneous descriptors for video genre classification
|pdfUrl=https://ceur-ws.org/Vol-807/Rouvier_LIA_Genre_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/RouvierL11
}}
==LIA @ MediaEval 2011: Compact representation of heterogeneous descriptors for video genre classification==
LIA @ MediaEval 2011: Compact Representation of Heterogeneous Descriptors for Video Genre Classification

Mickael Rouvier and Georges Linarès, LIA - University of Avignon, Avignon, France. mickael.rouvier@univ-avignon.fr, georges.linares@univ-avignon.fr

===ABSTRACT===
This paper describes our participation in the Genre Tagging Task @ MediaEval 2011, which aims at predicting the genre of Internet videos. We propose a method that extracts a low-dimensional feature space based on text, audio and video information. In the best configuration, our system yields a 0.56 MAP (Mean Average Precision) on the test corpus.

===Categories and Subject Descriptors===
H.3.1 [Information Search and Retrieval]: Content Analysis and Indexing—Indexing methods

===General Terms===
Algorithms, Measurement, Performance, Experimentation

===1. INTRODUCTION===
The Genre Tagging Task held as part of MediaEval 2011 required participants to automatically predict tags of Internet videos. The task consists in associating each video with one and only one of the 26 provided genres [3]. The videos are provided with additional information such as metadata (title and description), speech recognition transcripts [2] and social network information (gathered from Twitter).

One of the main difficulties of video genre categorization lies in the diversity of the genre-dependent information sources (spoken contents, video structure, audio and video patterns, etc.) and in the variability of the video classes. In order to deal with this problem, we propose to combine various features from the audio and video channels and to reduce the resulting large-dimensional input data to a low-dimensional feature vector while retaining most of the relevant information (reducing redundancies and minimising useless information).

The paper is organized as follows. Section 2 explains how we collect our training data. Section 3 describes our system. Section 4 presents the features used by our system and Section 5 summarizes the results.

===2. CORPUS===
This task is seen as a classification problem, and we follow a lightly supervised approach: the training dataset is collected from the Web. A simple and effective way to obtain a corpus is to download the documents returned by a web search engine for suitable queries. For example, for the Health genre we download all the videos returned by Youtube for the query "health". However, using only the genre name as the query does not yield a training set that represents the class variability. We therefore expand each query with other terms revolving around the genre. For example, for the Religion genre, the query needs to be expanded to related terms such as ethnic, belief, freedom, practice, etc. In order to find terms closely related to a genre, we use Latent Dirichlet Allocation (LDA). LDA is an unsupervised word clustering method that relies on word co-occurrence analysis. A 5000-class LDA model was estimated on the Gigaword corpus, each cluster being composed of its 10 best words. Queries are expanded by adding all words from the best cluster containing the genre tag.

We propose the use of different information sources extracted from the audio, speech and video channels. Consequently, our training set should include all these sources, especially the transcription of spoken contents. ASR performance on Web data is usually high; however, the ASR system used on the test set is not freely distributed. To overcome this problem, we collect text material by downloading web pages from the Internet. Our training corpus consists of web pages collected from Google (60 documents per class, 1560 documents) and videos collected from Youtube and Dailymotion (120 documents per class, 3120 documents). There are more videos than web pages because of technical restrictions imposed by Google. The collected documents (web pages and videos) are in English only.
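A minimal Python sketch of this query-expansion step (assuming gensim for the LDA model and a small tokenized background corpus standing in for Gigaword; the topic count and function names are illustrative, not the authors' code):

<pre>
# Hedged sketch of the LDA-based query expansion described in Section 2.
# Assumptions: gensim as the LDA implementation and `background_docs`,
# an iterable of tokenized documents standing in for Gigaword.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_lda(background_docs, num_topics=100):
    """Estimate an LDA model on a background corpus (the paper uses 5000 topics)."""
    dictionary = Dictionary(background_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in background_docs]
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=num_topics)
    return lda, dictionary

def expand_query(genre_tag, lda, dictionary, topn=10):
    """Return the genre tag plus the best words of its strongest topic ("cluster")."""
    if genre_tag not in dictionary.token2id:
        return [genre_tag]
    word_id = dictionary.token2id[genre_tag]
    # Topics in which the genre tag carries the most weight.
    topics = lda.get_term_topics(word_id, minimum_probability=0.0)
    if not topics:
        return [genre_tag]
    best_topic = max(topics, key=lambda t: t[1])[0]
    expansion = [dictionary[wid] for wid, _ in lda.get_topic_terms(best_topic, topn=topn)]
    return [genre_tag] + expansion

# Example: expand_query("religion", lda, dictionary) might add terms such as
# "belief" or "practice" when the background corpus supports them.
</pre>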
===3. SYSTEM DESCRIPTION===
The proposed system has a two-level architecture. The first level extracts low-dimensional features from speech, audio and video; each feature vector is then given to an SVM (Support Vector Machine) classifier. The second level combines the scores of the three SVM models. This combination is achieved by linear interpolation, whose coefficients are determined on the development corpus.

===4. FEATURES===

====4.1 Text====
Most linguistic-level methods for video genre classification rely on extracting relevant words from the available video metadata (closed captions, tags, etc.) after removing stopwords. Our system extracts relevant keywords from the documents using the TF-IDF metric. Words with a high TF-IDF value are generally meaningful, topic-bearing words. Thus, we construct a feature vector from the n (n = 600 in our experiments) most frequent words in the documents of the training corpus.

====4.2 Audio====
One of the most popular approaches to genre identification by acoustic space characterization relies on MFCC (Mel Frequency Cepstral Coefficients) analysis and GMM (Gaussian Mixture Model) or SVM classifiers. However, audio features include both useful and useless information. Unfortunately, separating useful from useless information is a heavy process, and the useless space contains some information that can be used to distinguish between genres. For this reason, [1] proposed a single space (the total variability space) that models both variabilities. In this new space, a given audio utterance is represented by a new vector named the total factors (we also refer to this vector as the i-vector), which reduces redundancies and enhances useful information.

MFCC acoustic frames are computed every 10 ms over a 20 ms Hamming window. MFCC vectors are composed of 12 coefficients, the energy, and the first- and second-order derivatives of these 13 features. For these experiments the UBM is composed of 512 Gaussians and the i-vector is a 400-dimension vector.

====4.3 Video====
We used color-based features, such as the Color Structure Descriptor and the Dominant Color Descriptor, and texture-based features, such as the Homogeneous Texture Descriptor (HTD) and the Edge Histogram Descriptor. On this task, texture turned out to be the best feature, especially the HTD.

HTD is an efficient feature not only for computing texture features but also for representing texture information: it provides a quantitative characterization of homogeneous texture regions for similarity retrieval. HTD consists of the mean and standard deviation of the image, and the energy and energy deviation values of the Fourier transform of the image.

Similarly to the audio feature processing, we extract an i-vector for each video feature. For these experiments the UBM is composed of 128 Gaussians and the i-vector is a 50-dimension vector.
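Returning to the acoustic front-end of Section 4.2, a minimal Python sketch (librosa is an assumption, the paper does not name its front-end toolkit; c0 stands in for the energy term and the UBM/i-vector stage is not shown):

<pre>
# Hedged sketch of the Section 4.2 front-end: 12 MFCCs plus an energy-like
# coefficient with first- and second-order derivatives (39 dims per frame),
# computed every 10 ms over a 20 ms Hamming window.
import librosa
import numpy as np

def mfcc_39(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    # n_mfcc=13 keeps c0 (an energy-like term) plus 12 cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512,
                                win_length=int(0.020 * sr),   # 20 ms window
                                hop_length=int(0.010 * sr),   # 10 ms shift
                                window="hamming")
    delta = librosa.feature.delta(mfcc, order=1)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2])  # shape: (39, n_frames)
</pre>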
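The two-level architecture of Section 3, applied to the per-modality features above, can be sketched as follows (assuming scikit-learn; the feature matrices and interpolation weights are illustrative, since the paper only states that the coefficients are tuned on the development corpus):

<pre>
# Hedged sketch of the two-level system in Section 3: one SVM per modality,
# then a linear interpolation of the per-genre scores. Feature matrices and
# interpolation weights are illustrative assumptions, not values from the paper.
from sklearn.svm import SVC

def train_first_level(features_by_modality, labels):
    """Train one probabilistic SVM per modality (text, audio, video)."""
    return {name: SVC(kernel="linear", probability=True).fit(X, labels)
            for name, X in features_by_modality.items()}

def fuse_scores(models, features_by_modality, weights):
    """Second level: linear interpolation of the per-genre SVM scores."""
    fused = None
    for name, model in models.items():
        scores = model.predict_proba(features_by_modality[name])  # (n_videos, n_genres)
        fused = weights[name] * scores if fused is None else fused + weights[name] * scores
    return fused.argmax(axis=1)  # predicted genre index per video

# Usage sketch: weights would be tuned on the development corpus, e.g.
# weights = {"text": 0.5, "audio": 0.25, "video": 0.25}
# preds = fuse_scores(models, dev_features, weights)
</pre>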
===5. RESULTS===
We submitted five runs for the Genre Tagging Task, combining the features presented in the sections above. In detail, the configuration of each run was as follows:

Run 1: We use only the text feature, built on the speech transcript.

Run 2: We use text, audio and video features. The text feature is built on the speech transcript and the description of the video given in the metadata.

Run 3: We use text, audio and video features. The text feature is built on the speech transcript, the description of the video given in the metadata, and the tags.

Run 4: In the previous runs, the SVM classifiers were learned on the features extracted from the training corpus. Since the training corpus was collected with a lightly supervised method, some of the videos may be incorrectly tagged. To improve performance, we integrate the development corpus, provided by MediaEval and manually checked for genre, into the training corpus. This run uses exactly the same features as run 3, but the SVM classifiers are learned on the features extracted from the training and development corpora.

Run 5: In the development corpus, we observed that the username of the uploader (present in the metadata) can give interesting information for predicting the video genre. Indeed, a user often uploads multiple videos of the same genre; for example, the users Anglicantv or Aabbey1 often upload videos of the genre Religion. Here, we use the development set as a knowledge base in which the favorite genre of each user is known. For each video, we check whether the username is present in the development corpus and increase the score of the genre for which that user uploaded videos. We thus boost the scores of run 4 according to this new information. A post-campaign experiment showed that, using only this information, the system reaches 51% MAP.

{| class="wikitable"
|+ Table 1: Results of the submitted runs
! Team !! Run !! Id !! MAP
|-
| LIA || 1 || run1 || 0.1179
|-
| LIA || 2 || run2 || 0.17
|-
| LIA || 3 || run3 || 0.1828
|-
| LIA || 4 || run4 || 0.1964
|-
| LIA || 5 || run5 || 0.5626
|}

From run 2, we observe that the audio and video features provide useful information for predicting the video genre. Runs 2, 3 and 4 achieved similar performance, which means that the different configurations did not strongly contribute to the global results. The use of the owner id strongly improves the results.

===6. CONCLUSION===
We have described an approach based on audio, video and text (transcription and metadata) features for video genre classification. According to the results, the username appears to be a simple and highly effective piece of information for predicting the video genre.

===7. REFERENCES===
[1] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, P. Ouellet, and P. Dumouchel. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In INTERSPEECH, pages 1559–1562, 2009.

[2] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI broadcast news transcription system. Speech Communication, 37(1-2):89–108, 2002.

[3] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
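As a closing illustration of the owner-id boosting used in run 5, a minimal Python sketch (the bonus value and the username-to-genre lookup are illustrative assumptions; the paper only states that the score of the matching genre is increased):

<pre>
# Hedged sketch of the run-5 owner-id boosting: if the uploader appears in the
# development set, raise the scores of the genres that user uploaded there.
# The bonus value and data layout are illustrative assumptions.
from collections import defaultdict

def build_user_genres(dev_corpus):
    """dev_corpus: iterable of (username, genre) pairs from the development set."""
    user_genres = defaultdict(set)
    for username, genre in dev_corpus:
        user_genres[username].add(genre)
    return user_genres

def boost_scores(scores, username, user_genres, bonus=0.3):
    """scores: dict genre -> run-4 score for one test video."""
    boosted = dict(scores)
    for genre in user_genres.get(username, ()):
        boosted[genre] = boosted.get(genre, 0.0) + bonus
    return boosted

# Usage sketch:
# user_genres = build_user_genres([("Anglicantv", "Religion"), ("Aabbey1", "Religion")])
# boost_scores({"Religion": 0.4, "Health": 0.5}, "Anglicantv", user_genres)
# -> {"Religion": 0.7, "Health": 0.5}
</pre>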