=Paper=
{{Paper
|id=None
|storemode=property
|title=Audio-Visual content description for video genre classification in the context of social media
|pdfUrl=https://ceur-ws.org/Vol-807/Ionescu_RAF_Genre_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/IonescuSVL11
}}
==Audio-Visual content description for video genre classification in the context of social media==
Audio-Visual Content Description for Video Genre Classification in the Context of Social Media

Bogdan Ionescu (1,3), Klaus Seyerlehner (2), Constantin Vertan (1), Patrick Lambert (3)

(1) LAPI - University Politehnica of Bucharest, 061071 Bucharest, Romania, {bionescu,cvertan}@alpha.imag.pub.ro
(2) DCP - Johannes Kepler University, A-4040 Linz, Austria, klaus.seyerlehner@gmail.com
(3) LISTIC - Polytech Annecy-Chambery, B.P. 80439, 74944 France, patrick.lambert@univ-savoie.fr

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

ABSTRACT

In this paper we address automatic video genre classification with descriptors extracted from both the audio (block-based features) and the visual (color- and temporal-based) modalities. Tests performed on 26 genres from the blip.tv media platform demonstrate the potential of these descriptors for this task.

Categories and Subject Descriptors

I.2.10 [Artificial Intelligence]: Vision and Scene Understanding - audio, color and action descriptors; I.5.3 [Pattern Recognition]: Clustering - video genre.

Keywords

Block-based audio features, color perception, action content, video genre classification.

1. INTRODUCTION

In this paper we address the issue of automatic video genre classification in the context of social media platforms, as part of the MediaEval 2011 Benchmarking Initiative for Multimedia Evaluation (see http://www.multimediaeval.org/). The challenge is to provide solutions for distinguishing between up to 26 common genres, such as "art", "autos", "business", "comedy", "food and drink", "gaming", and so on [2]. Validation is carried out on video footage from the blip.tv media platform (see http://blip.tv/).

We approach this task globally, from the classification point of view, and focus on the feature extraction step (for a state-of-the-art survey of the literature see [1]). In our approach, we extract information from both the audio and the visual modalities. While these sources of information have already been exploited for genre classification, the novelty of our approach lies in the content descriptors we use.

2. VIDEO CONTENT DESCRIPTION

Audio descriptors. Most common video genres tend to have very specific audio signatures, e.g. music clips contain music, sports footage contains the specific crowd noise, etc. To address this specificity, we propose audio descriptors related to rhythm, timbre, onset strength, noisiness and vocal aspects. The proposed audio features are block-level based, which compared to classic approaches have the advantage of capturing local temporal information by analyzing sequences of consecutive frames in a time-frequency representation. Audio information is described with parameters such as: spectral pattern (characterizes the soundtrack's timbre), delta spectral pattern (captures the strength of onsets), variance delta spectral pattern (captures the variation of the onset strength over time), logarithmic fluctuation pattern (captures the rhythmic aspects), spectral contrast pattern (estimates "tone-ness") and correlation pattern (captures the temporal relation of loudness changes over different frequency bands). For more information see [3].

Temporal descriptors. The genre specificity is also reflected at the temporal level, e.g. music clips tend to have a high visual tempo, documentaries have a reduced action content, etc. To address these aspects we detect sharp transitions (cuts) and two of the most frequent gradual transitions (fades and dissolves). Based on this information, we assess rhythm as the movie's average shot change speed computed over 5 s time windows (which provides information about the movie's changing tempo), and action in terms of a high action ratio (e.g. fast changes, fast motion, visual effects, etc.) and a low action ratio (the occurrence of static scenes). The action level is determined based on user ground truth [4].

Color descriptors. Finally, many genres have specific color palettes, e.g. sports tend to have predominant hues, indoor scenes have different lighting conditions than outdoor scenes, etc. We assess color perception by projecting colors onto a color naming system (associating names with colors allows everyone to create a mental image of a given color or color mixture). We compute a global weighted color histogram (the movie's color distribution), an elementary color histogram (the distribution of basic hues), light/dark, saturated/weakly-saturated and warm/cold color ratios, color variation (the amount of different colors in the movie), color diversity (the amount of different hues) and adjacency/complementarity color ratios. For more information on the visual descriptors see [4].

3. EXPERIMENTAL RESULTS

Results on development data. A first validation was performed on the provided development data set (247 sequences), which was eventually extended to 648 sequences in order to provide a consistent training data set for classification (source: blip.tv; these sequences are different from the ones proposed for the official runs).
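To make the rhythm descriptor from Section 2 concrete, the following is a minimal sketch (not the authors' code) of computing shot-change counts over 5 s windows from a list of detected shot boundaries; the frame rate, helper names and example cut positions are illustrative assumptions only:

```python
# Illustrative sketch: rhythm as shot-change speed over 5-second windows.
# Assumes shot boundaries (cuts/fades/dissolves) were already detected;
# fps, window length and the example values below are hypothetical.

def rhythm_profile(boundary_frames, total_frames, fps=25.0, window_s=5.0):
    """Return the number of shot changes in each 5 s window."""
    window = int(window_s * fps)               # window length in frames
    n_windows = max(1, total_frames // window)
    counts = [0] * n_windows
    for b in boundary_frames:
        idx = min(b // window, n_windows - 1)  # clamp trailing boundaries
        counts[idx] += 1
    return counts

def average_rhythm(counts):
    """Movie-level rhythm: mean shot-change count over all windows."""
    return sum(counts) / len(counts)

# Example: a 1-minute clip at 25 fps with cuts detected at these frames.
cuts = [40, 130, 260, 410, 700, 720, 745, 1100, 1350]
profile = rhythm_profile(cuts, total_frames=1500)
print(average_rhythm(profile))  # → 0.75
```

A high average (many cuts per window) would point toward fast-paced genres such as music clips, while a low one suggests static content such as documentaries, in line with the observations above.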
We observed that, in the case of some of the proposed genres, the genre-specific content is captured mainly by the textual information. Therefore, our tests focused mainly on genres with specific audio-visual contents, like "art", "food", "cars", "sports", etc. (for which we provided a representative number of examples).

Tests were performed using a cross-validation approach. We use p% of the existing sequences for training (randomly selected and uniformly distributed with respect to genre) and the remainder for testing. Experiments were repeated for different combinations of training and test sets (e.g. 1000 repetitions). Figure 1 presents the average Fscore = 2 * P * R / (P + R) (where P and R are the average precision and recall, respectively, over all repetitions) for p = 50%, the audio-color-action descriptor set (i.e. the descriptor set which provided the most accurate results) and various classification approaches (see Weka at http://www.cs.waikato.ac.nz/ml/weka/). The numbers in brackets represent the number of test sequences used for each genre.

Figure 1: Average F-score achieved using all audio-visual descriptors, and genre classification "performance" for the best run, i.e. SVM with a linear kernel (graph on top). (The compared classifiers include BayesNet, DecisionTable, FT, HyperPipes, J48, NNge, NaiveBayes, RBFNetwork, RandomForest, RandomTree, Ridor, SVM with linear kernel, VFI and kNN.)

From a global point of view, the best results are obtained with SVM and a linear kernel, followed by k-NN (k = 3) and FT (Functional Trees). At genre level, the best accuracy is obtained for genres with particular audio-visual signatures. The graph on top of Figure 1 presents a measure of the individual genre classification "performance", computed as the Fscore times the number of test sequences used; an Fscore obtained for a greater number of sequences is more representative than one obtained for only a few (values are normalized to 1 for visualization purposes). The proposed descriptors provided good discriminative power for genres like (the number in brackets is the Fscore): "food and drink" (0.757), "web development and sites" (0.697), "travel" (0.633) and "politics" (0.552), while at the bottom end are genres whose contents are less reflected in the audio-visual information, e.g. "citizen journalism", "business", "comedy" (see Figure 1).

Results on test data. For the final official runs, classification was performed on 1727 sequences, with training performed on the previous data set (648 sequences). The overall results obtained in terms of MAP (Mean Average Precision) are less accurate than the previous ones: 0.077 for k-NN on audio-color-action, 0.027 for RandomForest on audio-color-action, 0.121 for SVM linear on audio-color-action (best run), 0.103 for SVM linear on audio and 0.038 for SVM linear on color-action. This is mainly due to the limited training data set compared to the diversity of the test sequences, and to the inclusion of the genres for which we obtain 0 precision (i.e. for which audio-visual information is not discriminant, see Figure 1). Because MAP provides only an overall average precision over all genres, we are unable to conclude which genres are better suited to be retrieved with audio-visual information and which are not.

4. CONCLUSIONS AND FUTURE WORK

The proposed descriptors performed well for some of the genres; however, to improve the classification performance a more consistent training database is required. Also, our approach is more suitable for classifying genre patterns from a global point of view, like episodes of a series, being unable to detect genre-related content within a sequence. Future tests will consist of performing cross-validation on all 2375 sequences (development + test sets).

5. ACKNOWLEDGMENTS

Part of this work has been supported under the Financial Agreement EXCEL POSDRU/89/1.5/S/62557.

6. REFERENCES

[1] D. Brezeale, D.J. Cook, "Automatic Video Classification: A Survey of the Literature," IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(3), pp. 416-430, 2008.
[2] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, G.J.F. Jones, "Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task," MediaEval 2011 Workshop, Pisa, Italy, 2011.
[3] K. Seyerlehner, M. Schedl, T. Pohle, P. Knees, "Using Block-Level Features for Genre Classification, Tag Classification and Music Similarity Estimation," MIREX-10, Utrecht, Netherlands, 2010.
[4] B. Ionescu, C. Rasche, C. Vertan, P. Lambert, "A Contour-Color-Action Approach to Automatic Classification of Several Common Video Genres," AMR (LNCS 6817), Linz, Austria, 2010.
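As a supplementary illustration, the evaluation protocol of Section 3 (repeated random p% training splits, with the average Fscore = 2 * P * R / (P + R)) can be sketched as follows. This is a simplified stand-in, not the paper's pipeline: the splits here are plain random rather than stratified per genre, and the classifier callback and toy labels are placeholders:

```python
import random

def f_score(precision, recall):
    """Fscore = 2*P*R / (P + R), as used in Section 3."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def repeated_split_eval(samples, labels, train_and_predict,
                        p=0.5, repetitions=10, seed=0):
    """Average F-score over repeated random train/test splits.

    `train_and_predict(train, test)` is a placeholder for any classifier:
    it gets (feature, label) training pairs and a list of test features,
    and returns predicted labels for the test features.
    Note: the paper stratifies splits per genre; this sketch does not.
    """
    rng = random.Random(seed)
    data = list(zip(samples, labels))
    n_train = int(p * len(data))
    scores = []
    for _ in range(repetitions):
        rng.shuffle(data)
        train, test = data[:n_train], data[n_train:]
        preds = train_and_predict(train, [x for x, _ in test])
        truth = [y for _, y in test]
        correct = sum(1 for a, b in zip(preds, truth) if a == b)
        # Micro-averaged P and R coincide with accuracy in the
        # single-label, every-item-predicted case sketched here.
        precision = recall = correct / len(truth)
        scores.append(f_score(precision, recall))
    return sum(scores) / len(scores)

# Toy usage with a degenerate constant classifier (illustration only).
const = lambda train, test: ["music"] * len(test)
print(repeated_split_eval(list(range(8)), ["music"] * 8, const,
                          p=0.5, repetitions=3))  # → 1.0
```

With per-genre averaging of P and R over many repetitions, as in the paper, the same F-score formula yields the per-genre values reported in Section 3 (e.g. 0.757 for "food and drink").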