=Paper=
{{Paper
|id=None
|storemode=property
|title=KIT at MediaEval 2011 - Content-based genre classification on web-videos
|pdfUrl=https://ceur-ws.org/Vol-807/Semela_KIT_Genre_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SemelaE11
}}
==KIT at MediaEval 2011 - Content-based genre classification on web-videos==
Tomas Semela and Hazım Kemal Ekenel
Institute for Anthropomatics, Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe, Germany
tomas.semela@student.kit.edu, ekenel@kit.edu

ABSTRACT

In this paper, we run our content-based video genre classification system on the MediaEval evaluation corpus. Our system is based on several low-level audio-visual cues, as well as cognitive and structural information. The purpose of this evaluation is to assess our content-based system's performance on the diversified content of the blip.tv web-video corpus, which is described in detail in [5].

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

Keywords

Genre classification, content-based features

1. MOTIVATION

Automatic genre classification is an important task in multimedia indexing, and several studies have been conducted on this topic. A comprehensive overview of studies on TV genre classification can be found in [6]. Recently, there has also been increasing interest in web-video genre classification [9], also as part of the ACM Multimedia Grand Challenge. In this study, we evaluate our content-based system, which is based on low-level audio-visual features, on the MediaEval corpus. The features used in the system correspond to low-level color and texture cues, as well as shot boundary and face detection outputs. We have used these features before for detecting high-level features in videos [2] and for successfully classifying various TV content into genres [3]. In the following sections we give a brief overview of our system; for details please refer to [3].

2. CONTENT-BASED FEATURES

2.1 Cognitive and Structural Features

Cognitive and structural features are proposed in [6]. Cognitive features are derived using a face detector and comprise the average number of faces per frame, the distribution of the number of faces per frame, and the distribution of face locations in the frame. The structural feature is derived using a shot boundary detector and contains the average shot duration and the distribution of shot lengths. The cognitive features and the visual features presented below are extracted from 5 linearly distributed frames per shot.

2.2 Aural Features

To benefit from the audio information in each clip, we compute four features from the audio signal. All features are extracted from mono-channel audio with a 16 kHz sample rate and a 256 kbit/s bit rate. The features include MFCC, Zero Crossing Rate and Signal Energy, and are used in different representations.

2.3 Low-level Visual Features

We use six different low-level visual features, which represent the color and texture information in the video.

2.3.1 Color descriptors

Histogram: We use the HSV color space and build a histogram with 162 bins [8].

Color moments: We use a grid size of 5×5. The first three order color moments are calculated in each local block of the image, using the Lab color space [7] (a sketch follows below).

Autocorrelogram: The autocorrelogram captures the spatial correlation between identical colors. 64 quantized color bins and five distances are used [4].
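To make the color-moments descriptor concrete, the sketch below computes the first three moments (mean, standard deviation, and a skewness term) per Lab channel over a 5×5 block grid. This is a minimal illustration under our own assumptions (OpenCV for the Lab conversion, a signed cube root for the third moment); it is not the authors' implementation.

```python
import cv2
import numpy as np

def color_moments(image_bgr, grid=5):
    """First three color moments (mean, std, skewness term) per Lab
    channel over a grid x grid block layout.
    Illustrative sketch only; the paper's exact implementation may differ."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2Lab).astype(np.float64)
    h, w = lab.shape[:2]
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = lab[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            for c in range(3):
                vals = block[:, :, c].ravel()
                mean = vals.mean()
                std = vals.std()
                # Signed cube root of the third central moment
                skew = np.cbrt(((vals - mean) ** 3).mean())
                feats.extend([mean, std, skew])
    # 5x5 blocks x 3 channels x 3 moments = 225 dimensions
    return np.array(feats)
```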
2.3.2 Texture descriptors

Co-occurrence texture: As proposed in [1], five types of features are extracted from the gray-level co-occurrence matrix (GLCM): Entropy, Energy, Contrast, Correlation and Local homogeneity.

Wavelet texture grid: We calculate the variances of the high-frequency sub-bands of the wavelet transform of each grid region. We perform a 4-level analysis on a grid with 4×4 = 16 regions. The Haar wavelet is employed, as in [1].

Edge histogram: For the edge histogram, 5 filters as proposed in the MPEG-7 standard are used to extract the kind of edge in each region of 2×2 pixels. Those small regions are then grouped into a number of areas (4 rows × 4 columns in our case), and the number of edges matched by each filter (vertical, horizontal, diagonal 45°, diagonal 135° and non-directional) is counted in the region's histogram.

3. CLASSIFICATION

Classification is performed using multiple SVM classifiers. As can be seen in Fig. 1, content-based features are extracted from each video and used as input to separate SVMs, one for each genre and feature. The classification outputs of the SVMs are summed over all features for each genre, and a genre is picked via majority voting, as sketched below.

Figure 1: System Overview. Aural features (F0, MFCC, SP, ZCR) and visual features (AC, CM, CoOc, HSV, Edge, Wavelet, Cog, Struct) are extracted from the input video and fed to per-genre, per-feature SVM models; the model outputs are summed per genre and combined by majority voting.
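A minimal sketch of this late-fusion scheme follows, assuming precomputed per-feature descriptors and scikit-learn's SVC. The class name, RBF kernel and probability-based scoring are our assumptions, not details given in the paper.

```python
import numpy as np
from sklearn.svm import SVC

class GenreFusionClassifier:
    """One binary SVM per (genre, feature) pair; scores are summed per
    genre over all features and the highest-scoring genre wins.
    Illustrative sketch only; kernel and parameters are assumptions."""

    def __init__(self, genres, feature_names):
        self.genres = genres
        self.feature_names = feature_names
        self.models = {}

    def fit(self, features, labels):
        # features: dict mapping feature name -> (n_videos, dim) array;
        # assumes every genre has at least one training example.
        labels = np.asarray(labels)
        for feat in self.feature_names:
            X = features[feat]
            for genre in self.genres:
                y = (labels == genre).astype(int)  # one-vs-rest target
                self.models[(genre, feat)] = SVC(
                    kernel="rbf", probability=True).fit(X, y)

    def predict(self, features):
        n = len(next(iter(features.values())))
        scores = np.zeros((n, len(self.genres)))
        for j, genre in enumerate(self.genres):
            for feat in self.feature_names:
                clf = self.models[(genre, feat)]
                # Sum the positive-class scores over all features
                scores[:, j] += clf.predict_proba(features[feat])[:, 1]
        return [self.genres[k] for k in scores.argmax(axis=1)]
```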
4. EVALUATION AND DISCUSSION

The evaluation of this year's MediaEval genre tagging task was performed on 1727 clips from blip.tv, distributed unevenly over 26 categories including a default category. Single-label classification is performed, and mean average precision (MAP) is used as the official performance measure. Training of the SVMs was conducted on approximately 100 videos per genre, except for autos and vehicles, where only 14 clips were available. These training videos come from a larger additional set of blip.tv videos; since we had limited time, it was not possible to process all of them, so we limited the number of training videos per genre to 100, selected randomly for each genre. Because our system works as a single-label classifier, we also computed the simple classification accuracy and calculated a second MAP score using a similarity value of 1 for the predicted genre instead of the very low probability outputs of our system.

All in all, 5 runs were evaluated using these three evaluation measures: a combination of all feature sets (run1), and each feature category, i.e., visual (run2), aural (run3), cognitive (run4) and structural (run5), evaluated independently. The results are presented in Table 1. The least contribution comes from the cognitive features, while the visual features (run2) contribute the most to the overall performance, outperforming the other runs in both MAP measures and achieving almost the same classification accuracy as all feature sets together. Of the six available visual features, color moments and wavelet texture show the best classification results, with 20% and 23%, respectively.

The best per-category results (above 50%) were achieved for web development (66.6%), mainstream media (68.9%), food and drink (61.1%), movies and television (58.5%) and literature (89.6%). The worst results (under 10%) were for documentary (4.5%), educational (3.2%), health (9.5%), travel (7.1%) and videoblogging (0%).

Table 1: Evaluation Results

               run1     run2     run3    run4     run5
MAP            0.0023   0.0035   0.001   0.001    0.003
2nd MAP        0.0038   0.006    0.001   0.0012   0.0028
Accuracy (%)   28.2     27.5     13.9    1.3      5.4

Our experiments show that a content-based system which achieves nearly perfect accuracy on TV datasets (95% and 99%, see [3]) and very high performance on a YouTube dataset (92.4%) is not able to achieve high performance on the blip.tv corpus. The main reasons for this might be the increased number of genres to be classified and the high intra-class diversity, which makes it difficult to separate genres from each other using content-based cues. More interestingly, the low-level visual and aural features show more promising results than the selected higher-level cognitive and structural cues. Either it is not possible to cover the variety, or overall resemblance, of these videos with such features, or more promising high-level features have to be found by analyzing the properties of web-videos. Because of the limits of content-based systems in this area, the use of metadata and other sources such as ASR engines is desirable to attain a robust genre classification system.

Acknowledgments

This study is funded by OSEO, French State agency for innovation, as part of the Quaero Programme.

5. REFERENCES

[1] M. Campbell, E. Haubold, S. Ebadollahi, D. Joshi, M. R. Naphade, A. P. Natsev, J. Seidl, J. R. Smith, K. Scheinberg, and L. Xie. IBM Research TRECVID-2006 video retrieval system. In Proc. of the NIST TRECVID Workshop, 2006.

[2] H. K. Ekenel, H. Gao, and R. Stiefelhagen. Universität Karlsruhe (TH) at TRECVID 2008. In Proc. of the NIST TRECVID Workshop, Gaithersburg, USA, Nov. 2008.

[3] H. K. Ekenel, T. Semela, and R. Stiefelhagen. Content-based video genre classification using multiple cues. In Proceedings of the 3rd International Workshop on Automated Information Extraction in Media Production, AIEMPro'10, pages 21–26, 2010.

[4] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. In Computer Vision and Pattern Recognition (CVPR), pages 762–768, 1997.

[5] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. J. F. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.

[6] M. Montagnuolo and A. Messina. Parallel neural networks for multimodal video genre classification. Multimedia Tools and Applications, 41:125–159, January 2009.

[7] M. A. Stricker and M. Orengo. Similarity of color images. In Storage and Retrieval for Image and Video Databases (SPIE), pages 381–392, 1995.

[8] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7:11–32, 1991.

[9] Z. Wang, M. Zhao, Y. Song, S. Kumar, and B. Li. YouTubeCat: Learning to categorize wild web videos. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 879–886, June 2010.
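For reference, the official MAP measure used in Section 4 can be computed as in the sketch below. It assumes scikit-learn's average_precision_score and a one-vs-rest reading of the per-genre ranked lists; the exact aggregation used by the official evaluation may differ.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, scores, genres):
    """MAP over genres: average precision of each genre's ranked clip
    list, then the mean over all genres with at least one positive.
    y_true: genre label per clip; scores: (n_clips, n_genres) array."""
    y_true = np.asarray(y_true)
    aps = []
    for j, genre in enumerate(genres):
        relevant = (y_true == genre).astype(int)
        if relevant.any():  # AP is undefined without positives
            aps.append(average_precision_score(relevant, scores[:, j]))
    return float(np.mean(aps))
```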