=Paper=
{{Paper
|id=None
|storemode=property
|title=Brno University of Technology at MediaEval 2011 Genre Tagging Task
|pdfUrl=https://ceur-ws.org/Vol-807/Hradis_BUT_Genre_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/HradisRB11
}}
==Brno University of Technology at MediaEval 2011 Genre Tagging Task==
Michal Hradiš, Ivo Řezníček, Kamil Behúň
Graph@FIT, Brno University of Technology, Bozetechova 2, Brno, CZ
{ihradis, ireznice}@fit.vutbr.cz, xbehun03@stud.fit.vutbr.cz

ABSTRACT
This paper briefly describes our approach to the video genre tagging task, which was a part of MediaEval 2011. We focused mainly on visual and audio information, and we exploited metadata and automatic speech transcripts only in a very basic way. Our approach relied on classification and on classifier fusion to combine different sources of information. We did not use any additional training data except the very small exemplary set provided by MediaEval (only 246 videos). The best performance was achieved by metadata alone; combining it with the other sources of information did not improve results in the submitted runs. An improvement was achieved later by choosing more suitable weights in the fusion. Excluding the metadata, audio and video gave better results than speech transcripts. Using classifiers for 345 semantic classes from the TRECVID 2011 semantic indexing (SIN) task to project the data worked better than classifying directly from video and audio features.

1. INTRODUCTION
Our approach was mainly motivated by the question of how the video classification approaches which we employ to solve the TRECVID SIN task [2] behave in a different context. The genre tagging task [3] is similar to SIN except that the classes are of a different kind, videos belong as a whole to a single class and, most importantly, the provided training set is in this case more than an order of magnitude smaller.

We attempted to exploit most of the modalities available: video, audio, automatic speech recognition [1] (ASR) and user-supplied metadata. We did not use social network information. The image features extracted from video were the standard Bag of visual Words (BOW) representations commonly used for image classification [5]. Spectrograms from audio were processed in the same way as image data. BOW was constructed from metadata and ASR as well.

2. METHOD
The BOW representation of video frames was constructed in a standard way [5], starting with local patch sampling followed by computing descriptors [4] and a codebook transform. We used the Harris-Laplace detector (HARLAP) and dense sampling with a position step of 8 pixels and a patch radius of 8 pixels (DENSE8), respectively 16 pixels (DENSE16). The extracted patches were represented by SIFT and color SIFT descriptors (SIFT, CSIFT). Codebooks were created for each representation by k-means with Euclidean distance and exact nearest neighbor search. The size of all codebooks was 4096. Local features were translated to BOW using codebook uncertainty [6] with a Gaussian kernel whose standard deviation was set to the average distance between closest neighboring codewords. The BOW vectors were normalized to L1 unit length for classification.
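As a rough illustration of the codebook-uncertainty encoding, the sketch below (Python/NumPy; the function and variable names are illustrative and this is our reading of [6], not the authors' code) computes a soft-assignment BOW histogram for the local descriptors of one frame or spectrogram.

```python
import numpy as np

def soft_assign_bow(descriptors, codebook, sigma):
    """Kernel-codebook ("codebook uncertainty") BOW encoding.

    descriptors: (n, d) local descriptors of one frame or spectrogram
    codebook:    (k, d) k-means codewords (k = 4096 in the paper)
    sigma:       Gaussian kernel width, e.g. the average distance
                 between closest neighboring codewords
    Returns an L1-normalized histogram of length k.
    """
    # Squared Euclidean distances between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    # Gaussian kernel weights: each descriptor votes for all codewords.
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    # Codeword uncertainty: normalize each descriptor's votes over the codebook.
    w /= w.sum(axis=1, keepdims=True)
    # Accumulate the votes and normalize the histogram to L1 unit length.
    bow = w.sum(axis=0)
    return bow / bow.sum()
```

For clarity the full distance matrix is computed in one shot; with a 4096-word codebook a real implementation would process descriptors in chunks or use a nearest-neighbor structure.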
The BOW representation from audio was extracted in almost the same way as from video. In this case, the one-dimensional audio signal was converted to a mel-frequency spectrogram, which is a 2D representation and can be treated as an image. For spectrograms, only DENSE8 and DENSE16 sampling was used, because spectrograms do not contain the distinct interest regions which Harris-Laplace could detect. Only the SIFT descriptor was used, as spectrograms do not contain any color information.

As the provided training set is extremely small, we decided to expand it by treating each video frame and short spectrogram as an individual sample whose label is the label of the original video, and by merging these partial decisions later. 100 equidistant samples were extracted from each video (training and testing). The length of the spectrograms was set to 10 seconds and an overlap was allowed when needed. A linear SVM was used to learn a separate one-against-all classifier for each genre. The meta-parameter C was set by cross-validation which ensured that samples from a single video did not appear in the training and testing set at the same time. Considering the small number of original videos, we set the same C for all classifiers of a particular representation. The final response for a video was computed as the number of samples from that video for which the classifier for a particular genre gave the highest response compared to the other genre classifiers.

For the TRECVID 2011 SIN task, we created classifiers for 345 semantic classes. These classifiers were created in almost the same way as the classifiers described above. We applied these 345 classifiers to the image and audio samples and created feature representations for the videos by computing histograms of their responses (8 bins per semantic class). The response histograms were then used to train genre classifiers as before, with the difference that the training set consisted of only the 246 videos and that the classifier responses were directly used as results. These classifiers are further denoted TV11.

For metadata and ASR we computed the BOW representation by removing XML elements and non-alphabetic characters and by splitting words where a lower-case character was followed by an upper-case character (a small preprocessing sketch is given at the end of this section). Classifiers for metadata and ASR were created in the same way as for the TV11 representation.

We assumed that the number of available training samples is too small for accurate and reliable fusion. For this reason we decided to make an educated guess based on previous experience and on results on the training set, and to combine the classifiers by a weighted average with the weights set by hand. Responses of the classifiers based on all audio and video features were averaged into C-AV, the TV11 classifiers into C-TV11, and ASR and metadata into C-TEXT. These averages were normalized by the biggest standard deviation of the individual class responses and were combined by a weighted average.

RUN1 used only ASR, as required. RUN3 combined C-AV, C-TV11 and C-TEXT with the weight of C-TEXT increased to 2.5. RUN4 combined C-AV and C-TEXT, with C-TEXT weighted 1.25. RUN5 combined C-TV11 and C-TEXT, again with C-TEXT weighted 1.25.
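The weighted-average fusion used for the runs can be sketched as follows (Python/NumPy; the function name, the dictionary layout and the division by the weight sum are our own assumptions about the text, not the authors' implementation).

```python
import numpy as np

def fuse(streams, weights):
    """Weighted-average late fusion of per-stream genre scores.

    streams: dict mapping a stream name (e.g. "C-AV", "C-TV11", "C-TEXT")
             to an (n_videos, n_genres) score matrix.
    weights: dict mapping the same names to hand-set scalar weights.
    Each stream is first scaled by the largest per-genre standard
    deviation of its scores, as described in the paper.
    """
    fused = None
    for name, scores in streams.items():
        scaled = scores / scores.std(axis=0).max()  # normalize by the biggest class std
        term = weights[name] * scaled
        fused = term if fused is None else fused + term
    return fused / sum(weights.values())

# Example corresponding to RUN3 (C-TEXT weighted 2.5, the others 1.0):
# fused = fuse({"C-AV": c_av, "C-TV11": c_tv11, "C-TEXT": c_text},
#              {"C-AV": 1.0, "C-TV11": 1.0, "C-TEXT": 2.5})
```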
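The metadata/ASR preprocessing mentioned earlier in this section (removing XML elements and non-alphabetic characters, and splitting words where a lower-case character is followed by an upper-case one) could look roughly like the sketch below; the regular expressions are an assumption about the intended behaviour, not the authors' exact rules.

```python
import re

def tokenize(text):
    """Turn raw metadata or ASR text into lower-cased word tokens."""
    text = re.sub(r"<[^>]*>", " ", text)               # drop XML elements
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)   # split at lower-to-upper transitions
    text = re.sub(r"[^A-Za-z]+", " ", text)            # keep alphabetic characters only
    return text.lower().split()

# tokenize("<title>FunnyCatVideo 2011</title>")  ->  ['funny', 'cat', 'video']
```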
3. RESULTS
The results of the official runs are shown in Table 1. Using the MediaEval methodology, we additionally evaluated all the separate parts which were combined for the official runs, as well as some other combinations. These unofficial results are shown in Table 2.

Table 1: Mean average precision on the test set achieved by the runs submitted to MediaEval 2011.

  Run    MAP
  RUN1   0.165
  RUN3   0.346
  RUN4   0.322
  RUN5   0.360

Table 2: Unofficial results on the testing set. Mean average precision reported.

  Features                   Direct   TV11
  DENSE16 CSIFT              0.126    0.194
  DENSE16 SIFT               0.100    0.178
  DENSE8 CSIFT               0.116    0.201
  DENSE8 SIFT                0.078    0.187
  HARLAP CSIFT               0.145    0.178
  HARLAP SIFT                0.133    0.174
  SPECTRUM DENSE16 SIFT      0.195    0.167
  SPECTRUM DENSE8 SIFT       0.158    0.188
  COMBINED (C-AV, C-TV11)    0.226    0.275
  ASR                        0.165
  METADATA                   0.405
  C-TEXT                     0.300
  ALL WITHOUT METADATA       0.300
  ALL WITHOUT ASR            0.448
  RANDOM                     0.046

From the individual types of features, the best results were achieved by metadata. Metadata alone gives better results than all of the official runs, where adding other features decreased the performance. The TV11 classifiers provide significantly better results than the classifiers trained directly on the image features. The same is true also for their combinations, where C-TV11 gives 0.275 MAP and C-AV only 0.226 MAP. The question remains whether this is because the TRECVID classifiers bring additional knowledge or due to the differences in the training of the two sets of classifiers. Interestingly, the audio features provide good results comparable to the visual features in TV11, and are much better than the image features when learning directly from the features. The worse results in the case of TV11 could be explained by the lower performance of the original audio classifiers on TRECVID data (almost two times worse than the image features).

Further, we experimented with additional combinations of features. We combined all classifiers, and all classifiers excluding METADATA, with weights which better reflect the performance of the classifiers. These results are denoted as ALL and ALL WITHOUT METADATA, respectively, in Table 2. The weights were 1× ASR, 1× C-AV, 4× C-TV11 and 8× METADATA. The combination ALL provides the overall best result of 0.448 MAP and significantly improves over the metadata alone. ALL WITHOUT METADATA reaches 0.3 MAP, improving over all of its components.

4. CONCLUSION
The achieved results are surprisingly good considering the small size of the training set used. The question is how the results would compare to other methods on this dataset, especially to those which use external sources of knowledge and which focus more on the metadata, as it was shown to be the most important source of information. Additionally, it is not certain how the presented methods would work on a more diverse dataset.

Although the metadata is definitely the most important source of information for genre recognition, the audio and video content features improved the results when appropriately combined. A larger training set would be needed to perform proper classifier fusion, which could further increase the benefit of the content-based features.

Acknowledgements
This work has been supported by the EU FP7 project TA2: Together Anywhere, Together Anytime, ICT-2007-214793, grant no. 214793.

5. REFERENCES
[1] Jean-Luc Gauvain, Lori Lamel, and Gilles Adda. The LIMSI broadcast news transcription system. Speech Communication, 37(1-2):89–108, 2002.
[2] Michal Hradiš et al. Brno University of Technology at TRECVID 2010. In TRECVID 2010: Participant Notebook Papers and Slides, page 11. National Institute of Standards and Technology, 2010.
[3] Martha Larson et al. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[4] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, 2005.
[5] Cees G. M. Snoek et al. The MediaMill TRECVID 2010 semantic video search engine. In TRECVID 2010: Participant Notebook Papers and Slides, 2010.
[6] J. C. van Gemert et al. Visual word ambiguity. IEEE Trans. Pattern Anal. Mach. Intell., 32(7):1271–1283, 2010.