=Paper=
{{Paper
|id=None
|storemode=property
|title=TUD-MM at MediaEval 2011 Genre Tagging Task: Video search reranking for genre tagging
|pdfUrl=https://ceur-ws.org/Vol-807/Xu_TUD_Genre_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/XuTH11
}}
==TUD-MM at MediaEval 2011 Genre Tagging Task: Video search reranking for genre tagging==
Peng Xu (1), D.M.J. Tax (2), Alan Hanjalic (1)
(1) Delft Information Retrieval Lab, (2) Pattern Recognition Laboratory
Delft University of Technology, Mekelweg 4, Delft, The Netherlands
{p.xu, d.tax, a.hanjalic}@tudelft.nl

ABSTRACT
In this paper, we investigate the possibility of using visual information to improve a text-based ranking. Both a structure-based representation (using the similarity matrix of the frames of one video) and a key-frame-based representation (using visual words) are evaluated. It appears that visual information can improve the performance through reranking only for some queries. The presented reranking experiments show the limitations, but also the potential, of these structural and visual representations.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.7 Digital Libraries.

General Terms
Measurement, Performance, Experimentation

Keywords
Video representation, Video search reranking

1. INTRODUCTION
In the MediaEval 2011 Genre Tagging Task, internet videos have to be ranked according to their relevance for a set of genre tags [1]. For certain domains, visual information has been shown to be related to video genres, but it remains a challenging problem for internet videos because of the diversity of the video content. This research evaluates the potential of visual information for genre-level video retrieval. The problem is addressed by using visual information to rerank the text retrieval ranking list. Despite the different strategies used in various reranking methods, the basic assumption is that visually similar videos should be ranked in nearby positions in the ranking list. Therefore, it is important to find appropriate visual features to represent the videos.

2. VIDEO REPRESENTATIONS
In this paper, a Bayesian reranking approach is applied on top of two kinds of video representations: the first is a structure-based feature and the second is a key-frame-based feature.

2.1 Structure based representation
Most video similarity measures compare the visual content of videos directly, using color, shapes and movement. However, these measures do not always work well because of the high variance of video content. In particular, videos from the same genre are not expected to have the same visual content. Motivated by the limitations of visual similarity, we propose a method for video structure representation and measurement.

In this method, a video self-similarity matrix is used to represent the structure. This matrix is generated by calculating the pairwise similarity between the frames of one video, sampled at a fixed rate. The similarities are calculated from the HSV histogram of each frame. This representation exploits the fact that one video tends to have consistent quality and editing conditions, so a reliable similarity can be achieved without complicated low-level visual features or additional domain knowledge.

Each video is thus represented by a square similarity matrix, which can be considered as a square gray-level image. Next, three types of multi-scale statistical image features are extracted from the self-similarity matrix: a) GLCM-based features (30 components): a Gray Level Co-occurrence Matrix (GLCM) is constructed from the similarity matrix for 2 directions (0 and 45 degrees) and 3 offsets (3, 6, 12), and for each GLCM the energy, entropy, correlation, contrast and homogeneity are computed; b) 3-scale Gabor texture features (14 components); c) Intensity Coherence Vectors (16 components): each pixel within a given intensity bin is classified as either coherent or incoherent, based on whether or not it is part of a large similarly-colored region, whose minimum size is determined by a fixed threshold (1/15 of the image size is used).
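To make the structure representation concrete, the following is a minimal sketch of how such a self-similarity matrix could be computed, assuming OpenCV and NumPy are available. The sampling interval, the HSV binning and the use of histogram intersection as the frame similarity are illustrative assumptions; the paper only states that frames are sampled at a fixed rate and compared via their HSV histograms.

```python
import cv2
import numpy as np

def hsv_histogram(frame, bins=(8, 8, 8)):
    """Normalized HSV color histogram of one frame (illustrative binning)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def self_similarity_matrix(video_path, sample_every=25):
    """Pairwise similarity matrix of the sampled frames of one video."""
    cap = cv2.VideoCapture(video_path)
    hists, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:            # fixed sampling rate (assumed value)
            hists.append(hsv_histogram(frame))
        idx += 1
    cap.release()
    h = np.array(hists)
    n = len(h)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Histogram intersection as the frame similarity (assumption)
            sim[i, j] = np.minimum(h[i], h[j]).sum()
    return sim   # can be treated as a gray-level image for texture features
```

The resulting matrix can then be fed to standard texture descriptors (GLCM statistics, Gabor filters, coherence vectors) as described above.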
2.2 Key-frame based representation
Next to the structure, a visual word representation based on key-frames is used for measuring the visual similarity between videos [2]. The key-frames are clustered into K clusters using K-means clustering on image features. Every key-frame can then be assigned to a cluster label, and the label histogram is finally used as the representation of the video. This feature was originally designed for web video categorization. The number of clusters is set to K = 400; additional experiments have shown that the performance is not sensitive to this parameter.
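A minimal sketch of this cluster-label histogram is shown below, assuming key-frame feature vectors have already been extracted (for instance the per-frame HSV histograms from the previous sketch). The use of scikit-learn's KMeans and the normalization of the histogram are assumptions made for illustration; the paper does not specify the implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(all_keyframe_features, n_clusters=400, seed=0):
    """Cluster key-frame features from the whole collection into visual words."""
    return KMeans(n_clusters=n_clusters, random_state=seed).fit(all_keyframe_features)

def video_word_histogram(codebook, keyframe_features):
    """Assign each key-frame of one video to its nearest cluster and return the
    normalized label histogram, used as the representation of that video."""
    labels = codebook.predict(keyframe_features)
    hist = np.bincount(labels, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```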
3. BAYESIAN RERANKING
The attractiveness of reranking is that it is naturally unsupervised. Given an initial ranking list, an improved one can be obtained by grouping visually similar videos into nearby positions. In practice, however, the design of reranking methods and the setting of their parameters depend strongly on the quality and characteristics of the baseline ranking list.

The Bayesian video reranking method [3] is used in this paper because it requires fewer assumptions about the original ranking list and is less sensitive to particular parameter settings. In this method, the reranking problem is cast as minimizing the following energy function:

E(\mathbf{r}) = \sum_{i,j} w_{ij}\,(r_i - r_j)^2 + c \sum_{i} (r_i - \bar{r}_i)^2

Here r = [r_1, r_2, ..., r_N] is the ranking list after reranking, r̄ = [r̄_1, r̄_2, ..., r̄_N] is the initial ranking list, and w_{ij} = exp(-||x_i - x_j|| / σ) is the visual similarity between two items in the refined ranking list. The first term measures the visual consistency of the ranking list, while the second term is the ranking distance between the reranked list and the initial list. c is a trade-off parameter between the two terms, which can be optimized on the development set.
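As an illustration of this kind of regularized reranking, the sketch below minimizes the energy above in closed form over relevance scores: the pairwise term can be written with the graph Laplacian L = D - W, giving r = c (2L + cI)^{-1} r̄, after which the videos are sorted by the refined scores. Treating the initial list as a score vector and the specific similarity bandwidth are assumptions of the sketch, not details taken from the paper.

```python
import numpy as np

def rerank_by_visual_consistency(features, initial_scores, c=1.0, sigma=1.0):
    """Minimize E(r) = sum_ij w_ij (r_i - r_j)^2 + c * sum_i (r_i - r0_i)^2
    in closed form: r = c (2L + cI)^{-1} r0, with L = D - W."""
    x = np.asarray(features, dtype=float)
    r0 = np.asarray(initial_scores, dtype=float)
    # Pairwise visual similarity w_ij = exp(-||x_i - x_j|| / sigma)  (assumed kernel)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    w = np.exp(-dists / sigma)
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w              # graph Laplacian L = D - W
    n = len(r0)
    r = np.linalg.solve(2.0 * lap + c * np.eye(n), c * r0)
    order = np.argsort(-r)                        # refined ranking, best first
    return order, r
```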
4. RESULTS
Reranking is performed on two baseline ranking lists generated by text retrieval. The first one uses the automatic speech recognition transcripts (ASR), the second one uses the metadata of the videos; the second is expected to outperform the first. Details on how these two baselines are generated can be found in [4]. Five official runs were submitted: 1) Gabor features combined with CCV features on the ASR baseline; 2) visual-word features on the ASR baseline; 3) Gabor features combined with CCV features on the metadata baseline; 4) visual-word features on the metadata baseline; 5) GLCM features combined with CCV features on the metadata baseline. The mean average precision (MAP) of the text baselines and of the reranking results is compared in Table 1.

Table 1. MAPs of the text baselines and the reranking results

                         ASR       Metadata
  Baseline               0.2146    0.3936
  Reranking  Gabor+CCV   0.2060    0.3703
             GLCM+CCV    ---       0.3690
             VW          0.2098    0.3605

It can be seen in Table 1 that, compared with the initial ranking lists, the reranking process did not improve the overall MAP. This result is unexpected, because results in the literature suggest that the visual channel may contain information about the video genre [5]. It appears that in this dataset around one fourth of the videos contain a single person talking with few visual aids; these videos therefore do not carry enough information in the visual channel to estimate their genre tags.

Furthermore, many videos in this dataset come in series: 1390 videos in the test set belong to shows with more than 2 episodes. Videos of the same show tend to share certain visual similarities. Analyzing the reranking performance per query shows that this property has a strong effect on the results (the reranking results for some selected queries are presented in Figure 1). Generally speaking, if most of the true positive videos for a certain query come from one or a few shows, the reranking results can be reasonable, as for the query '1016 politics'.

[Figure 1. Reranking results for selected queries on the ASR text baseline: per-query performance (on a 0-1 scale) of the ASR baseline and the Gabor+CCV and VW reranking runs for queries 1001, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1016, 1017, 1019, 1020 and 1025.]

In particular, the most significant improvement appeared for the query '1001 autos_and_vehicles'. It can be seen from the ground truth that all 6 videos of this genre are episodes of the same show, and the reranking process takes advantage of the high visual similarity between them. For this query, the key-frame-based features achieved higher performance than the structure-based features, because videos in the same series tend to have duplicate parts, which are easily detected by visual-similarity-based representations.

Exploiting the similarity within the same show is, however, not enough for genre tagging. Videos of the same show may well be of different genres. Moreover, for some queries the videos come from many different shows (for example, the 64 videos of the query '1019 sports' come from 16 different shows). In these cases the visual similarities between true positive videos are not obvious, and the structure-based features therefore outperform the key-frame-based ones. This indicates that videos of the same genre may share similarities in structure even though they are not visually consistent.

5. DISCUSSION AND FUTURE WORK
Although visual reranking did not improve the MAP of the initial ranking lists, this does not necessarily mean that visual information is useless for detecting video genre. It is the special characteristics of this dataset that make it difficult to exploit the visual channel. In particular, compared with the conventional understanding of video 'genres', the genre tags used in this task are more related to the 'topics' of the videos.

The proposed structure-based video representation offers a possibility for inexact matching of video similarity, and its characteristics can be observed by analyzing the reranking performance on individual queries. It is still not clear what the most suitable way of representing the frame-similarity matrix is. A more attractive direction may be to discover a set of tags that reflect the visual consistency of videos.

6. REFERENCES
[1] Larson, M., Eskevich, M., Ordelman, R., Kofler, C., Schmiedeke, S. and Jones, G.J.F. 2011. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. MediaEval 2011 Workshop, 1-2 September 2011, Pisa, Italy.
[2] Yang, L., Liu, J., Yang, X. and Hua, X.-S. 2007. Multi-modality web video categorization. In Proceedings of MIR '07, 265-274.
[3] Tian, X., Yang, L., Wang, J., Yang, Y., Wu, X. and Hua, X.-S. 2008. Bayesian video search reranking. In Proceedings of MM '08, ACM, 131-140.
[4] Rudinac, S., Larson, M. and Hanjalic, A. 2011. TUD-MIR at MediaEval 2011 Genre Tagging Task: Query Expansion from a Limited Number of Labeled Videos. In Working Notes MediaEval 2011.
[5] Brezeale, D. and Cook, D.J. 2008. Automatic Video Classification: A Survey of the Literature. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 416-430.