=Paper=
{{Paper
|id=None
|storemode=property
|title=TUD-MM at MediaEval 2011 Genre Tagging Task: Video search reranking for genre tagging
|pdfUrl=https://ceur-ws.org/Vol-807/Xu_TUD_Genre_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/XuTH11
}}
==TUD-MM at MediaEval 2011 Genre Tagging Task: Video search reranking for genre tagging==
Peng Xu (1), D.M.J. Tax (2), Alan Hanjalic (1)
(1) Delft Information Retrieval Lab, (2) Pattern Recognition Laboratory
Delft University of Technology, Mekelweg 4, Delft, The Netherlands
{p.xu, d.tax, a.hanjalic}@tudelft.nl

ABSTRACT
In this paper, we investigate the possibility of using visual information to improve a text-based ranking. Both a structure-based representation (using the similarity matrix of the frames of one video) and a key-frame-based representation (using visual words) are evaluated. It appears that visual information can improve the performance through reranking only for some queries. The presented reranking experiments show the limitations, but also the potential, of these structural and visual representations.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.7 Digital Libraries.

General Terms
Measurement, Performance, Experimentation

Keywords
Video representation, Video search reranking

1. INTRODUCTION
In the MediaEval 2011 Genre Tagging Task, internet videos have to be ranked according to their relevance for a set of genre tags [1]. For certain domains, visual information has been shown to be related to video genres, but it remains a challenging problem for internet videos because of the diversity of the video content. This research evaluates the potential of visual information for genre-level video retrieval. The problem is addressed by using visual information to rerank the text retrieval ranking list. Despite the different strategies used in various reranking methods, the basic assumption is that visually similar videos should be ranked in nearby positions in the ranking list. Therefore, it is important to find appropriate visual features to represent the videos.

2. VIDEO REPRESENTATIONS
In this paper, a Bayesian reranking approach is applied on top of two kinds of video representations: the first is a structure-based feature and the second is a key-frame-based feature.

2.1 Structure based representation
Most video similarity measures compare the visual content of videos directly, using color, shapes and movement. However, these measures do not always work well because of the high variance of video content. In particular, videos from the same genre are not expected to have the same visual content. Motivated by the limitations of visual similarity, we propose a method for video structure representation and measurement.

In this method, a video self-similarity matrix is used to represent the structure. This matrix is generated by calculating the pairwise similarity between the frames of one video, sampled at a fixed rate. The similarities are calculated from the HSV histogram of each frame. This representation exploits the fact that one video tends to have consistent quality and editing conditions, so a reliable similarity can be achieved without complicated low-level visual features or additional domain knowledge.

Each video is thus represented by a square similarity matrix, which can be considered as a square gray-level image. Next, three types of multi-scale statistical image features are extracted from the self-similarity matrix: a) GLCM-based features (30 components): a Gray Level Co-occurrence Matrix (GLCM) is constructed from the similarity matrix for 2 directions (0 and 45 degrees) and 3 offsets (3, 6, 12), and for each GLCM the energy, entropy, correlation, contrast and homogeneity are computed; b) 3-scale Gabor texture features (14 components); c) Intensity Coherence Vectors (16 components): each pixel within a given intensity bin is classified as either coherent or incoherent, based on whether or not it is part of a large similarly-colored region, whose minimum size is determined by a fixed threshold (1/15 of the image size is used).
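To make the structure representation concrete, the following is a minimal sketch of how such a self-similarity matrix could be computed, assuming OpenCV and NumPy are available. The sampling interval, the HSV binning and the use of histogram intersection as the frame similarity are illustrative assumptions; the paper only states that frames are sampled at a fixed rate and compared via their HSV histograms.

```python
import cv2
import numpy as np

def hsv_histogram(frame, bins=(8, 8, 8)):
    """Normalized HSV color histogram of one frame (illustrative binning)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def self_similarity_matrix(video_path, sample_every=25):
    """Pairwise similarity matrix of the sampled frames of one video."""
    cap = cv2.VideoCapture(video_path)
    hists, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:            # fixed sampling rate (assumed value)
            hists.append(hsv_histogram(frame))
        idx += 1
    cap.release()
    h = np.array(hists)
    n = len(h)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Histogram intersection as the frame similarity (assumption)
            sim[i, j] = np.minimum(h[i], h[j]).sum()
    return sim   # can be treated as a gray-level image for texture features
```

The resulting matrix can then be fed to standard texture descriptors (GLCM statistics, Gabor filters, coherence vectors) as described above.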
2.2 Key-frame based representation
Next to the structure, a visual word representation based on key-frames is used for measuring the visual similarity between videos [2]. The key-frames are clustered into K clusters using K-means clustering on image features. Every key-frame can then be assigned to a cluster label, and the label histogram is finally used as the representation of the video. This feature was originally designed for web video categorization. The number of clusters is set to K = 400; additional experiments have shown that the performance is not sensitive to this parameter.
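A minimal sketch of this cluster-label histogram is shown below, assuming key-frame feature vectors have already been extracted (for instance the per-frame HSV histograms from the previous sketch). The use of scikit-learn's KMeans and the normalization of the histogram are assumptions made for illustration; the paper does not specify the implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(all_keyframe_features, n_clusters=400, seed=0):
    """Cluster key-frame features from the whole collection into visual words."""
    return KMeans(n_clusters=n_clusters, random_state=seed).fit(all_keyframe_features)

def video_word_histogram(codebook, keyframe_features):
    """Assign each key-frame of one video to its nearest cluster and return the
    normalized label histogram, used as the representation of that video."""
    labels = codebook.predict(keyframe_features)
    hist = np.bincount(labels, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```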
3. BAYESIAN RERANKING
The attractiveness of reranking is that it is naturally unsupervised. Given an initial ranking list, an improved one can be obtained by grouping visually similar videos into nearby positions. In practice, however, the design of reranking methods and the setting of their parameters depend strongly on the quality and characteristics of the baseline ranking list.

The Bayesian video reranking method [3] is used in this paper because it requires fewer assumptions about the original ranking list and is less sensitive to particular parameter settings. In this method, the reranking problem is cast as minimizing the following energy function:

E(\mathbf{r}) = \sum_{i,j} w_{ij}\,(r_i - r_j)^2 + c \sum_{i} (r_i - \bar{r}_i)^2

Here r = [r_1, r_2, ..., r_N] is the ranking list after reranking, r̄ = [r̄_1, r̄_2, ..., r̄_N] is the initial ranking list, and w_{ij} = exp(-||x_i - x_j|| / σ) is the visual similarity between two items in the refined ranking list. The first term measures the visual consistency of the ranking list, while the second term is the ranking distance between the reranked list and the initial list. c is a trade-off parameter between the two terms, which can be optimized on the development set.
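As an illustration of this kind of regularized reranking, the sketch below minimizes the energy above in closed form over relevance scores: the pairwise term can be written with the graph Laplacian L = D - W, giving r = c (2L + cI)^{-1} r̄, after which the videos are sorted by the refined scores. Treating the initial list as a score vector and the specific similarity bandwidth are assumptions of the sketch, not details taken from the paper.

```python
import numpy as np

def rerank_by_visual_consistency(features, initial_scores, c=1.0, sigma=1.0):
    """Minimize E(r) = sum_ij w_ij (r_i - r_j)^2 + c * sum_i (r_i - r0_i)^2
    in closed form: r = c (2L + cI)^{-1} r0, with L = D - W."""
    x = np.asarray(features, dtype=float)
    r0 = np.asarray(initial_scores, dtype=float)
    # Pairwise visual similarity w_ij = exp(-||x_i - x_j|| / sigma)  (assumed kernel)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    w = np.exp(-dists / sigma)
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w              # graph Laplacian L = D - W
    n = len(r0)
    r = np.linalg.solve(2.0 * lap + c * np.eye(n), c * r0)
    order = np.argsort(-r)                        # refined ranking, best first
    return order, r
```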
4. RESULTS
Reranking is performed on two baseline ranking lists generated by text retrieval. The first one uses the automatic speech recognition transcripts (ASR), the second one uses the metadata of the videos; the second is expected to outperform the first. Details on how these two baselines are generated can be found in [4]. Five official runs were submitted: 1) Gabor features combined with CCV features on the ASR baseline; 2) visual-word features on the ASR baseline; 3) Gabor features combined with CCV features on the metadata baseline; 4) visual-word features on the metadata baseline; 5) GLCM features combined with CCV features on the metadata baseline. The mean average precision (MAP) of the text baselines and of the reranking results is compared in Table 1.

Table 1. MAPs of the text baselines and the reranking results

                         ASR       Metadata
  Baseline               0.2146    0.3936
  Reranking  Gabor+CCV   0.2060    0.3703
             GLCM+CCV    ---       0.3690
             VW          0.2098    0.3605

It can be seen in Table 1 that, compared with the initial ranking lists, the reranking process did not improve the overall MAP. This result is unexpected, because results in the literature suggest that the visual channel may contain information about the video genre [5]. It appears that in this dataset around one fourth of the videos contain a single person talking with few visual aids; these videos therefore do not carry enough information in the visual channel to estimate their genre tags.

Furthermore, many videos in this dataset come in series: 1390 videos in the test set belong to shows with more than 2 episodes. Videos of the same show tend to share certain visual similarities. Analyzing the reranking performance per query shows that this property has a strong effect on the results (the reranking results for some selected queries are presented in Figure 1). Generally speaking, if most of the true positive videos for a certain query come from one or a few shows, the reranking results can be reasonable, as for the query '1016 politics'.

[Figure 1. Reranking results for selected queries on the ASR text baseline: per-query performance (on a 0-1 scale) of the ASR baseline and the Gabor+CCV and VW reranking runs for queries 1001, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1016, 1017, 1019, 1020 and 1025.]

In particular, the most significant improvement appeared for the query '1001 autos_and_vehicles'. It can be seen from the ground truth that all 6 videos of this genre are episodes of the same show, and the reranking process takes advantage of the high visual similarity between them. For this query, the key-frame-based features achieved higher performance than the structure-based features, because videos in the same series tend to have duplicate parts, which are easily detected by visual-similarity-based representations.

Exploiting the similarity within the same show is, however, not enough for genre tagging. Videos of the same show may well be of different genres. Moreover, for some queries the videos come from many different shows (for example, the 64 videos of the query '1019 sports' come from 16 different shows). In these cases the visual similarities between true positive videos are not obvious, and the structure-based features therefore outperform the key-frame-based ones. This indicates that videos of the same genre may share similarities in structure even though they are not visually consistent.

5. DISCUSSION AND FUTURE WORK
Although visual reranking did not improve the MAP of the initial ranking lists, this does not necessarily mean that visual information is useless for detecting video genre. It is the special characteristics of this dataset that make it difficult to exploit the visual channel. In particular, compared with the conventional understanding of video 'genres', the genre tags used in this task are more related to the 'topics' of the videos.

The proposed structure-based video representation offers a possibility for inexact matching of video similarity, and its characteristics can be observed by analyzing the reranking performance on individual queries. It is still not clear what the most suitable way of representing the frame-similarity matrix is. A more attractive direction may be to discover a set of tags that reflect the visual consistency of videos.

6. REFERENCES
[1] Larson, M., Eskevich, M., Ordelman, R., Kofler, C., Schmiedeke, S. and Jones, G.J.F. 2011. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. MediaEval 2011 Workshop, 1-2 September 2011, Pisa, Italy.
[2] Yang, L., Liu, J., Yang, X. and Hua, X.-S. 2007. Multi-modality web video categorization. In Proceedings of MIR '07, 265-274.
[3] Tian, X., Yang, L., Wang, J., Yang, Y., Wu, X. and Hua, X.-S. 2008. Bayesian video search reranking. In Proceedings of MM '08, ACM, 131-140.
[4] Rudinac, S., Larson, M. and Hanjalic, A. 2011. TUD-MIR at MediaEval 2011 Genre Tagging Task: Query Expansion from a Limited Number of Labeled Videos. In Working Notes MediaEval 2011.
[5] Brezeale, D. and Cook, D.J. 2008. Automatic Video Classification: A Survey of the Literature. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 416-430.