TUD-MIR at MediaEval 2011 Genre Tagging Task: Query Expansion from a Limited Number of Labeled Videos

Stevan Rudinac, Martha Larson, Alan Hanjalic
Multimedia Information Retrieval Lab, Delft University of Technology, Delft, The Netherlands
{s.rudinac, m.a.larson, a.hanjalic}@tudelft.nl

ABSTRACT
In this paper we present the results of our initial research on genre tagging. We approach the task from an information retrieval perspective, using a relatively small number of labeled videos in the development set to mine query expansion terms characteristic of each genre. We also investigate which sources of information associated with the videos or extracted from their audio channel, e.g. title, description, tags and automatic speech recognition (ASR) transcripts, yield the highest improvement within our query expansion framework. The experiments performed on the MediaEval 2011 Genre Tagging dataset demonstrate the effectiveness of our approach.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – retrieval models, query formulation.

General Terms
Algorithms, Performance, Experimentation.

Keywords
Genre tagging, query expansion, video retrieval.

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy

1. INTRODUCTION
In this paper we present the results of our initial research on genre tagging, conducted as part of our participation in the MediaEval 2011 benchmark. Aiming to create a solid baseline for future work, we investigate which sources of information, including automatic speech recognition (ASR) transcripts and the metadata associated with the videos, e.g. title, description and tags, yield the best performance in the task. Information about genre is generally encoded in both the visual and the spoken channel of a video. In the specific case of the semi-professional user-generated videos used to compose the MediaEval 2011 Genre Tagging datasets, the visual channel usually does not provide enough information to discriminate between videos based on genre [6], because a large number of videos depict a single person talking about a particular topic. For this reason, here we focus on the spoken channel and metadata only, while the possibilities of exploiting visual content to improve performance in a further step are explored in [6].

Motivated by the success of information retrieval approaches to semantic video annotation demonstrated in the Tagging Task Professional and WWW of the MediaEval 2010 benchmark [2], we perform genre tagging within an information retrieval framework. We conjecture that, given the relatively small number of videos in the development set, it would be practically infeasible to train a language model for each individual genre label. Instead, in our approach we take a genre label to be an initial query and mine additional genre-specific query terms from the text associated with the available labeled videos. We choose query expansion because in our previous work [4] it has proven effective in semantic-theme-based video tagging and retrieval.

The experiments reported here were performed on the MediaEval 2011 Genre Tagging datasets [1], which consist of semi-professional documentary videos downloaded from blip.tv together with the associated metadata. The metadata available with the videos includes title, description, tags, as well as the id of the show to which a particular video episode belongs. The development set consists of 247 videos for which the genre labels are provided. The test set is larger and consists of 1727 videos. Each video in the development and the test set belongs to one of 26 genre categories defined by blip.tv (e.g. art, autos and vehicles, business, default category etc.). The task requires the prediction of a genre label for each video in the test set. In the following, we first describe our query expansion approaches as well as the information sources used. Then, we report on experimental results, which confirm the effectiveness of our approach to genre tagging and indicate research directions that should be pursued for further performance improvement.

2. APPROACHES
In all official runs, we expand queries using the videos available in the development set. Additionally, we experiment with several other query expansions and report the results as "unofficial runs".

2.1 Query Expansion via Labeled Videos
We conjecture that a set of terms characteristic of a particular genre can be extracted even from a small number of labeled videos and further used for query expansion. This concept has been widely exploited in e.g. information retrieval approaches with relevance feedback [5]. In our approach, we treat a genre label as the original query and sample additional query terms from the text associated with the videos of that particular genre available in the development set. For each video, the text from the information sources used in a particular run is concatenated into a single document, and then stopword removal and stemming are applied. Further, for each genre we rank all terms in the development set vocabulary in order of decreasing Offer Weight [3] and extend the initial query (genre label) with the 20 top-ranked terms:

OW(i) = r \cdot \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}    (1)

In the formula above, r is the number of videos of a particular genre in which term t(i) appears, R is the total number of videos of that genre, N is the total number of videos in the collection, and n is the number of videos in the collection in which term t(i) appears.
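To make the term selection step concrete, the following Python sketch illustrates how the Offer Weight of Equation 1 can be computed for one genre. It is a minimal sketch, not our actual implementation: the data layout (a list of per-video (genre label, term set) pairs produced after stopword removal and stemming) and all function and variable names are illustrative assumptions.

    import math
    from collections import defaultdict

    def top_expansion_terms(docs, genre, top_k=20):
        # docs: list of (genre_label, set_of_terms) pairs, one per
        # development video, after stopword removal and stemming.
        N = len(docs)                              # videos in the collection
        R = sum(1 for g, _ in docs if g == genre)  # videos of this genre
        n = defaultdict(int)                       # videos a term appears in
        r = defaultdict(int)                       # genre videos a term appears in
        for g, terms in docs:
            for t in terms:
                n[t] += 1
                if g == genre:
                    r[t] += 1

        def ow(t):  # Offer Weight, Equation (1)
            return r[t] * math.log(
                ((r[t] + 0.5) * (N - n[t] - R + r[t] + 0.5)) /
                ((n[t] - r[t] + 0.5) * (R - r[t] + 0.5)))

        return sorted(n, key=ow, reverse=True)[:top_k]

    # The expanded query is the genre label plus the 20 top-ranked terms, e.g.
    # query_terms = [genre] + top_expansion_terms(docs, genre)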
As the retrieval method we use the negative divergence between multinomial models of the query and the document (video), as implemented in the Lemur Toolkit.

Since the default category is very broad and diverse, we produce the ranked list of videos for this genre independently. We conjecture that videos that are ranked low, or do not appear at all, in the results lists produced for the other 25 genres likely belong to the default category. Therefore, we produce the ranked list for this genre according to the video score VS_i = \sum_{g=1}^{25} (N_g - R_{gi}) / N_g, where N_g is the total number of videos retrieved for a particular genre g and R_{gi} is the rank of video v_i in that list. If a particular video does not appear in the results list produced for a genre g, R_{gi} is set to 0.
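The sketch below illustrates this default-category scoring under the stated convention R_gi = 0 for absent videos. Representing the 25 per-genre results lists as ordered Python lists, the function name, and the decreasing sort order are illustrative assumptions rather than details taken from our implementation.

    def rank_default_category(genre_rankings, all_videos):
        # genre_rankings: dict mapping each of the 25 non-default genres
        # to its ranked list of video ids (position 0 = rank 1).
        scores = {}
        for v in all_videos:
            vs = 0.0
            for ranking in genre_rankings.values():
                N_g = len(ranking)
                # Rank of v in this genre's list; 0 if it does not
                # appear, as specified above.
                R_gi = ranking.index(v) + 1 if v in ranking else 0
                vs += (N_g - R_gi) / N_g
            scores[v] = vs
        # We sort by decreasing VS_i here; the exact ordering convention
        # is left implicit in the text.
        return sorted(all_videos, key=lambda v: scores[v], reverse=True)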
In the official runs we test how different sources of information influence the tagging performance of the approach. The performance of the runs described below is reported in Table 1.

ASR: In the official run_1 we investigate the scenario in which no metadata is available. Expansion terms are sampled from the ASR transcripts of the videos in the development set, and retrieval is performed on the ASR transcripts of the videos in the test set.

Metadata No Tags: In the official run_2 only the video title and description are exploited.

Metadata & ASR: In the official run_3 we use the title, description, tags and ASR transcripts associated with the videos.

Metadata: To produce the results of the official run_4 we index only the title, description and tags associated with the videos.

Reranking With Show ID: In the case of the blip.tv dataset, the show id is a strong genre indicator. Specifically, on the development set we noticed that episodes of the same show are usually of the same genre. However, we decided not to use the show ids in the first four official runs, because we would not be able to localize the performance improvement and isolate the contribution of the other information used. We use the results produced in the official run_4 as the baseline for reranking, because it achieved the highest performance on the development set. For each genre, we utilize show ids to compute the median rank of the videos (episodes) coming from the same show. In the official run_5, we follow the general video search reranking idea, ranking videos of the same show together, in order of their median rank in the starting results list; a sketch of this step is given below. This run is meant to complement the visual reranking runs from [6], which investigate the usefulness of the visual channel for discriminating blip.tv videos based on genre. In the unofficial reranking run_6 we use a similar idea and rank at the top all episodes from the test set belonging to shows that were labeled with a given genre label in the development set. Videos belonging to the same show are sorted according to their rank in the initial results list, and otherwise alphabetically. The remaining videos from the initial results list are sorted according to their initial ranks. Note that we consider the strength of the show id as a genre indicator to be an artifact of this particular dataset and do not expect this approach to generalize.
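The following minimal sketch illustrates the run_5 reranking of a single genre's results list. It assumes the initial ranked list and a video-to-show mapping are given; all names are illustrative, and run_6 differs only in how the shows themselves are ordered.

    from statistics import median

    def rerank_by_show(ranked_videos, show_of):
        # ranked_videos: initial results list for one genre, best first.
        # show_of: dict mapping a video id to its show id.
        ranks = {v: i + 1 for i, v in enumerate(ranked_videos)}
        episodes = {}  # show id -> its episodes, in initial rank order
        for v in ranked_videos:
            episodes.setdefault(show_of[v], []).append(v)
        show_median = {s: median(ranks[v] for v in eps)
                       for s, eps in episodes.items()}
        # Keep episodes of a show together; order shows by median rank.
        return [v for s in sorted(episodes, key=show_median.get)
                for v in episodes[s]]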
2.2 Baseline Query Expansions
Besides the approach described in the previous section, we ran several experiments using unexpanded queries (genre labels – a baseline run) and several query expansions: PRF, WordNet, Google Sets and YouTube. To expand queries via YouTube, we first download the metadata (e.g. title, description and tags) of the top-50 ranked videos returned by YouTube for each genre label except the default category, and sample 20 expansion terms using the Offer Weight as explained in the previous section. For a description of the baseline retrieval run and the remaining three query expansion approaches please refer to [4]. The queries composed in this way are used to query the metadata associated with the videos in the test set. Videos in the default category are ranked as described in the previous section. To conserve space, we report only the performance of a hypothetical oracle query expansion indicator that chooses the best performing query expansion (PRF, WordNet, Google Sets or YouTube) or the baseline for each genre (unofficial run_7 in Table 1). Failure analysis confirms that, across genres, all of these choices contribute to performance in individual cases.

Table 1. Performance of the reported runs expressed in terms of MAP; officially submitted runs are indicated with "^"

run_1^   run_2^   run_3^   run_4^   run_5^   run_6    run_7
0.2146   0.2699   0.3212   0.3937   0.4191   0.5594   0.2175

3. DISCUSSION AND CONCLUSIONS
We presented several approaches to genre tagging for web video classification, based on simple and proven information retrieval concepts. The experimental results summarized in Table 1 confirm their effectiveness for the task. We show that it is possible to make effective use of sampled genre-specific expansion terms even when only a limited set of labeled videos is available. Further, we show that the use of metadata yields the highest performance within our framework. Reranking with show ids (run_5 and run_6) further improves performance, but we have strong reservations about the generality of this conclusion, because it might be an artifact of the blip.tv portal. It is also interesting to note that using ASR transcripts together with metadata does not improve the performance of genre tagging, which is contrary to our earlier findings on "general" video retrieval and tagging [2]. Finally, the results in Table 1 show that the performance of the "naïve" baseline, PRF and the query expansions using thesauri and collateral corpora is far below the level of the approach presented in Section 2.1. In the future we will work on refining the approach and investigate its performance on e.g. substantially larger video collections. We will also investigate how the visual modality could be exploited to further improve genre tagging performance.

4. ACKNOWLEDGMENTS
The research leading to these results has received funding from the European Commission's 7th Framework Programme (FP7) under grant agreement n° 216444 (NoE PetaMedia).

5. REFERENCES
[1] Larson, M. et al. 2011. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval '11.
[2] Larson, M. et al. 2011. Automatic tagging and geotagging in video collections and communities. In Proceedings ACM ICMR '11.
[3] Robertson, S. E. 1991. On term selection for query expansion. J. Doc. 46, 4.
[4] Rudinac, S., Larson, M. and Hanjalic, A. 2010. Exploiting noisy visual concept detection to improve spoken content based video retrieval. In Proceedings ACM MM '10.
[5] Ruthven, I. and Lalmas, M. 2003. A survey on the use of relevance feedback for information access systems. Knowl. Eng. Rev. 18, 2.
[6] Xu, P. et al. 2011. TUD-MM at MediaEval 2011 Genre Tagging Task: Video search reranking for genre tagging. In MediaEval '11.