Genre tagging of videos based on information retrieval and semantic similarity using WordNet*

José M. Perea-Ortega, Arturo Montejo-Ráez, Manuel C. Díaz-Galiano and M. Teresa Martín-Valdivia
SINAI Research Group, Computer Science Department
University of Jaén
23071 - Jaén, Spain
{jmperea,amontejo,mcdiaz,maite}@ujaen.es

* This work has been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER), project TEXT-COOL 2.0 (TIN2009-13391-C04-02) from the Spanish Government, a grant from the Andalusian Government, project GeOasis (P08-TIC-41999) and the Geocaching Urbano research project (RFC/IEG2010).

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy

ABSTRACT
In this paper we propose a new approach to the genre tagging task for videos, using only their ASR transcripts and associated metadata. The approach is based on calculating the semantic similarity between the nouns detected in the video transcripts and a bag of nouns generated from WordNet for each category used to classify the videos. Specifically, we have used the Lin measure based on WordNet, which calculates the semantic distance between two synsets. Since WordNet is an English lexical resource, this approach has only been applied to the English test videos. As a base case, we have applied an information retrieval system as a classifier, using the generated bag of nouns for each category as index data and the ASR transcript of each test video as the query. Several experiments were submitted, one of them combining both approaches (information retrieval and semantic similarity). Our main conclusion is that this combination of semantic similarity and information retrieval improves on the results obtained with the information retrieval approach alone.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Indexing methods

Keywords
Genre video tagging, Video categorization, Automatic Speech Recognition

1. MOTIVATION AND RELATED WORK
Multimedia data are usually tagged with some relevant information in order to make retrieval easier. In fact, the efficient use of textual data associated with other types of information, such as images, can improve multimedia IR systems [1, 4]. However, the labelling provided with multimedia videos may not contain sufficient context for locating data of interest in a large database. Detailed annotation is required so that users can quickly locate clips of interest without having to go through entire databases.

The Genre Tagging task at MediaEval 2011 attempts to automatically generate genre labels to organize videos [2]. In this paper we present some experiments on automatic genre tagging of videos, making use of their Automatic Speech Recognition (ASR) transcripts and associated metadata. In recent years we have worked in the field of video categorization, participating in VideoCLEF [6, 7] and MediaEval 2010 [5].

2. DESCRIPTION OF THE TASK
In the Genre Tagging task of MediaEval 2011, participants are required to automatically assign thematic subject labels to videos using features derived from speech, metadata, audio or visual content. It is important to note that this is not a multilabel tagging task, so a given video can only be assigned one label. The data sets provided are the same as those used in the MediaEval 2010 Wild Wild Web Tagging Task (ME10WWW) [2]. The development and test data sets consisted of 247 and 1,727 videos, respectively. Of the test videos, 1,673 are in English, 16 in French, 25 in Spanish and 13 in Dutch. We have only worked with the English videos. The list of genre classes consisted of 25 tags, including a "default" category for those videos that do not fit any other class.

3. SYSTEM OVERVIEW
Our main approach is based on using an Information Retrieval (IR) system as a classifier. On the one hand, we have generated an XML document, or bag of words, for each proposed category, making use of an external lexical resource, WordNet (http://wordnet.princeton.edu). Specifically, we have included synonyms, hyponyms and domain terms related to the category. For example, for the "educational" category we have generated an XML document including terms such as instruction, teaching, pedagogy, didactics, training, etc. On the other hand, the preprocessed ASR transcripts (stemming and stop word removal) from the test videos have been used as queries, without any expansion. Finally, the Terrier IR system (http://terrier.org) has been used to obtain a measure of relatedness (RSV, Retrieval Status Value) between each video and the generated bags of words.
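The paper does not detail how these bags of words were extracted from WordNet. Below is a minimal sketch of one way to collect them, assuming Python with NLTK's WordNet interface (the authors do not name their toolkit); it covers only synonyms and hyponyms, and the function name and expansion depth are illustrative.

```python
# Sketch: build a bag of nouns for a genre category from WordNet,
# collecting the lemmas of the category's noun synsets plus their
# hyponyms. Requires a one-time nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def bag_of_nouns(category, max_depth=2):
    """Collect noun lemmas related to a category term: synonyms of
    its noun synsets plus hyponyms down to max_depth levels."""
    bag = set()

    def expand(synset, depth):
        for lemma in synset.lemmas():
            bag.add(lemma.name().replace('_', ' '))
        if depth < max_depth:
            for hypo in synset.hyponyms():
                expand(hypo, depth + 1)

    for synset in wn.synsets(category, pos=wn.NOUN):
        expand(synset, 0)
    return bag

# Example: the bag for the "educational" genre would gather terms
# such as "instruction", "teaching", "pedagogy", "didactics", ...
print(sorted(bag_of_nouns('education'))[:10])
```

The domain terms the authors also include could be gathered similarly, e.g. via the topic-domain pointers that recent NLTK versions expose for some synsets.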
As a second approach, we have used the formula proposed by Lin [3], which is based on WordNet, to measure the semantic similarity between the nouns detected in each test video and the bags of words generated for each category. Firstly, to calculate the semantic similarity between a video and the XML document generated for a category, we obtained the Lin semantic similarity between each pair of nouns from the two sets, accumulating the similarity scores that exceeded a threshold set at 0.75. With this threshold we tried to minimize the effect of the size of the ASR transcripts, since some videos contain more words than others. Secondly, the accumulated similarity score was divided by the number of words detected in the video, obtaining the final semantic similarity score. Videos with a final semantic similarity score of less than 0.25 were assigned to the default category.
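As a rough sketch of this scoring scheme, the fragment below uses NLTK's Lin similarity over noun synsets with the Brown information content file; both the choice of IC corpus and the word-to-synset strategy (taking the best-scoring synset pair) are assumptions, since the paper does not specify them.

```python
# Sketch of the per-category scoring described above (assumed NLTK
# setup; requires nltk.download('wordnet') and
# nltk.download('wordnet_ic')).
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # assumed IC corpus

def lin_sim(noun_a, noun_b):
    """Best Lin similarity over the noun synsets of two words."""
    best = 0.0
    for s1 in wn.synsets(noun_a, pos=wn.NOUN):
        for s2 in wn.synsets(noun_b, pos=wn.NOUN):
            try:
                best = max(best, s1.lin_similarity(s2, brown_ic))
            except Exception:
                pass  # some synset pairs have no usable IC value
    return best

def category_score(video_nouns, category_bag, pair_threshold=0.75):
    """Accumulate pairwise similarities above the threshold, then
    normalize by the number of nouns detected in the video."""
    total = 0.0
    for vn in video_nouns:
        for cn in category_bag:
            sim = lin_sim(vn, cn)
            if sim >= pair_threshold:
                total += sim
    return total / len(video_nouns) if video_nouns else 0.0

# A video whose best category_score stays below 0.25 would fall
# back to the "default" category.
```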
4. EVALUATION OF RESULTS
Several experiments were carried out under both approaches. As a baseline, we considered the use of the preprocessed ASR transcripts from the test videos as queries (experiment IR ASR). Then, we tried to evaluate the addition of the provided metadata, carrying out an expansion of the ASR transcripts using such metadata (experiments IR ASR+MD and IR ASR+MD+TAGS). Regarding the second approach, we submitted the experiment SIMSEM-ASR, in which we only calculate the semantic similarity between each video and each category (its bag of words), without using the IR approach. Finally, we combined the IR and semantic similarity approaches (experiment SIMSEM+IR-ASR), merging both lists of results: first, we normalized the RSV scores from the baseline; then, for each test video, we added its normalized RSV and semantic similarity scores. A sketch of this merging step is shown below.
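The paper does not specify the normalization scheme, so the sketch below assumes min-max normalization of the RSV scores and a plain sum with the semantic similarity scores; the dictionary-based data layout is also hypothetical.

```python
# Sketch of the run-merging step (assumed data layout: dicts keyed
# by (video_id, category); min-max normalization is an assumption,
# the paper only says the RSV scores were "normalized").

def combine_runs(rsv_scores, sem_scores):
    """Return normalized RSV + semantic similarity per (video, category)."""
    lo, hi = min(rsv_scores.values()), max(rsv_scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero
    combined = {}
    for key in set(rsv_scores) | set(sem_scores):
        norm_rsv = (rsv_scores.get(key, lo) - lo) / span
        combined[key] = norm_rsv + sem_scores.get(key, 0.0)
    return combined

# Since the task is single-label, each video is finally tagged with
# the category that maximizes the combined score.
```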
The results obtained are shown in Table 1, using the Mean Average Precision (MAP) measure. We also show the MAP obtained considering only the English test videos.

  Run name          MAP (official)   MAP (English)
  IR ASR            0.1031           0.1044
  IR ASR+MD         0.1073           0.1088
  IR ASR+MD+TAGS    0.1115           0.1129
  SIMSEM-ASR        0.0547           0.0559
  SIMSEM+IR-ASR     0.1266           0.1288

Table 1: Experiments and results obtained by SINAI in the MediaEval 2011 Genre Tagging task.

Analyzing the official results, we can observe that expanding the ASR transcripts with the provided metadata improves the results obtained without metadata (+4% and +8.15% for the experiments IR ASR+MD and IR ASR+MD+TAGS, respectively), as expected. On the other hand, the combination of the semantic similarity and IR approaches is promising, since it improves the MAP obtained by the baseline using the IR approach alone (+22.79%). According to the test groundtruth file provided by the MediaEval organizers, 185 of the 1,673 English videos (11.06%) belong to the default category, while our best experiment assigned only 18 videos (1.08%) to that category. This was caused by the low threshold used to assign a video to the default category (0.25), which allowed videos to be classified into categories to which they did not really belong, despite their low similarity scores. Nevertheless, for some categories (art, politics, religion and sports) we obtained good results, achieving high individual MAP scores (e.g. 0.6176 for the politics category). This is because such categories are more general concepts or genres than others (business, comedy, documentary, etc.), so it was easier to find more semantically related nouns, increasing the size of the XML document generated for those categories and, therefore, the probability of success.

5. CONCLUSIONS
In this paper we propose the use of WordNet-based semantic similarity combined with an IR approach to solve the genre tagging task for videos. Because our research field of interest is Natural Language Processing (NLP), we have only worked with the ASR transcripts of the videos and their metadata. We have shown that combining the semantic similarity score with the RSV score obtained from the IR approach yields a significant improvement. Nevertheless, it seems clear that working only with the ASR transcripts generally gives poor results. For future work, we will study other resources in order to increase the size of the bag of words generated for each category, adding more terms semantically related to those categories.

6. REFERENCES
[1] Bozzon, A., and Fraternali, P. Multimedia and multimodal information retrieval. In SeCO Workshop (2009), S. Ceri and M. Brambilla, Eds., vol. 5950 of Lecture Notes in Computer Science, Springer, pp. 135–155.
[2] Larson, M., Eskevich, M., Ordelman, R., Kofler, C., Schmiedeke, S., and Jones, G. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop (Pisa, Italy, September 1-2, 2011).
[3] Lin, D. An information-theoretic definition of similarity. In Proc. of the 15th Int'l Conf. on Machine Learning (1998), pp. 296–304.
[4] Martín-Valdivia, M. T., Díaz-Galiano, M. C., Montejo-Ráez, A., and Ureña-López, L. A. Using Information Gain to Improve Multimodal Information Retrieval Systems. Information Processing & Management 44 (2008), 1146–1158.
[5] Perea-Ortega, J. M., Montejo-Ráez, A., Díaz-Galiano, M. C., and Martín-Valdivia, M. T. SINAI at Tagging Task Professional in MediaEval 2010. In Working Notes Proceedings of the MediaEval 2010 Workshop, Pisa, Italy, October 24, 2010 (2010).
[6] Perea-Ortega, J. M., Montejo-Ráez, A., Díaz-Galiano, M. C., Martín-Valdivia, M. T., and Ureña-López, L. A. Using an information retrieval system for video classification. In Evaluating Systems for Multilingual and Multimodal Information Access (2009), vol. 5706 of Lecture Notes in Computer Science, Springer, pp. 927–930.
[7] Perea-Ortega, J. M., Montejo-Ráez, A., Martín-Valdivia, M. T., and Ureña-López, L. A. Using support vector machines as learning algorithm for video categorization. In CLEF, Part II (2010), vol. 6242 of Lecture Notes in Computer Science, Springer, in press.