   LIA @ MediaEval 2011 : Compact Representation of
Heterogeneous Descriptors for Video Genre Classification

                         Mickael Rouvier                                         Georges Linarès
                     LIA - University of Avignon                             LIA - University of Avignon
                           Avignon, France                                         Avignon, France
              mickael.rouvier@univ-avignon.fr                           georges.linares@univ-avignon.fr


ABSTRACT

This paper describes our participation in the Genre Tagging Task @ MediaEval 2011, which aims at predicting the genre of Internet videos. We propose a method that extracts a low-dimensional feature space from text, audio and video information. In the best configuration, our system yields a 0.56 MAP (Mean Average Precision) on the test corpus.

Categories and Subject Descriptors

H.3.1 [Information Search and Retrieval]: Content Analysis and Indexing—Indexing methods

General Terms

Algorithms, Measurement, Performance, Experimentation
1. INTRODUCTION

The Genre Tagging Task, held as part of MediaEval 2011, required participants to automatically predict tags of Internet videos. The task consists in associating each video with one and only one of the 26 provided genres [3]. The videos come with additional information such as metadata (title and description), speech recognition transcripts [2] and social network information (gathered from Twitter).

One of the main difficulties of video genre categorization is the diversity of the genre-dependent information sources (spoken content, video structure, audio and video patterns, etc.) and the variability of the video classes. To deal with this problem, we propose to combine various features from the audio and video channels and to reduce the resulting high-dimensional input data to a low-dimensional feature vector while retaining most of the relevant information (reducing redundancies and minimising useless information).

The paper is organized as follows. Section 2 explains how we collect our training data. Section 3 describes our system. Section 4 presents the features used by our system and Section 5 summarizes the results.

2. CORPUS

We treat this task as a classification problem and follow a lightly supervised approach: the training dataset is collected from the Web. A simple and effective way to obtain a corpus is to download the documents returned by a web search engine for suitable queries. For example, for the Health genre we download the videos returned by Youtube for the query health. However, using the genre label alone as a query does not yield a training set that represents the variability of the class. We therefore expand the query with other terms revolving around the genre; for the Religion genre, for instance, we add related terms such as ethnic, belief, freedom and practice. To find terms closely related to a genre, we use Latent Dirichlet Allocation (LDA), an unsupervised word clustering method that relies on word co-occurrence analysis. A 5000-class LDA model was estimated on the Gigaword corpus, each cluster being composed of its 10 best words. Queries are expanded by adding all words from the best cluster containing the genre tag.
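As an illustration, the sketch below shows this expansion step in Python with gensim on a toy corpus; gensim, the tiny document set and the two-topic model are stand-ins chosen here for readability, not the actual tools of our experiments (which used a 5000-topic model estimated on Gigaword).

```python
from gensim import corpora, models

# toy documents standing in for the Gigaword corpus used in the paper
docs = [["religion", "belief", "faith", "practice"],
        ["health", "doctor", "vaccine", "hospital"],
        ["religion", "faith", "church", "belief"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, random_state=0)

def expand_query(genre_tag, topn=10):
    """Return the genre tag plus the top words of the best topic containing it."""
    if genre_tag not in dictionary.token2id:
        return [genre_tag]
    topics = lda.get_term_topics(dictionary.token2id[genre_tag],
                                 minimum_probability=0.0)
    if not topics:
        return [genre_tag]
    best = max(topics, key=lambda t: t[1])[0]   # topic where the tag is strongest
    return [genre_tag] + [w for w, _ in lda.show_topic(best, topn=topn)]

print(expand_query("religion"))   # e.g. ['religion', 'belief', 'faith', ...]
```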
We use different information sources extracted from the audio, speech and video channels. Consequently, our training set should include all of these sources, especially transcriptions of the spoken content. ASR performance on Web data is usually high; however, the ASR system used on the test set is not freely distributed. To overcome this problem, we collect text material by downloading web pages from the Internet. Our training corpus consists of web pages collected from Google (60 documents per class, 1,560 documents in total) and videos collected from Youtube and Dailymotion (120 documents per class, 3,120 documents in total). There are more videos than web pages because of technical restrictions imposed by Google. The collected documents (web pages and videos) are in English only.

3. SYSTEM DESCRIPTION

The proposed system has a two-level architecture. The first level extracts low-dimensional features from speech, audio and video; each feature vector is then given to an SVM (Support Vector Machine) classifier. The second level combines the scores of the three SVM models. This combination is achieved by linear interpolation, whose coefficients are determined on the development corpus.
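The fusion step could look like the following minimal sketch. The linear interpolation and the use of the development corpus are as described above; the grid search over the coefficients is an assumption, since the paper does not detail how the weights are optimised.

```python
import numpy as np

def fuse(s_text, s_audio, s_video, w):
    """Linear interpolation of the three SVM score matrices."""
    wt, wa, wv = w
    return wt * s_text + wa * s_audio + wv * s_video

def tune_weights(dev_scores, dev_labels, step=0.1):
    """Grid-search interpolation coefficients (summing to 1) on the dev corpus."""
    best_w, best_acc = (1.0, 0.0, 0.0), -1.0
    for wt in np.arange(0.0, 1.0 + 1e-9, step):
        for wa in np.arange(0.0, 1.0 - wt + 1e-9, step):
            w = (wt, wa, 1.0 - wt - wa)
            pred = fuse(*dev_scores, w).argmax(axis=1)   # best genre per video
            acc = np.mean(pred == dev_labels)
            if acc > best_acc:
                best_w, best_acc = w, acc
    return best_w

# toy dev data: 4 videos, 3 genres, one score matrix per SVM
rng = np.random.default_rng(0)
dev_scores = tuple(rng.random((4, 3)) for _ in range(3))
dev_labels = np.array([0, 2, 1, 0])
print(tune_weights(dev_scores, dev_labels))
```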
4. FEATURES

4.1 Text

Most linguistic-level methods for video genre classification rely on extracting relevant words from the available video meta-data (closed captions, tags, etc.) after removing stopwords. Our system extracts relevant keywords from the documents using the TF-IDF metric. Words with a high TF-IDF value are generally meaningful, topic-bearing words. We therefore construct a feature vector from the n (n = 600 in our experiments) most frequent words in the documents of the training corpus.
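A possible realisation of this feature with scikit-learn is sketched below; note that sklearn's max_features keeps the most frequent terms across the corpus, which approximates, but is not necessarily identical to, the selection described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-ins for the transcripts / metadata of the training videos
train_docs = ["gospel choir sings at sunday service",
              "doctor explains flu vaccine and health tips"]

# keep the top 600 terms (sklearn ranks them by corpus frequency),
# weighted by TF-IDF, with English stopwords removed
vectorizer = TfidfVectorizer(stop_words="english", max_features=600)
X_train = vectorizer.fit_transform(train_docs)   # (n_docs, <=600) feature matrix
```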
4.2 Audio

One of the most popular approaches to genre identification by acoustic space characterization relies on MFCC (Mel Frequency Cepstral Coefficient) analysis with GMM (Gaussian Mixture Model) or SVM classifiers. However, audio features carry both useful and useless information. Unfortunately, separating the two is a heavy process, and the "useless" space still contains some information that can help distinguish between genres. For this reason, [1] proposed a single space, the total variability space, that models both variabilities. In this new space, a given audio utterance is represented by a new vector named the total factors vector (also referred to as the i-vector), which reduces redundancies and enhances the useful information.

Acoustic MFCC frames are computed every 10 ms over a 20 ms Hamming window. The MFCC vectors are composed of 12 coefficients plus energy, together with the first- and second-order derivatives of these 13 features. In these experiments, the UBM is composed of 512 Gaussians and the i-vector has 400 dimensions.
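The acoustic front-end above could be reproduced roughly as follows (a sketch using librosa, which is our choice here, not necessarily the toolkit used in the experiments); the i-vector extraction on top of the 512-Gaussian UBM is out of scope for this snippet.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)   # 1 s of noise standing in for the audio track

# 13 static features (12 cepstral coefficients + energy-like C0),
# 10 ms frame step, 20 ms Hamming analysis window
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=int(0.010 * sr),
                            win_length=int(0.020 * sr),
                            window="hamming")
frames = np.vstack([mfcc,
                    librosa.feature.delta(mfcc),            # first-order derivatives
                    librosa.feature.delta(mfcc, order=2)])  # second-order derivatives
print(frames.shape)   # (39, n_frames)
```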
4.3 Video

We considered features based on color, such as the Color Structure Descriptor and the Dominant Color Descriptor, and features based on texture, such as the Homogeneous Texture Descriptor (HTD) and the Edge Histogram Descriptor. On this task, texture proved to be the best feature, especially the HTD. The HTD is an efficient descriptor for representing texture information: it provides a quantitative characterization of homogeneous texture regions for similarity retrieval, and consists of the mean and standard deviation of the image together with the energy and energy deviation values of the Fourier transform of the image.

Similarly to the audio feature processing, we extract an i-vector for each video feature. In these experiments, the UBM is composed of 128 Gaussians and the i-vector has 50 dimensions.
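The toy function below illustrates the kind of statistics involved; it is a much-simplified stand-in, since the actual MPEG-7 HTD aggregates such energies over a bank of orientation and scale channels rather than over the raw spectrum.

```python
import numpy as np

def htd_like(image):
    """image: 2-D numpy array of grayscale pixel values."""
    spectrum = np.abs(np.fft.fft2(image)) ** 2      # power spectrum of the image
    return np.array([image.mean(),                  # mean of the image
                     image.std(),                   # standard deviation
                     spectrum.mean(),               # energy
                     spectrum.std()])               # energy deviation

feat = htd_like(np.random.rand(64, 64))             # toy frame
```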
5. RESULTS

We submitted five runs for the Genre Tagging Task, combining the features presented in the sections above. In detail, the configuration of each run was as follows:
Run 1: We use only the text feature, built on the speech transcript.

Run 2: We use text, audio and video features. The text feature is built on the speech transcript and the description of the video given in the metadata.

Run 3: We use text, audio and video features. The text feature is built on the speech transcript, the description of the video given in the metadata, and the tags.
Run 4: In the previous runs, the SVM classifiers were learned on the features of the training corpus. Since the training corpus was downloaded with a lightly supervised method, some of the videos may be incorrectly tagged. To improve performance, we integrate the development corpus, which was provided by MediaEval and whose genres were manually checked, into the training corpus. This run uses exactly the same features as run 3, but the SVM classifier is learned on the features of both the training and development corpora.

Run 5: In the development corpus, we observed that the username of the uploader (present in the metadata) can give useful information for predicting the video genre: a user often uploads multiple videos of the same genre. For example, the users Anglicantv and Aabbey1 mostly upload videos of the genre Religion. We therefore use the dev set as a knowledge base in which the favorite genre of each user is known. For each test video, we check whether the username is present in the dev corpus and increase the score of the genres in which that user uploaded videos, boosting the scores of run 4 accordingly. A post-campaign experiment showed that, using this information alone, the system reaches a MAP of 51%.
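A sketch of this boosting heuristic follows; the boost factor and the data structures are illustrative assumptions, as the paper does not specify how much the scores are increased.

```python
from collections import Counter, defaultdict

# toy dev set; the paper notes e.g. that Anglicantv mostly uploads Religion videos
dev_set = [{"username": "Anglicantv", "genre": "Religion"},
           {"username": "Anglicantv", "genre": "Religion"}]

user_genres = defaultdict(Counter)          # username -> genre counts on the dev set
for video in dev_set:
    user_genres[video["username"]][video["genre"]] += 1

def boost(scores, username, factor=2.0):
    """scores: dict genre -> run-4 score for one test video (factor is a placeholder)."""
    for genre, count in user_genres.get(username, {}).items():
        scores[genre] *= factor * count     # favour the uploader's usual genres
    return scores

print(boost({"Religion": 0.4, "Health": 0.5}, "Anglicantv"))
```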
Table 1: Results of the submitted runs

      Team   Run   Id     MAP
      LIA    1     run1   0.1179
      LIA    2     run2   0.17
      LIA    3     run3   0.1828
      LIA    4     run4   0.1964
      LIA    5     run5   0.5626

From run 2, we observe that the audio and video features provide useful information for predicting the video genre. Runs 2, 3 and 4 achieved similar performance, which means that the different configurations did not strongly affect the global results. The use of the owner id, in contrast, strongly improves the results.

6. CONCLUSION

We have described an approach based on audio, video and text (transcription and metadata) features for video genre classification. According to the results, the username appears to be a simple and highly effective piece of information for predicting the video genre.

7. REFERENCES

[1] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, P. Ouellet, and P. Dumouchel. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In INTERSPEECH, pages 1559–1562, 2009.

[2] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI broadcast news transcription system. Speech Communication, 37(1-2):89–108, 2002.

[3] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.