<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIA @ MediaEval 2011 : Compact Representation of Heterogeneous Descriptors for Video Genre Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mickael Rouvier</string-name>
          <email>mickael.rouvier@univ-avignon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georges Linarès</string-name>
          <email>georges.linares@univ-avignon.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIA - University of Avignon</institution>
          ,
          <addr-line>Avignon</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>This paper describes our participation in the Genre Tagging Task @ MediaEval 2011, which aims at predicting the genre of Internet videos. We propose a method that extracts a low-dimensional feature space from text, audio and video information. In the best configuration, our system yields a MAP (Mean Average Precision) of 0.56 on the test corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The Genre Tagging Task, held as part of MediaEval 2011,
required participants to automatically predict the tags of videos
from the Internet. The task consists of assigning each video to
exactly one of the 26 provided genres [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The videos
are provided with some additional information like metadata
(title and description), speech recognition transcripts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
social network information (gathered from Twitter).
      </p>
      <p>One of the main difficulties of video genre categorization
stems from the diversity of the information sources that are
genre-dependent (spoken contents, video structure, audio
and video patterns, etc.) and from the variability of video
classes. To deal with this problem, we propose to
combine various features from the audio and video channels and
to reduce the resulting high-dimensional input data to a
low-dimensional feature vector while retaining most of the
relevant information (reducing redundancies and minimising
useless information).</p>
      <p>The paper is organized as follows. Section 2 explains how
we collect our training data. Section 3 describes our
system. Section 4 presents the features used by our system and
Section 5 summarizes the results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. CORPUS</title>
      <p>One way to obtain a corpus is to download the documents returned by a
web search engine for useful queries. For example, for the
Health genre we download all the videos on Youtube returned by
the query "health". However, using only the genre name as the query
does not yield a training set that represents
the class variability. We therefore propose to expand the query with
other terms revolving around the genre. For example, for
the Religion genre, we need to expand the query with related
terms such as ethnic, belief, freedom, practice, etc. In order
to find terms closely related to the genre, we propose to use
Latent Dirichlet Allocation (LDA). LDA is an unsupervised
word clustering method that relies on word co-occurrence
analysis. A 5000-class LDA model was estimated on
the Gigaword corpus. Each cluster is composed of its 10
best words. Queries are expanded by adding all the words from
the best cluster containing the genre tag.</p>
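      <p>The query expansion step can be illustrated with a minimal sketch. It assumes that the LDA clusters are already available as ranked word lists and that the "best" cluster is the one ranking the genre tag highest; the function name, this selection rule and the toy clusters are hypothetical and are not taken from our system.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def expand_query(genre, clusters):
    """Return the genre tag plus the words of the best cluster containing it.

    `clusters` maps a cluster id to its top words, ordered best first
    (a hypothetical stand-in for the top-10 word lists of an LDA model).
    """
    tag = genre.lower()
    # Keep only the clusters whose top-word list contains the genre tag.
    candidates = [words for words in clusters.values() if tag in words]
    if not candidates:
        return [tag]  # nothing to expand with: fall back to the bare tag
    # Assumption: the "best" cluster is the one that ranks the tag highest.
    best = min(candidates, key=lambda words: words.index(tag))
    return [tag] + [w for w in best if w != tag]


# Toy clusters, invented for illustration only.
toy_clusters = {
    0: ["church", "religion", "faith", "priest"],
    1: ["religion", "ethnic", "belief", "freedom", "practice"],
    2: ["health", "doctor", "hospital"],
}
print(expand_query("Religion", toy_clusters))
```
]]></preformat>
      <p>With these toy clusters, the tag "religion" is ranked first in cluster 1, so the expanded query contains the tag followed by the remaining words of that cluster.</p>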
      <p>We propose the use of different information sources
extracted from the audio, speech and video channels. Consequently,
our training set should include all these sources, especially
transcriptions of spoken contents. ASR performance on Web
data is usually high; however, the ASR system used on the
test set is not freely distributed. To overcome this problem,
we propose to collect text materials by downloading web
pages from the Internet. Our training corpus consists of web
pages collected from Google (60 documents per class, 1560
documents) and videos collected from Youtube and
Dailymotion (120 documents per class, 3120 documents). There
are more videos than web pages because of technical
restrictions imposed by Google. The collected documents (web
pages and videos) are in English only.</p>
    </sec>
    <sec id="sec-3">
      <title>3. SYSTEM DESCRIPTION</title>
      <p>The proposed system has a 2-level architecture. The first
level consists of extracting low-dimensional features from
speech, audio and video. Each feature is then given to an
SVM (Support Vector Machine) classifier. The second level
combines the scores of the three SVM models. This
combination is achieved by linear interpolation, whose coefficients are
determined on the development corpus.</p>
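      <p>The second level can be sketched as a weighted sum of per-genre scores. The sketch below assumes that each SVM outputs one score per genre and that the interpolation weights have already been chosen; the function names, weights and scores are illustrative values, not those of our system.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def fuse_scores(scores_by_modality, weights):
    """Linearly interpolate per-genre scores coming from several classifiers.

    `scores_by_modality[m][g]` is the score of genre g given by the
    classifier of modality m; `weights[m]` is its interpolation coefficient.
    """
    genres = next(iter(scores_by_modality.values())).keys()
    return {
        g: sum(weights[m] * scores_by_modality[m][g] for m in scores_by_modality)
        for g in genres
    }


def predict(scores_by_modality, weights):
    """Return the genre with the highest fused score."""
    fused = fuse_scores(scores_by_modality, weights)
    return max(fused, key=fused.get)


# Illustrative values only (hypothetical scores and weights).
toy_scores = {
    "text": {"health": 0.5, "religion": 0.25},
    "audio": {"health": 0.0, "religion": 1.0},
    "video": {"health": 0.5, "religion": 0.5},
}
toy_weights = {"text": 0.5, "audio": 0.25, "video": 0.25}
print(fuse_scores(toy_scores, toy_weights))  # {'health': 0.375, 'religion': 0.5}
print(predict(toy_scores, toy_weights))      # religion
```
]]></preformat>
      <p>In practice, the weights would be selected on the development corpus, for example by keeping the combination that gives the best MAP among a small set of candidate values.</p>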
    </sec>
    <sec id="sec-4">
      <title>4. FEATURES</title>
    </sec>
    <sec id="sec-5">
      <title>4.1 Text</title>
      <p>Most of the linguistic-level methods for video genre
classification rely on extracting relevant words from the available
video metadata (closed captions, tags, etc.) by removing
stopwords. Our system extracts relevant
keywords from the documents by using the TF-IDF metric.</p>
      <p>The words with a high TF-IDF value are generally
meaningful, topic-bearing words. Thus, we propose to construct a
feature vector with the n (n = 600 in our experiments) most
frequent words in the documents of the training corpus.</p>
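      <p>A minimal sketch of TF-IDF keyword selection follows. It assumes whitespace tokenisation, the standard tf × log(N/df) weighting, a word score equal to its highest TF-IDF value over the documents, and raw counts as vector entries; these choices and the function names are assumptions made for illustration, not a description of our implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def top_tfidf_words(documents, n):
    """Return the n words with the highest TF-IDF score in the corpus."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    num_docs = len(tokenized)
    best = {}
    for tokens in tokenized:
        for word, count in Counter(tokens).items():
            score = (count / len(tokens)) * math.log(num_docs / df[word])
            # Assumption: a word is scored by its best value over documents.
            best[word] = max(best.get(word, 0.0), score)
    # Sort by decreasing score; ties are broken alphabetically.
    return sorted(best, key=lambda w: (-best[w], w))[:n]


def to_feature_vector(document, vocabulary):
    """Count how often each selected keyword occurs in a document."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]


# Toy documents, invented for illustration only.
docs = [
    "the priest spoke about faith and faith",
    "the doctor spoke about health",
    "the match was a great match",
]
vocabulary = top_tfidf_words(docs, 2)
print(vocabulary)
print(to_feature_vector("faith and match and faith", vocabulary))
```
]]></preformat>
      <p>A word such as "the", which occurs in every toy document, receives a score of zero and is never selected, which is the stopword-like behaviour expected from the IDF term.</p>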
    </sec>
    <sec id="sec-6">
      <title>4.2 Audio</title>
      <p>One of the most popular approaches for genre identification
by acoustic space characterization relies on MFCC (Mel
Frequency Cepstral Coefficients) analysis and GMM (Gaussian
Mixture Model) or SVM classifiers.</p>
      <p>However, audio features include both useful and useless
information.</p>
      <p>
        Unfortunately, separating useful and useless information
is a costly process, and the useless space still contains some
information that can be used to distinguish between genres.
For this reason, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a single space (the total
variability space) that models the two variabilities. In this new
space, a given audio utterance is represented by a new vector
of total factors (also referred to as an i-vector),
which reduces redundancies and enhances useful
information.
      </p>
      <p>MFCC acoustic frames are computed every 10 ms over a
20 ms Hamming window. MFCC vectors are
composed of 12 coefficients, the energy, and the first- and second-order
derivatives of these 13 features. For these experiments, the
UBM (Universal Background Model) is composed of 512 Gaussians and the i-vector is a
400-dimensional vector.</p>
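      <p>The size of the resulting acoustic vectors follows directly from this description, as the short sketch below shows. The frame count assumes that only complete windows are kept, without padding at the signal boundaries; this convention and the function name are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def mfcc_layout(n_cepstral=12, shift_ms=10, window_ms=20, duration_ms=1000):
    """Return (vector dimension, number of frames) for an MFCC front-end."""
    static = n_cepstral + 1  # 12 cepstral coefficients + energy = 13
    dim = 3 * static         # static features + first and second derivatives
    if duration_ms < window_ms:
        return dim, 0
    # Assumption: only complete windows are kept (no padding at the edges).
    n_frames = 1 + (duration_ms - window_ms) // shift_ms
    return dim, n_frames


print(mfcc_layout())  # (39, 99)
```
]]></preformat>
      <p>Each frame is therefore a 39-dimensional vector, and one second of audio yields 99 frames under the stated convention.</p>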
    </sec>
    <sec id="sec-7">
      <title>4.3 Video</title>
      <p>We used color-based features, such as the Color Structure
Descriptor and the Dominant Color Structure, and texture-based
features, such as the Homogeneous Texture Descriptor (HTD)
and the Edge Histogram Descriptor. On this task, texture appears
to be the best feature, especially the HTD.</p>
      <p>HTD is an efficient feature not only for computing
texture features but also for representing texture information.
HTD provides a quantitative characterization of
homogeneous texture regions for similarity retrieval. HTD consists
of the mean and the standard deviation of an image, together with the
energy and energy deviation values of the Fourier transform of the
image.</p>
      <p>Similarly to the audio feature processing, we extract an
i-vector for each video feature. For these experiments, the
UBM is composed of 128 Gaussians and the i-vector is a
50-dimensional vector.</p>
    </sec>
    <sec id="sec-8">
      <title>5. RESULTS</title>
      <p>We submitted five runs for the Genre Tagging Task,
combining the features presented in the section above. In detail,
the configuration for each run was as follows:</p>
      <p>Run 1: We use only the text feature. The text feature is built
on the speech transcript.</p>
      <p>Run 2: We use text, audio and video features. The text
feature is built on the speech transcript and the description of the
video given in the metadata.</p>
      <p>Run 3: We use text, audio and video features. The text
feature is built on the speech transcript, the description of the video
given in the metadata, and the tags.</p>
      <p>Run 4: In the previous runs, the SVM classifier was
trained on the features extracted from the training corpus. Note
that the training corpus was downloaded by using
a lightly supervised method, so some of the videos may be
incorrectly tagged. To improve the performance, we
propose to integrate the development corpus into the training
corpus. The development corpus, in which the genres were manually
checked, was provided by MediaEval. This run uses
exactly the same features as run 3, but the SVM classifier was
trained on the features extracted from both the training and
development corpora.</p>
      <p>Run 5: In the development corpus, we observed that the
username of the video uploader (present in the metadata) can give
some interesting information to predict the video genre.
Indeed, a user often uploads multiple videos of the same genre.
For example, the users Anglicantv or Aabbey1 often upload
videos of the genre Religion. Here, we use the dev set as a
knowledge base, in which the favorite genre of each user is known.
For each video, we check whether the username is present in the
dev corpus and, if so, increase the score of the genre in which the
user uploaded videos. We thus boost the scores from
run 4 according to this new information. We conducted a
post-campaign experiment which shows that, by using only
this information, the system achieves a MAP of 51%.</p>
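      <p>The username-based boosting of run 5 can be sketched as follows. The sketch assumes that the boost added to a genre is proportional to the share of the user's dev-set uploads in that genre; this rule, the boost value and the function names are assumptions, and the toy data are invented apart from the username quoted above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def build_user_genres(dev_videos):
    """Map each username to the genres of their videos in the dev set."""
    user_genres = defaultdict(Counter)
    for username, genre in dev_videos:
        user_genres[username][genre] += 1
    return user_genres


def boost_scores(scores, username, user_genres, boost=1.0):
    """Raise the scores of the genres in which this user already uploaded."""
    boosted = dict(scores)
    history = user_genres.get(username)
    if not history:
        return boosted  # unknown user: scores are left unchanged
    total = sum(history.values())
    for genre, count in history.items():
        # Assumption: boost proportional to the user's share of uploads.
        boosted[genre] = boosted.get(genre, 0.0) + boost * count / total
    return boosted


# Toy dev set and classifier scores, invented for illustration only.
knowledge_base = build_user_genres([
    ("Anglicantv", "religion"),
    ("Anglicantv", "religion"),
    ("some_user", "health"),
])
run4_scores = {"religion": 0.25, "health": 0.5}
print(boost_scores(run4_scores, "Anglicantv", knowledge_base))
```
]]></preformat>
      <p>In this toy case the video would be tagged Health from the classifier scores alone, but the uploader's history moves Religion to the top.</p>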
      <p>In run 2, we observe that the audio and video features
provide interesting information to predict the video genre.
Runs 2, 3 and 4 achieved similar performance, which means
that the differences between these configurations did not strongly
affect the overall results. The use of the owner id strongly
improves the results.</p>
    </sec>
    <sec id="sec-9">
      <title>6. CONCLUSION</title>
      <p>We have described in this paper an approach based on
the use of audio, video and text (transcription and metadata)
features for Video Genre Classification. According to the
results, the username appears to be a simple and highly effective
cue for predicting the video genre.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kenny</surname>
          </string-name>
          , N. Brummer, P. Ouellet, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Dumouchel</surname>
          </string-name>
          .
          <article-title>Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification</article-title>
          .
          <source>In INTERSPEECH</source>
          , pages
          <volume>1559</volume>
          –
          <fpage>1562</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          , and
          <string-name>
            <surname>G. Adda.</surname>
          </string-name>
          <article-title>The LIMSI broadcast news transcription system</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>37</volume>
          (
          <issue>1-2</issue>
          ):
          <volume>89</volume>
          –
          <fpage>108</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kofler</surname>
          </string-name>
          , S. Schmiedeke, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task</article-title>
          . In MediaEval 2011 Workshop, Pisa, Italy, September 1-2
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>