<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Genre tagging of videos based on information retrieval and semantic similarity using WordNet</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José M. Perea-Ortega</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arturo Montejo-Ráez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel C. Díaz-Galiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Teresa Martín-Valdivia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SINAI Research Group, Computer Science Department, University of Jaén, 23071 Jaén</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>In this paper we propose a new approach to the genre tagging task for videos, using only their ASR transcripts and associated metadata. The approach is based on calculating the semantic similarity between the nouns detected in the video transcripts and a bag of nouns generated from WordNet for each category proposed to classify the videos. Specifically, we have used the Lin measure based on WordNet, which calculates the semantic similarity between two synsets. Because WordNet is an English lexical resource, this approach has been applied only to the English test videos. As a baseline, we have applied an information retrieval system as a classifier, using the bag of nouns generated for each category as index data and the ASR transcript of each test video as the query. Several experiments have been submitted, one of them combining both approaches (information retrieval and semantic similarity). Our main conclusion is that combining semantic similarity and information retrieval improves on the results obtained with the information retrieval approach alone.</p>
      </abstract>
      <kwd-group>
        <kwd>Genre video tagging</kwd>
        <kwd>Video categorization</kwd>
        <kwd>Automatic Speech Recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.3.1 [Information Storage and Retrieval]: Content
Analysis and Indexing - Indexing methods</p>
    </sec>
    <sec id="sec-2">
      <title>MOTIVATION AND RELATED WORK</title>
      <p>
        Multimedia data are usually tagged with relevant
information in order to make retrieval easier. In fact,
the efficient use of textual data associated with other types of
information, such as images, can improve multimedia IR
systems [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. However, the labelling provided for multimedia
videos may not contain sufficient context for locating data of
interest in a large database. Detailed annotation is required
so that users can quickly locate clips of interest without
having to go through entire databases.
        <fn>
          <p>This work has been partially supported by a grant from the
Fondo Europeo de Desarrollo Regional (FEDER), project
TEXT-COOL 2.0 (TIN2009-13391-C04-02) from the
Spanish Government, a grant from the Andalusian Government,
project GeOasis (P08-TIC-41999) and Geocaching Urbano
research project (RFC/IEG2010).</p>
        </fn>
      </p>
      <p>
        The Genre Tagging task in MediaEval 2011 attempts to
automatically generate genre labels to organize videos [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In
this paper we present experiments on automatic genre
tagging of videos that make use of their Automatic Speech
Recognition (ASR) transcripts and associated metadata. We
have worked in the field of video categorization in recent
years, participating in VideoCLEF [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] and MediaEval
2010 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>DESCRIPTION OF THE TASK</title>
      <p>
        In the Genre Tagging task of MediaEval 2011, participants
are required to automatically assign thematic subject labels
to videos using features derived from speech, metadata,
audio or visual content. Note that this is not a multilabel
tagging task: each video can be assigned only one label.
The data set provided is the same as that used in the
MediaEval 2010 Wild Wild Web Tagging Task
(ME10WWW) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The development and test data sets
consisted of 247 and 1,727 videos, respectively. Of the test
videos, 1,673 are in English, 16 in French, 25 in Spanish
and 13 in Dutch. We have worked only with
the English videos. The list of genre classes consisted of 25
tags and provided a “default category” for videos that do
not fit any other class.
      </p>
    </sec>
    <sec id="sec-4">
      <title>SYSTEM OVERVIEW</title>
      <p>Our main approach is based on using an Information
Retrieval (IR) system as a classifier. On the one hand, we have
generated an XML document, or bag of words, for each
proposed category, making use of an external lexical resource,
WordNet (http://wordnet.princeton.edu). Specifically, we have
included synonyms, hyponyms and domain terms related to the
category. For example, for the “educational” category we have
generated an XML document including terms such as instruction,
teaching, pedagogy, didactics and training. On the other hand,
the preprocessed ASR transcripts (stemming and stop word
removal) of the test videos have been used as queries, without
any expansion. Finally, the Terrier (http://terrier.org) IR system
has been used to obtain a measure of relatedness (RSV, Retrieval
Status Value) between each video and the generated bags of words.</p>
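      <p>The idea of using retrieval as classification can be sketched in a few lines of Python. This is only an illustrative stand-in, not the Terrier configuration used in the experiments: the tf-idf weighting, the function names and the “sports” terms are assumptions made for the example, while the “educational” terms are those listed above.</p>
      <preformat>
```python
import math
from collections import Counter

def build_idf(category_bags):
    """Compute smoothed IDF weights over the category documents."""
    n_docs = len(category_bags)
    doc_freq = Counter()
    for terms in category_bags.values():
        doc_freq.update(set(terms))
    # Smoothing keeps a positive weight for terms that occur in every bag.
    return {t: math.log(1 + n_docs / df) for t, df in doc_freq.items()}

def toy_rsv(query_terms, doc_terms, idf):
    """Toy retrieval status value: sum of tf * idf over matching query terms.
    (Assumption: Terrier's actual weighting model is not reproduced here.)"""
    tf = Counter(doc_terms)
    return sum(tf[t] * idf.get(t, 0.0) for t in query_terms)

def rank_categories(query_terms, category_bags):
    """Rank the categories for one transcript used as the query."""
    idf = build_idf(category_bags)
    scores = {c: toy_rsv(query_terms, bag, idf) for c, bag in category_bags.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical index: one bag of words per category.
bags = {
    "educational": ["instruction", "teaching", "pedagogy", "didactics", "training"],
    "sports": ["football", "athlete", "match", "training"],
}
# Hypothetical preprocessed transcript used as the query.
ranking = rank_categories(["teaching", "training", "pedagogy"], bags)
print(ranking[0][0])
```
      </preformat>
      <p>In this sketch the top-ranked category is taken as the label of the video, which matches the single-label setting of the task.</p>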
      <p>
        As a second approach, we have used the formula proposed
by Lin [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is based on WordNet, to measure the
semantic similarity between the nouns detected in each test
video and the bag of words generated for each category.
First, to calculate the semantic similarity between a
video and the XML document generated for a category, we
have obtained the Lin semantic similarity between each pair
of nouns from both, accumulating those similarity scores
that exceed a threshold set at 0.75. With this
threshold, we have tried to minimize the effect of the size
of the ASR transcripts, since some videos contain
more words than others. Second, the accumulated
similarity score has been divided by the number of words detected
in the video, giving the final semantic similarity score.
Videos with a final semantic similarity score below
0.25 were assigned to the default category.
      </p>
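      <p>The Lin measure is commonly written as sim(c1, c2) = 2 IC(lcs(c1, c2)) / (IC(c1) + IC(c2)), where IC is the information content of a synset and lcs is the lowest common subsumer. The scoring procedure described above can be sketched as follows. The pairwise similarity function is passed in as a parameter, so the sketch does not depend on a particular WordNet library (NLTK's lin_similarity would be one possible choice); the function names, the toy similarity table and the use of the noun count as the divisor are assumptions made for the example.</p>
      <preformat>
```python
from itertools import product

def video_category_score(video_nouns, category_nouns, sim,
                         pair_threshold=0.75, video_word_count=None):
    """Accumulate pairwise similarities above the threshold, then divide
    by the number of words detected in the video."""
    if video_word_count is None:
        # Assumption: use the number of nouns when no word count is given.
        video_word_count = len(video_nouns)
    if video_word_count == 0:
        return 0.0
    total = 0.0
    for v, c in product(video_nouns, category_nouns):
        score = sim(v, c)
        if score > pair_threshold:
            total += score
    return total / video_word_count

def assign_category(video_nouns, category_bags, sim,
                    default_threshold=0.25, default_label="default"):
    """Pick the best-scoring category, falling back to the default one."""
    scores = {c: video_category_score(video_nouns, bag, sim)
              for c, bag in category_bags.items()}
    best = max(scores, key=scores.get)
    label = best if scores[best] >= default_threshold else default_label
    return label, scores

# Toy similarity table standing in for a WordNet-based Lin measure.
TOY_SIM = {("teacher", "teaching"): 0.9, ("school", "instruction"): 0.8}

def toy_sim(a, b):
    return TOY_SIM.get((a, b), 0.1)

bags = {"educational": ["teaching", "instruction"], "sports": ["football"]}
label, scores = assign_category(["teacher", "school", "car"], bags, toy_sim)
fallback, _ = assign_category(["car"], bags, toy_sim)
print(label, fallback)
```
      </preformat>
      <p>With the toy table, only two noun pairs exceed 0.75, so the first video is labelled “educational”, while the second one has no pair above the threshold and falls into the default category.</p>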
    </sec>
    <sec id="sec-5">
      <title>EVALUATION OF RESULTS</title>
      <p>Several experiments were carried out under both
approaches. As a baseline, we have used the
preprocessed ASR transcripts of the test videos as queries
(experiment IR ASR). Then, we have evaluated the
addition of the provided metadata by expanding
the ASR transcripts with such metadata
(experiments IR ASR+MD and IR ASR+MD+TAGS).
Regarding the second approach, we have submitted the experiment
“SIMSEM-ASR”, in which we only calculate the semantic
similarity between each video and each category (its bag of
words), without using the IR approach. Finally, we have
combined the IR and the semantic similarity approaches
(experiment “SIMSEM+IR-ASR”), merging both lists of
results. First, we have normalized the RSV scores of the
baseline. Then, for each test video, we have added its
normalized RSV and semantic similarity scores. The results
obtained are shown in Table 1, using the Mean Average
Precision (MAP) measure. We also show the MAP obtained
considering only the English test videos.</p>
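      <p>A minimal sketch of this merging step is given below. The text does not state which normalization was applied to the RSV scores, so min-max scaling is an assumption here, as are the function names and the example scores.</p>
      <preformat>
```python
def min_max_normalize(scores):
    """Scale scores to [0, 1]; assumed normalization, not confirmed by the paper."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def combine(rsv_scores, sim_scores):
    """Add the normalized RSV and the semantic similarity score per category."""
    norm = min_max_normalize(rsv_scores)
    categories = set(norm) | set(sim_scores)
    combined = {c: norm.get(c, 0.0) + sim_scores.get(c, 0.0) for c in categories}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for one test video.
rsv = {"politics": 12.0, "sports": 8.0, "art": 4.0}
sim = {"politics": 0.30, "sports": 0.90, "art": 0.10}
ranking = combine(rsv, sim)
print(ranking[0][0])
```
      </preformat>
      <p>In this example the IR ranking alone would prefer “politics”, but the higher semantic similarity moves “sports” to the top of the merged list.</p>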
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>MAP obtained by each run on all test videos (official) and on the English test videos only.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run name</th><th>MAP (official)</th><th>MAP (English)</th></tr>
          </thead>
          <tbody>
            <tr><td>IR ASR</td><td>0.1031</td><td>0.1044</td></tr>
            <tr><td>IR ASR+MD</td><td>0.1073</td><td>0.1088</td></tr>
            <tr><td>IR ASR+MD+TAGS</td><td>0.1115</td><td>0.1129</td></tr>
            <tr><td>SIMSEM-ASR</td><td>0.0547</td><td>0.0559</td></tr>
            <tr><td>SIMSEM+IR-ASR</td><td>0.1266</td><td>0.1288</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Analyzing the official results, we can observe that
expanding the ASR transcripts with the provided
metadata improves on the result obtained without metadata
(+4% and +8.15% for the experiments IR ASR+MD
and IR ASR+MD+TAGS, respectively), as expected.
On the other hand, the combination of the semantic
similarity and IR approaches is promising because it
improves the MAP value obtained for the baseline, which uses the
IR approach only (+22.79%). According to the test
ground truth file provided by the MediaEval organizers, 185
of the 1,673 English videos (11.06%) belong to the
default category, while our best experiment assigned only
18 videos (1.08%) to that category. This was caused by
the low threshold used to assign a video to the default
category (0.25), which allowed videos to be classified into categories
that did not really correspond to them despite a low similarity score.
Nevertheless, for some categories (art, politics, religion and
sports) we obtained good results, achieving high individual
MAP scores (e.g. 0.6176 for the politics category). This is
because these categories are more general concepts or genres
than others (business, comedy, documentary, etc.), so it was
easier to find more semantically related nouns, which increased
the size of the XML document generated for those categories
and, therefore, the probability of success.</p>
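      <p>The relative improvements quoted above can be reproduced from the official MAP values with a short calculation. The sketch below is only a worked check of that arithmetic; the variable and function names are chosen for the example.</p>
      <preformat>
```python
def relative_gain(baseline, value):
    """Relative change with respect to the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline

# Official MAP values as reported for each run.
official_map = {
    "IR ASR": 0.1031,
    "IR ASR+MD": 0.1073,
    "IR ASR+MD+TAGS": 0.1115,
    "SIMSEM-ASR": 0.0547,
    "SIMSEM+IR-ASR": 0.1266,
}
baseline = official_map["IR ASR"]
gains = {run: round(relative_gain(baseline, value), 2)
         for run, value in official_map.items() if run != "IR ASR"}

# Share of English test videos in the default category:
# ground truth (185 of 1,673) versus the best run (18 of 1,673).
default_share_truth = round(100.0 * 185 / 1673, 2)
default_share_best = round(100.0 * 18 / 1673, 2)
print(gains, default_share_truth, default_share_best)
```
      </preformat>
      <p>The same calculation shows that the semantic similarity run on its own (SIMSEM-ASR) scores well below the IR baseline, so the gain comes from the combination rather than from either approach alone.</p>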
    </sec>
    <sec id="sec-6">
      <title>CONCLUSIONS</title>
      <p>In this paper we propose combining semantic
similarity based on WordNet with the IR approach
in order to solve the genre tagging task for videos. Because
our research field is Natural Language Processing
(NLP), we have worked only with the ASR transcripts of the
videos and their metadata. We have shown that combining
the semantic similarity score with the RSV score obtained
from the IR approach improves the official MAP of the IR
baseline by 22.79%. Nevertheless, it seems clear that working only with
the ASR transcripts generally yields poor results. In future
work, we will study other resources in order to increase the
size of the bag of words generated for each category, adding
more terms semantically related to such categories.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bozzon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fraternali</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Multimedia and multimodal information retrieval</article-title>
          .
          <source>In SeCO Workshop</source>
          (
          <year>2009</year>
          ),
          <string-name>
            <given-names>S.</given-names>
            <surname>Ceri</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Brambilla</surname>
          </string-name>
          , Eds., vol.
          <volume>5950</volume>
          of Lecture Notes in Computer Science, Springer, pp.
          <fpage>135</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Larson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eskevich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ordelman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kofler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmiedeke</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task</article-title>
          . In MediaEval 2011 Workshop (Pisa, Italy, September 1-2
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>An information-theoretic definition of similarity</article-title>
          .
          <source>In Proc. of the 15th Int'l. Conf. on Machine Learning</source>
          (
          <year>1998</year>
          ), pp.
          <fpage>296</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Martín-Valdivia</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Díaz-Galiano</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montejo-Ráez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ureña-López</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <article-title>Using Information Gain to Improve Multimodal Information Retrieval Systems</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>44</volume>
          (
          <year>2008</year>
          ),
          <fpage>1146</fpage>
          -
          <lpage>1158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Perea-Ortega</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montejo-Ráez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Díaz-Galiano</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Martín-Valdivia</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          <article-title>SINAI at Tagging Task Professional in MediaEval 2010</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2010 Workshop</source>
          , Pisa, Italy, October 24, 2010 (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Perea-Ortega</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montejo-Ráez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Díaz-Galiano</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Valdivia</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ureña-López</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <article-title>Using an information retrieval system for video classification</article-title>
          .
          <source>In Evaluating Systems for Multilingual and Multimodal Information Access</source>
          (
          <year>2009</year>
          ), vol.
          <volume>5706</volume>
          of Lecture Notes in Computer Science, Springer, pp.
          <fpage>927</fpage>
          -
          <lpage>930</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Perea-Ortega</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montejo-Ráez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Valdivia</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ureña-López</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <article-title>Using support vector machines as learning algorithm for video categorization</article-title>
          .
          <source>In CLEF, Part II</source>
          (
          <year>2010</year>
          ), vol.
          <volume>6242</volume>
          of Lecture Notes in Computer Science, Springer, In Press.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>