<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DCU at MediaEval 2011: Rich Speech Retrieval (RSR)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Eskevich</string-name>
          <email>meskevich@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth J. F. Jones</string-name>
          <email>gjones@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CDVP &amp; CNGL, School of Computing, Dublin City University</institution>
          ,
          <addr-line>Dublin 9</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CDVP, School of Computing, Dublin City University</institution>
          ,
          <addr-line>Dublin 9</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>We describe our runs and results for the Rich Speech Retrieval (RSR) Task at MediaEval 2011. Our runs examine the use of alternative segmentation methods on the provided ASR transcripts to locate the beginning of the topic, assuming that this will capture or get close to the starting point of the relevant segment; the combination of various types of queries and weighting of metadata to move the relevant segment higher in the ranked list; and different ASR transcripts to compare the influence of ASR transcript quality. Our results show that newer versions of the transcripts and use of metadata produce better results on average. So far we have not used information about the illocutionary act type corresponding to each query, but analysis of the retrieval results shows differences in behaviour for queries associated with certain classes of act.</p>
      </abstract>
      <kwd-group>
        <kwd>Speech search</kwd>
        <kwd>information retrieval</kwd>
        <kwd>automatic speech recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.3 [Information Storage and Retrieval]: H.3.1
Content Analysis and Indexing; H.3.3 Information Search and
Retrieval; H.3.4 Systems and Software</p>
    </sec>
    <sec id="sec-2">
      <title>General Terms</title>
    </sec>
    <sec id="sec-3">
      <title>1. INTRODUCTION</title>
      <p>
        The Rich Speech Retrieval (RSR) Task at MediaEval 2011
seeks to open discussion of a new task in the search of spoken
content. The information to be found has a special feature:
a certain speaker's intention (illocutionary act<fn><label>1</label><p>http://en.wikipedia.org/wiki/Speech_acts</p></fn>). This new
way of setting the problem of speech search raises the
question of how uniform the structures of naturally produced
queries are for different speech acts, and how belonging to a
certain type of act affects retrieval behaviour. This dataset
contains 5 basic speech acts: 'apology', 'definition',
'opinion', 'promise' and 'warning'. Two of these ('definition' and
'opinion') are more neutral and appear more as simple
textual requests for information, while the others are more
emotional and subjective, and therefore less similar to the
usual textual query style. A full description of the task can
be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The official metric of the RSR task, mGAP (mean Generalized
Average Precision), was used to evaluate our results. It reflects how close the
predicted jump-in point of a retrieved result is to the manual
ground truth within a certain window. The following
sections summarise our methods and results.
      </p>
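      <p>To make the window-based scoring concrete, the following is a minimal sketch of a known-item score in the spirit of mGAP. It is not the official evaluation script: it assumes a single relevant jump-in point per query and a linear penalty that falls to zero at the window edge, and the names window_penalty, query_gap and mean_gap are hypothetical. The official penalty function may differ.</p>
      <preformat><![CDATA[
```python
from typing import List, Sequence, Tuple

Result = Tuple[str, float]  # (video id, predicted jump-in time in seconds)


def window_penalty(predicted: float, truth: float, window: float) -> float:
    """Assumed linear penalty: 1.0 for an exact hit, 0.0 at or beyond the window edge."""
    distance = abs(predicted - truth)
    return max(0.0, 1.0 - distance / window)


def query_gap(ranked: Sequence[Result], truth_video: str,
              truth_time: float, window: float) -> float:
    """Score one query: penalty of the first in-window hit divided by its rank."""
    for rank, (video, start) in enumerate(ranked, start=1):
        if video == truth_video:
            score = window_penalty(start, truth_time, window)
            if score > 0.0:
                return score / rank
    return 0.0  # no result fell inside the window


def mean_gap(per_query_scores: List[float]) -> float:
    """Average the per-query scores over the whole query set."""
    if not per_query_scores:
        return 0.0
    return sum(per_query_scores) / len(per_query_scores)


# Hypothetical ranked list for one query whose true jump-in point is v1 at 60 s.
ranked = [("v2", 10.0), ("v1", 45.0), ("v1", 120.0)]
wide = query_gap(ranked, "v1", 60.0, window=60.0)
narrow = query_gap(ranked, "v1", 60.0, window=10.0)
```
]]></preformat>
      <p>In this toy example the second-ranked result is 15 seconds away from the ground truth, so it contributes under the 60-second window but scores zero under the 10-second window.</p>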
    </sec>
    <sec id="sec-4">
      <title>2. APPROACH DESCRIPTION</title>
      <p>
        The videos in the data set are diverse in their structure,
style of language and length. Both ASR transcripts and
confusion networks are provided for all videos, and either
can be used as input for the retrieval process. We
treated the 2010 transcripts and the 2011 confusion
networks in the same way, creating clean text out of the
words and punctuation they contain. The next step
was to preprocess the data for retrieval. We first
automatically segmented the data into topically coherent segments.
For this we examined the use of two existing text
segmentation algorithms: C99 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and TextTiling [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
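      <p>As an illustration of topical segmentation, the sketch below splits a token stream wherever two adjacent fixed-size blocks share too little vocabulary. This is the lexical-cohesion idea underlying TextTiling, much simplified: it uses a fixed similarity threshold instead of depth scores, and it is neither Hearst's algorithm nor C99. The function names, block size and threshold are assumptions chosen for the example.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment(tokens: List[str], block_size: int = 20,
            threshold: float = 0.1) -> List[List[str]]:
    """Group consecutive token blocks into segments by lexical cohesion."""
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    if not blocks:
        return []
    segments = [list(blocks[0])]
    for prev, cur in zip(blocks, blocks[1:]):
        if cosine(Counter(prev), Counter(cur)) < threshold:
            segments.append(list(cur))   # low cohesion: start a new segment
        else:
            segments[-1].extend(cur)     # same topic: keep growing the segment
    return segments


# Toy transcript: eight tokens about one topic, then eight about another.
toy_tokens = ["cat", "dog"] * 4 + ["tax", "law"] * 4
toy_segments = segment(toy_tokens, block_size=4, threshold=0.1)
```
]]></preformat>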
      <p>Most videos in the collection are accompanied by
metadata relating to the whole video, regardless of its length or
the number of topics discussed. This metadata tag
information was added once ('m1') or 5 times ('m5', to give it more
weight) to all of the segments in the file. Segment indexing
and retrieval were carried out using the Lemur<fn><label>2</label><p>http://www.lemurproject.org/</p></fn> Indri toolkit.</p>
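      <p>The metadata weighting can be pictured as plain text repetition before indexing. The sketch below is an assumption about how such repetition might be coded, with a hypothetical add_metadata helper. It shows that repeating the metadata raises the term frequency of metadata words in every segment.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import List


def add_metadata(segments: List[str], metadata: str, weight: int) -> List[str]:
    """Append the video-level metadata text `weight` times to every segment."""
    if weight < 0:
        raise ValueError("weight must be non-negative")
    suffix = " ".join([metadata] * weight)
    return [f"{seg} {suffix}".strip() for seg in segments]


# Toy segments from one video, with hypothetical metadata tags.
toy = ["welcome to the show", "today we talk about cooking"]
m1 = add_metadata(toy, "food recipes", weight=1)   # the 'm1' setting
m5 = add_metadata(toy, "food recipes", weight=5)   # the 'm5' setting

# The metadata term now occurs five times in each 'm5' segment.
tf_m5 = Counter(m5[0].split())
```
]]></preformat>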
      <p>As queries we used only the naturally formulated full query
('title'), the short query similar to a query for an
internet search engine ('google'), and the combination of both
('title + google'). For these experiments, the starting time
of the retrieved segment was taken as the jump-in point for each result.</p>
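      <p>A minimal sketch of the three query types and of the jump-in rule follows. It assumes that 'title + google' is a simple concatenation of the two query strings, which the text does not specify, and the names build_query and jump_in_points are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (video id, start time, end time) in seconds


def build_query(title: str, google: str, mode: str) -> str:
    """Return the query text for one of the three query types."""
    if mode == "title":
        return title
    if mode == "google":
        return google
    if mode == "title + google":
        return f"{title} {google}"   # assumed: plain concatenation
    raise ValueError(f"unknown query mode: {mode}")


def jump_in_points(ranked: List[Segment]) -> List[Tuple[str, float]]:
    """Report the start time of each retrieved segment as its jump-in point."""
    return [(video, start) for video, start, _end in ranked]


# Hypothetical query strings and ranked segments.
combined = build_query("what did he promise about taxes", "tax promise",
                       "title + google")
points = jump_in_points([("v1", 12.0, 40.0), ("v3", 0.0, 25.5)])
```
]]></preformat>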
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS</title>
      <p>Table 1 shows the results of our runs. As could have been
anticipated, a larger window size gives better scores, since
more of the results have non-zero GAP. More complicated
queries ('title + google') make the request for information
more detailed, and consequently relevant segments are found
more reliably. Addition of metadata, and especially allocation of
more weight to the metadata, can overcome the problem of
some keywords being misrecognized or not uttered at all in
the segment, and therefore improves the overall results. The
confusion networks provided for the 2011 dataset have the
restriction that the second variant is reported only if its confidence
measure is higher than 50%. In most cases this second
variant is either the same word written with a capital letter or
another grammatical form of the same word. Since we
took all the words from the confusion networks to
prepare our text, these variants do not bring new terms into
the document, but increase the weight of any term that has
multiple entries.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Parameters of the submitted runs (the segmentation method and mGAP score columns are not available here).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Transcript</th><th>Metadata (weight)</th><th>Query</th></tr>
          </thead>
          <tbody>
            <tr><td>2011</td><td>–</td><td>google</td></tr>
            <tr><td>2011</td><td>+ (5)</td><td>title + google</td></tr>
            <tr><td>2011</td><td>–</td><td>google</td></tr>
            <tr><td>2011</td><td>+ (5)</td><td>title + google</td></tr>
            <tr><td>2011</td><td>+ (1)</td><td>title + google</td></tr>
            <tr><td>2010</td><td>–</td><td>title</td></tr>
            <tr><td>2011</td><td>–</td><td>google</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec>
      <title>4. ILLOCUTIONARY ACT BREAKDOWN</title>
      <p>mGAP over all the queries shows average performance for
a specific combination of system parameters, but
it is also interesting to look at the results of the same
combinations broken down by illocutionary act type. When
simple queries ('title') are used on the 2010 transcript not
enriched with metadata, the results fall into two
classes: 'definition' and 'opinion' have scores of the same
level for window size 60, while the three other act types have
significantly lower scores. For the other simple
query type ('google'), the difference between speech act types is
not so distinct. However, with the small number of queries for
certain types (only 1 for 'apology'), it is hard to say whether
the query type or the dataset itself is the reason for the
results achieved.</p>
      <p>Our runs enable us to compare the effect of using
metadata with different weights ('2011 c99 m5 title and google' and
'2011 c99 m1 title and google'). Overall the 'm1' run has
lower scores than the 'm5' run, but the breakdown shows that the scores are
the same at all window sizes for 'apology', 'definition' and
'promise', and higher with 'm5' only for 'opinion' and 'warning'.</p>
    </sec>
    <sec id="sec-6">
      <title>5. CONCLUSIONS</title>
      <p>This investigation has shown that queries with
several dimensions, requesting not only specific data in the
transcript but also a certain emotion or illocution related to
it, have to be treated differently depending on
the type of speech act. When the illocution is less
neutral, more data needs to be combined in order to find the
relevant segments. While the distribution of the
illocutionary acts in the query set models real life, we may need to
create more queries of specific, less popular types in order to
develop better ways of processing the different query types.</p>
      <p>Preliminary experiments suggested C99 to be the better
algorithm for segmenting the data, hence more runs were
submitted with C99. However, the results from the full runs
show that TextTiling can outperform C99. More runs with
different combinations of transcripts and queries will be
carried out in further work.</p>
    </sec>
    <sec id="sec-7">
      <title>6. FUTURE WORK</title>
      <p>In future work we plan to compare all the possible
combinations of query types, use of metadata and transcript
segmentation, so that our results rest on a more
solid basis. Segmentation algorithms that have been developed
for other types of spoken content (e.g. meetings, broadcast
news) can be applied to the data in order to examine
alternative ways of splitting the transcripts into search units.
Since so far we have taken the beginning of the segment
as the jump-in point, another potential research direction may
be to postprocess the retrieved segment to locate the
jump-in point closer to the manually assigned position.</p>
    </sec>
    <sec id="sec-8">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>This work is funded by a grant under the Science
Foundation Ireland Research Frontiers Programme 2008 Grant No:
08/RFP/CMS1677.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F. Y. Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Advances in domain independent linear text segmentation</article-title>
          .
          <source>In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference</source>
          , pages
          <fpage>26</fpage>
          –
          <lpage>33</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hearst</surname>
          </string-name>
          .
          <article-title>TextTiling: A quantitative approach to discourse segmentation</article-title>
          .
          <source>Technical Report Sequoia 93/24</source>
          , Computer Science Department, University of California, Berkeley,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kofler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmiedeke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task</article-title>
          .
          <source>In Proceedings of the MediaEval 2011 Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>