<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DCU Search Runs at MediaEval 2014 Search and Hyperlinking</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>David N. Racca, Maria Eskevich, Gareth J.F. Jones CNGL Centre for Global Intelligent Content School of Computing Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>We describe Dublin City University (DCU)'s participation in the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Exploratory experiments were carried out to investigate the utility of prosodic prominence features in the task of retrieving relevant video segments from a collection of BBC videos. Normalised acoustic correlates of loudness, pitch, and duration were incorporated in a standard TF-IDF weighting scheme to increase weights for terms that were prominent in speech. Prosodic models outperformed a text-based TF-IDF baseline on the training set but failed to surpass the baseline on the test set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Increasing amounts of multimedia content are being
produced and stored on a daily basis. In order to make this
data useful, computer applications are required that
facilitate search, browsing, and navigation through these large
data collections. The MediaEval Search and Hyperlinking
task seeks to contribute to addressing this problem.</p>
      <p>
        In contrast with previous years where a known-item task
was examined, this year an ad-hoc search task was
introduced. The retrieval collection consists of an extension of
last year's collection, comprising 4021 hours of BBC TV
Broadcast content split into training and test sets of 1335
and 2686 hours respectively. For every video file in the
collection, the organizers provided human-generated subtitles,
three different automatic speech recognition (ASR)
transcripts (LIMSI/Vocapia, LIUM, and NST-Sheffield), prosodic
features, shot boundaries, visual concept detection output,
and additional metadata associated with each TV-show. The
training set includes 50 text queries while the test set
comprises 30 queries. More details about the data collection and
task evaluation metrics can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Previous research has demonstrated that prosodic
information is useful for a wide range of speech processing tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
including speech search tasks. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Crestani suggests that
there might be a direct relationship between acoustic stress
of terms and their TF-IDF score in the OGI Stories Corpus,
while Chen reports improvements on a spoken document
retrieval task by using energy and durational features [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Guinaudeau and Hirschberg improve a topic tracking
system by incorporating intensity and pitch values into the
retrieval weighting scheme.
This paper describes an implementation of an approach
that incorporates loudness, duration, and pitch into TF-IDF
weights in order to examine their potential to improve
retrieval effectiveness of video segments.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. FEATURE PROCESSING</title>
      <p>
        Following Guinaudeau's method [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], loudness and pitch
correlates were extracted from the speech signal and
normalised and aligned to each word occurrence in the
transcripts. To perform this alignment, word timestamps were
used in the case of the LIMSI/Vocapia and NST-Sheffield
transcripts. For subtitles, word timestamps had to be
approximated from each segment's starting and ending timestamps.
This was done by dividing the length of a segment by the
number of words it contains to obtain the average word
duration for that segment. Starting times of words were then
approximated as the starting time of the segment plus
multiples of its average word duration, with each word
assigned that average duration. In the case of the LIUM
transcript, word durations for the test set were approximated
by the average duration of all words in the training set.
      </p>
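      <p>The subtitle timestamp approximation described above can be sketched as follows. This is a minimal illustration, not the code used for the runs: the function name <monospace>approximate_word_times</monospace>, the tuple layout, and the example words are assumptions introduced here.</p>
      <preformat>
```python
from typing import List, Tuple


def approximate_word_times(
    words: List[str], seg_start: float, seg_end: float
) -> List[Tuple[str, float, float]]:
    """Spread the words of one subtitle segment evenly over its time span.

    Returns (word, start_time, duration) triples. Even spacing is only an
    approximation, since real words differ in length.
    """
    if not words:
        return []
    # Average word duration = segment length / number of words.
    avg_duration = (seg_end - seg_start) / len(words)
    # The i-th word (0-based) starts i average durations after the segment start.
    return [
        (word, seg_start + i * avg_duration, avg_duration)
        for i, word in enumerate(words)
    ]


# Hypothetical segment: four words shown between 10.0 s and 12.0 s.
timed = approximate_word_times(["good", "evening", "and", "welcome"], 10.0, 12.0)
```
      </preformat>
      <p>In this example every word receives a duration of 0.5 seconds, and the last word is placed at 11.5 seconds.</p>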
      <p>After the alignment was performed, the minimum, maximum,
mean, and standard deviation of loudness and pitch were
computed for each word. These four statistics were
normalised so that they could be compared against those of words spoken
in different acoustic conditions. The final objective was to
calculate an acoustic score for each spoken word that
represents how salient a word is relative to its surrounding context.
With this in mind, two different definitions of surrounding
context for a word were then considered:</p>
      <p>Context given by the words that belong to the same
speech segment predicted by the ASR (seg).</p>
      <p>The full length of the document, that is, all the words spoken
in the video (doc).</p>
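      <p>Finally, two normalisation functions were explored for
normalising a feature <inline-formula><tex-math>f_i</tex-math></inline-formula> over a context <inline-formula><tex-math>C</tex-math></inline-formula>:</p>
      <p>1. Range: <inline-formula><tex-math>(f_i - \min_C) / (\max_C - \min_C)</tex-math></inline-formula>.</p>
      <p>2. Z-score: <inline-formula><tex-math>(f_i - \mu_C) / \sigma_C</tex-math></inline-formula>, where <inline-formula><tex-math>\mu_C</tex-math></inline-formula> and <inline-formula><tex-math>\sigma_C</tex-math></inline-formula> are the mean and standard deviation of the feature over <inline-formula><tex-math>C</tex-math></inline-formula>.</p>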
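      <p>The two normalisation functions can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are introduced here, the population standard deviation is assumed for the z-score (the text does not say which estimator was used), and returning 0.0 for a constant context is a choice made only to avoid division by zero. The context list would hold the feature values of either one ASR segment (seg) or the whole video (doc).</p>
      <preformat>
```python
from statistics import mean, pstdev
from typing import List


def range_normalise(value: float, context: List[float]) -> float:
    """Range normalisation: (f_i - min_C) / (max_C - min_C)."""
    lo, hi = min(context), max(context)
    if hi == lo:
        # Assumption: a constant context carries no salience information.
        return 0.0
    return (value - lo) / (hi - lo)


def z_score_normalise(value: float, context: List[float]) -> float:
    """Z-score normalisation: (f_i - mu_C) / sigma_C."""
    mu = mean(context)
    sigma = pstdev(context)  # assumption: population standard deviation
    if sigma == 0:
        return 0.0
    return (value - mu) / sigma


# Hypothetical maximum-loudness values for the words of one context.
context = [2.0, 4.0, 6.0]
range_mid = range_normalise(4.0, context)
z_top = z_score_normalise(6.0, context)
```
      </preformat>
      <p>With these values the middle word maps to 0.5 under range normalisation, while the loudest word receives a positive z-score.</p>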
    </sec>
    <sec id="sec-3">
      <title>3. RETRIEVAL FRAMEWORK</title>
      <p>
        Text transcripts were segmented into fixed-time adjacent
(non-overlapping) segments of 90 seconds duration. Before
indexing, stop words from the standard Terrier list [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] were
removed and Porter stemming applied. Segments were then
indexed using a modified version of Terrier-3.5 that
associated acoustic features with term occurrences in the inverted
index. Note that due to stemming, multiple words can be
mapped to the same stem. In these cases, acoustic feature
vectors associated with each non-stemmed word occurrence
in a segment were treated as belonging to the same stem
and thus were linked with this term in the inverted index.
      </p>
      <p>[Table: rows grouped by transcript (Subtitles, NST-Sheffield); columns give the normalisation function (range, z-score) and context (seg, doc).]</p>
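      <p>The fixed-length segmentation step can be illustrated with the short sketch below. It assumes that each word is assigned to a segment by its start time; the text does not state how words straddling a boundary were handled, and the function name and example data are introduced here for illustration only.</p>
      <preformat>
```python
from collections import defaultdict
from typing import Dict, List, Tuple

SEGMENT_SECONDS = 90.0  # segment length stated in the text


def segment_words(
    timed_words: List[Tuple[str, float]],
    segment_seconds: float = SEGMENT_SECONDS,
) -> Dict[int, List[str]]:
    """Group (word, start_time) pairs into adjacent, non-overlapping windows.

    Segment k covers [k * segment_seconds, (k + 1) * segment_seconds).
    """
    segments: Dict[int, List[str]] = defaultdict(list)
    for word, start in timed_words:
        # Assumption: a word belongs to the window containing its start time.
        segments[int(start // segment_seconds)].append(word)
    return dict(segments)


# Hypothetical transcript fragment with start times in seconds.
words = [("hello", 3.2), ("world", 89.9), ("news", 90.0), ("today", 185.0)]
segments = segment_words(words)
```
      </preformat>
      <p>Here the first two words fall into segment 0, the third into segment 1, and the last into segment 2.</p>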
      <p>
        Retrieval was performed using Terrier's standard
implementation of the vector space model (VSM) with a
modified TF-IDF weighting function that takes into account the
acoustic features from the inverted index when computing
term weights. The weight of a term t in a segment was
computed using Guinaudeau's harmonic mean [
        <xref ref-type="bibr" rid="ref5">5</xref>
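        ]:
<disp-formula><tex-math>w(t) = \frac{\theta_{ir} \cdot idf_t \cdot tf_t + \theta_{ac} \cdot ac_t}{\theta_{ir} + \theta_{ac}}</tex-math></disp-formula>
Different definitions were explored for the acoustic score
(<inline-formula><tex-math>ac_t</tex-math></inline-formula>). In all cases, <inline-formula><tex-math>ac_t</tex-math></inline-formula> was intended to represent the level
of salience of t relative to its surrounding context. In particular,
the product of the maximum loudness and
maximum pitch (G-lp), the pitch range given by the maximum
and minimum pitch (G-pr), and the maximum duration
(G-dur) with which t was pronounced in the segment were used
as definitions for <inline-formula><tex-math>ac_t</tex-math></inline-formula>. Values for the free parameters <inline-formula><tex-math>\theta_{ir}</tex-math></inline-formula> and
<inline-formula><tex-math>\theta_{ac}</tex-math></inline-formula> were selected to optimise mean reciprocal rank (MRR),
mean generalised average precision (mGAP), and mean
average segment precision (MASP) [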
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on the training set for
individual ASR transcripts and normalisation types.
Specifically, the runs G-lp, G-pr, and G-dur were optimised for the
LIUM, NST-Sheffield, and LIMSI transcripts respectively.
      </p>
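      <p>A minimal sketch of the combined term weight and the three acoustic scores is given below. It is not the modified Terrier code: the function names are introduced here, the tf and idf values are taken as given inputs rather than computed as Terrier does, the two free parameters are named <monospace>theta_ir</monospace> and <monospace>theta_ac</monospace> by assumption, and G-pr is assumed to be the difference between the maximum and minimum normalised pitch.</p>
      <preformat>
```python
from typing import List


def acoustic_weight(
    tf: float, idf: float, ac: float, theta_ir: float, theta_ac: float
) -> float:
    """Weighted combination of the TF-IDF score and the acoustic score ac_t."""
    total = theta_ir + theta_ac
    if total == 0:
        raise ValueError("theta_ir + theta_ac must be non-zero")
    return (theta_ir * idf * tf + theta_ac * ac) / total


def g_lp(max_loudness: float, max_pitch: float) -> float:
    """G-lp: product of maximum loudness and maximum pitch."""
    return max_loudness * max_pitch


def g_pr(max_pitch: float, min_pitch: float) -> float:
    """G-pr: pitch range (assumed here to be max minus min)."""
    return max_pitch - min_pitch


def g_dur(durations: List[float]) -> float:
    """G-dur: longest duration among the term's occurrences in the segment."""
    return max(durations)


# Hypothetical normalised feature values for one term in one segment.
ac_score = g_lp(0.5, 0.75)
weight = acoustic_weight(tf=2, idf=1.5, ac=0.5, theta_ir=3, theta_ac=1)
text_only = acoustic_weight(tf=2, idf=1.5, ac=0.5, theta_ir=3, theta_ac=0)
```
      </preformat>
      <p>Setting <monospace>theta_ac</monospace> to zero reduces the weight to the plain product of tf and idf, which corresponds to the text-only baseline in this sketch.</p>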
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS AND CONCLUSIONS</title>
      <p>[...] training and test sets were produced with different objectives in
mind. This could be another reason why the models
presented in this work seem to have overfitted the training set.</p>
      <p>In future work, an error analysis will be carried out in
order to identify queries for which prosodic-based models
could have outperformed the baseline.</p>
    </sec>
    <sec id="sec-5">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work was supported by Science Foundation Ireland
(Grant 12/CE/I2267) as part of the Centre for Global
Intelligent Content CNGL II project at DCU.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-M.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.-S.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Improved spoken document retrieval by exploring extra acoustic and linguistic cues</article-title>
          .
          <source>In Proceedings Interspeech'01</source>
          , pages
          <fpage>299</fpage>
          –
          <lpage>302</lpage>
          , Aalborg, Denmark,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          .
          <article-title>Towards the use of prosodic information for spoken document retrieval</article-title>
          .
          <source>In Proceedings ACM SIGIR'01</source>
          , pages
          <fpage>420</fpage>
          –
          <lpage>421</lpage>
          , New Orleans, LA, USA,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>The search and hyperlinking task at MediaEval 2013</article-title>
          .
          <source>In Proceedings of the MediaEval 2013 Workshop</source>
          , Barcelona, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>The search and hyperlinking task at MediaEval 2014</article-title>
          .
          <source>In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop</source>
          , Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guinaudeau</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hirschberg</surname>
          </string-name>
          .
          <article-title>Accounting for prosodic information to improve ASR-based topic tracking for TV broadcast news</article-title>
          .
          <source>In Proceedings Interspeech'11</source>
          , pages
          <fpage>1401</fpage>
          –
          <lpage>1404</lpage>
          , Florence, Italy,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hirschberg</surname>
          </string-name>
          .
          <article-title>Communication and prosody: Functional aspects of prosody</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>36</volume>
          (
          <issue>1</issue>
          ):
          <fpage>31</fpage>
          –
          <lpage>43</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          .
          <article-title>Research directions in Terrier: a search engine for advanced retrieval on the web</article-title>
          .
          <source>Novatica/UPGRADE Special Issue on Next Generation Web Search</source>
          , pages
          <fpage>49</fpage>
          –
          <lpage>56</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>