<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Prosody-Based Similarity Models for Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Steven D. Werner</string-name>
          <email>stevenwerner@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nigel G. Ward</string-name>
          <email>nigelward@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Texas at El Paso</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Prosody is important in spoken language, and especially in dialog, but its utility for search in dialog archives has remained an open question. Using prosody-based measures of similarity, which also roughly correlate with dialog-activity similarity and topic similarity, we built support for “retrieve more like this” searches. Performance on the Similar Segments in Social Speech Task at MediaEval 2013 was well above baseline, showing the value of prosody for search.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In most cases people searching in audio are probably not
really interested in finding words. What people want is
often information of some type, which may be characterized
in part by dialog process or activity, for example
recommending, answering a question, agreeing, forming a decision,
telling life stories, making plans, hearing surprising
statements, giving advice, explaining, and so on. In dialog, such
activities and topics are often associated with characteristic
prosodic features and patterns.</p>
      <p>
        Our basic idea is to use a vector-space model of dialog
activity, where each moment in time maps to a point in this
space. This representation is obtained by applying Principal
Component Analysis to 78 local prosodic features, computed
every 10 ms over a 6-second sliding window [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
This feature set was chosen for simplicity of computation
and for providing coverage of most of the prosodic aspects
known to be most relevant for dialog. It resembles that used
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but with more volume features and fewer pitch
features, more speaker features and fewer interlocutor features,
and more narrow-window features close to the point of
interest and fewer distant-context features. After PCA this
gave 78 dimensions, ordered by how much of the variation
they explained.
      </p>
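      <p>As a rough illustration only, not necessarily the authors' actual pipeline, the construction of the dialog space can be sketched in Python as follows; the feature matrix, its file name, and the per-feature standardization step are assumptions.</p>
      <preformat>
# Minimal sketch: build the 78-dimensional dialog-activity space by PCA.
# The prosodic feature extraction itself is assumed to be done elsewhere.
import numpy as np
from sklearn.decomposition import PCA

# One 78-dimensional prosodic feature vector per 10 ms frame, each computed
# over a 6-second window around that frame (hypothetical file name and shape).
features = np.load("prosodic_features.npy")      # shape: (num_frames, 78)

# Standardize each feature (an assumption), then rotate into principal components.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
pca = PCA(n_components=78)
dialog_space = pca.fit_transform(standardized)   # shape: (num_frames, 78)

# Dimensions come out ordered by how much of the variation they explain.
print(pca.explained_variance_ratio_[:5])
      </preformat>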
      <p>
        In previous work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] we found that dialog timepoints which
were proximal in this space tended to be similar not only in
dialog activity but in topic as well. Here we extend this
work to use better similarity models, and report positive
results on a standard problem, namely the Similar Segments
in Social Speech Task at MediaEval 2013 for which the task
definition, data set, and evaluation metrics may be found in
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. THE MODELS</title>
      <p>The similar segments task is based on regions, but the
dialog-space model is based on timepoints. For simplicity,
the middle point of the query region is used as the
characteristic point. The most similar (proximal) timepoints, across
the entire corpus, are then found and returned, in order, as
the ranked list of jump-in points.</p>
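      <p>A minimal sketch of this retrieval step, assuming the dialog_space matrix from the sketch above and a frame_times array giving the time of each frame (both hypothetical names); plain Euclidean distance stands in here for whichever similarity model is in use.</p>
      <preformat>
import numpy as np

def rank_jump_in_points(dialog_space, frame_times, query_start, query_end, top_k=100):
    """Return the top_k frame times most similar to the query region's midpoint."""
    # Characterize the query region by its middle timepoint.
    mid_time = 0.5 * (query_start + query_end)
    query_idx = np.argmin(np.abs(frame_times - mid_time))
    query_vec = dialog_space[query_idx]

    # Proximity of the query point to every timepoint in the corpus.
    dists = np.linalg.norm(dialog_space - query_vec, axis=1)

    # The closest timepoints, in order, form the ranked list of jump-in points.
    order = np.argsort(dists)
    return frame_times[order[:top_k]]
      </preformat>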
      <p>
        We started with a similarity metric using simple Euclidean
distance in the vector space, as described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However we
observed that some of the dimensions seemed especially
useful for the similarity computations and/or more revealing of
dialog activities. We wanted our models to reflect this, with
greater weights for such dimensions. Doing so sacrifices the
distance metaphor, but is computationally similar.
Specifically, for any two points in a dialog, x and y, we compute a
weighted sum of their differences across the dimensions:
\mathrm{dissimilarity}(x, y) = \sum_{i=1}^{78} w_i \, |x_i - y_i| \qquad (1)
      </p>
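      <p>In code, Eq. (1) is just a weighted L1 comparison; this small sketch assumes the two points and the weight vector are supplied by the caller (uniform weights for the &#8220;dissim&#8221; model, trained weights otherwise).</p>
      <preformat>
import numpy as np

def dissimilarity(x, y, w):
    """Weighted sum of absolute per-dimension differences, as in Eq. (1)."""
    # x, y: 78-dimensional points in the dialog space; w: per-dimension weights.
    return float(np.sum(w * np.abs(x - y)))
      </preformat>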
      <p>First we tried this with uniform weights, giving the
“dissim” results in the tables. We then tried optimized weights,
trained using linear regression, where the target was a
distance of 0 if x and y were similar, and 1 if they were not
similar. Thus, for example, if two selected timepoints x and
y both were located in regions that had been tagged as talk
about “favorite movies,” then x and y were counted as
similar. If x and y shared no tags, they were counted as not
similar. This is of course not ideal, since a point-pair might
be similar even if not belonging to regions that were felt to
be worth tagging. Sets of similar and non-similar timepoint
pairs were obtained by random sampling over the training
set.</p>
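      <p>One way to realize this weight training, sketched under our own assumptions about the data layout, is ordinary linear regression on the per-dimension absolute differences, with target 0 for similar pairs and 1 for non-similar pairs; the pair sampling is assumed to have happened already.</p>
      <preformat>
import numpy as np
from sklearn.linear_model import LinearRegression

def train_weights(pairs, labels, dialog_space):
    """Fit the per-dimension weights of Eq. (1) from sampled timepoint pairs.

    pairs:  (i, j) frame-index pairs sampled from the training set.
    labels: 0 if the pair shared a tag (similar), 1 if it shared none.
    """
    # One training example per pair: its absolute difference on each dimension.
    X = np.array([np.abs(dialog_space[i] - dialog_space[j]) for i, j in pairs])
    y = np.array(labels)

    # Regress toward the 0/1 targets; the coefficients serve as the weights w_i.
    reg = LinearRegression().fit(X, y)
    return reg.coef_
      </preformat>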
      <p>
        For sampling we experimented with various more
restrictive definitions of similar. One type of constraint was to
require agreement by at least some number of annotators in
order to consider a timepoint pair as similar. For this, the
label names were ignored (as always), so the annotators
might have considered the points to be similar in entirely
different ways. The second type of constraint relied on the
utility values (“weights”) assigned by the annotators to their
tags: the higher the value, the more informative and cohesive
they judged the tag set to be. For example, in one sampling we included
only pairs whose connecting tag was rated 3, excluding those
rated 0, 1, or 2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Requiring higher tag weights and more
agreement gave higher-quality training data, but at the cost
of reducing the quantity of similar point-pairs available to
train with.
      </p>
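      <p>The stricter sampling conditions amount to a filter over candidate similar pairs; the field names below are hypothetical stand-ins for the annotation metadata described above.</p>
      <preformat>
def keep_pair(pair_info, min_annotators=1, min_tag_weight=0):
    """Check a candidate similar pair against the sampling constraints.

    pair_info is assumed to carry:
      "num_annotators" - how many annotators linked the two points (by any tag)
      "tag_weight"     - the 0-3 utility rating of the connecting tag
    """
    return (pair_info["num_annotators"] >= min_annotators
            and pair_info["tag_weight"] >= min_tag_weight)

# Example: the sampling that kept only pairs whose connecting tag was rated 3.
keep_strict = lambda info: keep_pair(info, min_tag_weight=3)
      </preformat>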
      <p>[Tables: results for the Random baseline, the tagset-exploiting
(Expl.) model, the Euclidean Distance model, the uniform-weight Dissim.
model, and the trained models all+, good+, all-p+, and good-p+, reported
as raw and normalized s.u.r, precision, and naive and normalized recall.]</p>
      <p>We also experimented with pruning the dimensions, using
two feature selection methods. This was prompted by the
observation that linear regression consistently gave negative
weights to some of the dimensions, for example dimension 67, which,
when we listened to it, seemed to encode the difference
between calm, indifferent speech and energetic explaining. The
first method was to try leaving a dimension out of the model
(set its weight to zero) and, if that improved performance
on a held-out subset of the training data, to drop it from
the set. This was iterated, typically resulting in dropping
about a third of the dimensions. The second approach was
to simply drop any dimension to which regression assigned
a negative weight.</p>
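      <p>A sketch of both pruning strategies, assuming an evaluate(weights) function (hypothetical) that scores a candidate weight vector on a held-out subset of the training data.</p>
      <preformat>
import numpy as np

def prune_negative(weights):
    """Second approach: drop (zero out) every dimension given a negative weight."""
    pruned = weights.copy()
    pruned[pruned < 0] = 0.0
    return pruned

def prune_by_heldout(weights, evaluate):
    """First approach: iteratively drop dimensions whose removal helps the held-out score."""
    weights = weights.copy()
    improved = True
    while improved:
        improved = False
        best_score = evaluate(weights)
        for i in np.flatnonzero(weights):
            trial = weights.copy()
            trial[i] = 0.0                  # leave dimension i out of the model
            score = evaluate(trial)
            if score > best_score:          # keep the drop only if it helps
                weights, best_score = trial, score
                improved = True
    return weights
      </preformat>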
    </sec>
    <sec id="sec-3">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>The tables show the results for the four models which
performed best on the training set and for four reference
models: the baseline, where the jump-in points for each query
are randomly selected; a tagset-exploiting model, where jump-in
points are found by considering tags assigned by other annotators
to regions that overlap the query region; the Euclidean
distance model; and a model based on uniform-weight
dissimilarity, that is, like distance but using absolute-value
instead of squared differences. (From the point of view of the
competition, these results are all unofficial, since the authors,
being also the competition organizers, had privileged access
to the data.) We used the tagset-exploiting model as a likely
upper bound on performance, as it is akin to how a second
human might perform the search task. For the best models,
performance is far above baseline, showing that information
retrieval can indeed benefit from using prosodic information.</p>
      <p>These results are, however, weaker than those that can be
obtained by using lexical features. Perhaps in this corpus
topical similarity was more relevant than functional
similarity, and perhaps lexical models are better for topic
similarity. Even so, prosodic models may still be of value, as is, for
languages for which speech recognizers are not available or
perform poorly. We further conjecture that prosody is
capturing dimensions of similarity not seen in lexical
similarity, and therefore that a combined model could do even
better. Exploring this is a priority for future research.</p>
      <p>The effects of using higher-quality training data varied
with the test set: on the training set, using the good-quality
set gave the best performance, but on the test set the model
trained using all the data performed best. Pruning was
generally beneficial: dropping dimensions with negative
weights was the most useful, with some additional benefit
from also selectively dropping further dimensions.</p>
      <p>Looking at robustness to changes in the data, the picture
is clouded by the fact that the test set was harder, in terms of
recall (because the target regions, like all regions in this set,
tended to be shorter and thus harder to find). Nevertheless,
the model that was best on the training set still performed
far above baseline, showing a degree of generalizability.
      <p>
        Although the potential utility of prosody for search has
long been discussed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and demonstrations of the relevance
of prosody for inferring emotion and dialog acts are
common, here we demonstrate, for the first time, that prosodic
information, used by itself, is actually of value for search in
audio archives.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. ACKNOWLEDGMENTS</title>
      <p>We thank the National Science Foundation for support via
an REU supplement to Award IIS-0914868, and Olac Fuentes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. E.</given-names>
            <surname>Shriberg</surname>
          </string-name>
          .
          <article-title>Combining words and prosody for information extraction from speech</article-title>
          .
          <source>In Proc. Eurospeech</source>
          , vol.
          <volume>5</volume>
          , pages
          <fpage>1991</fpage>
          -
          <lpage>1994</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Ward</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Vega</surname>
          </string-name>
          .
          <article-title>A bottom-up exploration of the dimensions of dialog state in spoken interaction</article-title>
          .
          <source>In 13th Annual SIGdial Meeting on Discourse and Dialogue</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Ward</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Werner</surname>
          </string-name>
          .
          <article-title>Data collection for the Similar Segments in Social Speech task</article-title>
          . University of Texas at El Paso,
          <source>Technical Report, UTEP-CS-13-58</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Ward</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Werner</surname>
          </string-name>
          .
          <article-title>Using dialog-activity similarity for spoken information retrieval</article-title>
          .
          <source>In Interspeech</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Werner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Novick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kawahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. E.</given-names>
            <surname>Shriberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-P.</given-names>
            <surname>Morency</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Oertel</surname>
          </string-name>
          .
          <article-title>The similar segments in social speech task</article-title>
          .
          <source>In MediaEval Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>