Evaluating Prosody-Based Similarity Models for
                           Information Retrieval

                                        Steven D. Werner                   Nigel G. Ward
                                    University of Texas at El Paso   University of Texas at El Paso
                                    stevenwerner@acm.org               nigelward@acm.org


ABSTRACT                                                                The similar segments task is based on regions, but the
Prosody is important in spoken language, and especially in           dialog-space model is based on timepoints. For simplicity,
dialog, but its utility for search in dialog archives has re-        the middle point of the query region is used as the character-
mained an open question. Using prosody-based measures of             istic point. The most similar (proximal) timepoints, across
similarity, which also roughly correlate with dialog-activity        the entire corpus, are then found and returned, in order, as
similarity and topic similarity, we built support for “retrieve      the ranked list of jump-in points.
more like this” searches. Performance on the Similar Seg-               We started with a similarity metric using simple Euclidean
ments in Social Speech Task at MediaEval 2013 was well               distance in the vector space, as described in [4]. However, we
above baseline, showing the value of prosody for search.             observed that some of the dimensions seemed especially use-
                                                                     ful for the similarity computations and/or more revealing of
                                                                     dialog activities. We wanted our models to reflect this, with
1.    INTRODUCTION                                                   greater weights for such dimensions. Doing so sacrifices the
   In most cases people searching in audio are probably not          distance metaphor, but is computationally similar. Specifi-
really interested in finding words. What people want is of-          cally, for any two points in a dialog, x and y, we compute a
ten information of some type, which may be characterized             weighted sum of their differences on the dimensions:
in part by dialog process or activity, for example recom-
mending, answering a question, agreeing, forming a decision,
telling life stories, making plans, hearing surprising state-            dissimilarity = \sum_{i=1}^{78} w_i |x_i - y_i|     (1)
ments, giving advice, explaining, and so on. In dialog, such
activities and topics often are associated with characteristic          First we tried this with uniform weights, giving the “dis-
prosodic features and patterns.                                      sim” results in the tables. We then tried optimized weights,
   Our basic idea is to use a vector-space model of dialog           trained using linear regression, where the target was a dis-
activity, where each moment in time maps to a point in this          tance of 0 if x and y were similar, and 1 if they were not
space. This representation is obtained by applying Principal         similar. Thus, for example, if two selected timepoints x and
Component Analysis to 78 local prosodic features computed            y both were located in regions that had been tagged as talk
every 10 ms over a 6-second sliding window [2].                      about “favorite movies,” then x and y were counted as sim-
This feature set was chosen for simplicity of computation            ilar. If x and y shared no tags, they were counted as not
and for providing coverage of most of the prosodic aspects           similar. This is of course not ideal, since a point-pair might
known to be most relevant for dialog. It resembles that used         be similar even if not belonging to regions that were felt to
in [2], but with more volume features and fewer pitch fea-           be worth tagging. Sets of similar and non-similar timepoint-
tures, more speaker features and fewer interlocutor features,        pairs were obtained by random sampling over the training
and more narrow-window features close to the point of in-            set.
terest and fewer distant-context features. After PCA this               For sampling we experimented with various more restric-
gave 78 dimensions, ordered by how much of the variation             tive definitions of similar. One type of constraint was to
they explained.                                                      require agreement by at least some number of annotators in
   In previous work [4] we found that dialog timepoints which        order to consider a timepoint pair as similar. For this the
were proximal in this space tended to be similar not only in         label names were ignored (as always), and so the annotators
dialog activity but in topic as well. Here we extend this            might have considered the points to be similar in different
work to use better similarity models, and report positive            ways entirely. The second type of constraint relied on the
results on a standard problem, namely the Similar Segments           utility values (“weights”) assigned by the annotators to their
in Social Speech Task at MediaEval 2013, for which the task          tags, with higher values the more informative and cohesive they thought
definition, data set, and evaluation metrics may be found in         the tagset was. For example, in one sampling we included
[5].                                                                 only pairs whose connecting tag was rated 3, excluding those
                                                                     rated 0, 1, or 2 [3]. Requiring higher tagweights and more
2.    THE MODELS                                                     agreement gave higher-quality training data, but at the cost
                                                                     of reducing the quantity of similar point-pairs available to
                                                                     train with.
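
The midpoint rule for queries, the weighted dissimilarity of Equation (1), and the regression-trained weights can be sketched as below. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the least-squares fit uses plain gradient descent as a stand-in for whatever regression solver was actually used, and the intercept term is an assumption, since the text does not say whether one was included.

```python
def query_timepoint(region_start, region_end):
    """Characteristic point of a query region: its middle point."""
    return (region_start + region_end) / 2.0


def dissimilarity(x, y, weights):
    """Equation (1): weighted sum of absolute per-dimension differences."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


def rank_timepoints(query, corpus, weights, top_k=10):
    """Indices of corpus timepoints, least dissimilar first.

    query  : vector for the query's characteristic point
    corpus : list of vectors, one per candidate timepoint
    """
    order = sorted(range(len(corpus)),
                   key=lambda i: dissimilarity(query, corpus[i], weights))
    return order[:top_k]


def fit_weights(pairs, targets, n_dims, lr=0.1, epochs=2000):
    """Least-squares fit of the weights in Equation (1).

    pairs   : list of (x, y) timepoint vectors
    targets : 0 for a similar pair, 1 for a non-similar pair
    Gradient descent is an assumption made to keep the sketch
    dependency-free; lr and epochs would need tuning on real data.
    """
    feats = [[abs(a - b) for a, b in zip(x, y)] for x, y in pairs]
    w = [0.0] * n_dims
    bias = 0.0  # assumed intercept, not mentioned in the text
    n = len(feats)
    for _ in range(epochs):
        grad_w = [0.0] * n_dims
        grad_b = 0.0
        for f, t in zip(feats, targets):
            err = bias + sum(wi * fi for wi, fi in zip(w, f)) - t
            grad_b += err
            for i, fi in enumerate(f):
                grad_w[i] += err * fi
        bias -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, bias
```

With uniform weights, `dissimilarity` corresponds to the "dissim" model; passing the weights returned by `fit_weights` corresponds to the trained variants.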
Copyright is held by the authors.                                       We also experimented with pruning the dimensions, using
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain
               naive     raw     raw    norm.    norm.                         naive      raw     raw    norm.    norm.
    model      prec.   recall   s.u.r    s.u.r   recall     F      model       prec.    recall   s.u.r    s.u.r   recall      F
    Random      6%      23%     0.25     0.86     0.83    0.86     Random        7%      11%     0.12     0.43     0.40    0.43
    Expl.      16%      46%     0.43     1.49     1.67    1.50     Expl.         9%      18%     0.29     1.00     0.67    0.95
    Distance     3%     26%     0.21     0.74     0.96    0.76     Distance      3%      16%     0.12     0.41     0.57    0.42
    Dissim.      4%     26%     0.22     0.76     0.96    0.78     Dissim.       6%      22%     0.17     0.58     0.81    0.60
    all+         6%     31%     0.26     0.89     1.12    0.91     all+          7%      28%     0.22     0.75     1.03    0.77
    good+        6%     34%     0.27     0.94     1.25    0.97     good+         6%      22%     0.17     0.60     0.81    0.61
    all-p+       7%     32%     0.27     0.92     1.17    0.94     all-p+        7%      30%     0.22     0.77     1.08    0.79
    good-p+      7%     34%     0.28     0.96     1.24    0.98     good-p+       7%      26%     0.20     0.69     0.93    0.71


Table 1: Performance on Training Set.           all =             Table 2: Performance on the Test Set, as above.
trained using all training-data similarity sets; good
= trained on only point-pairs which were in same-
tagged regions according to at least three annota-               better. Exploring this is a priority for future research.
tors; p = iterative-leave-one-out pruning applied to                The effects of using higher-quality training data varied
dimensions, + = only positively-weighted dimen-                  with the evaluation set: on the training set, using the good quality
sions retained; s.u.r = speaker utility ratio.                   set gave the best performance, but on the test set the model
                                                                 trained using all the data performed best. Pruning was gen-
                                                                 erally beneficial, with dropping dimensions with negative
two feature selection methods. This was prompted by the          weights being the most useful, with some additional benefit
observation that linear regression consistently gave negative    from also selectively dropping dimensions.
weights to some of the dimensions, for example 67, which,           Looking at robustness to changes in the data, the picture
when we listened to it, seemed to encode the difference be-       is clouded by the fact that the test set was harder, in terms of
tween calm, indifferent speech and energetic explaining. The      recall (because the target regions, like all regions in this set,
first method was to try to leave a dimension out of the model    tended to be shorter and thus harder to find). Nevertheless,
(set its weight to zero), and if that improved performance       on the test set the best model’s performance was still
on a held-out subset of the training data, to drop it from       far above baseline, showing a degree of generalizability.
the set. This was iterated, typically resulting in dropping         Although the potential utility of prosody for search has
about a third of the dimensions. The second approach was         been long discussed [1], and demonstrations of the relevance
to simply drop any dimension to which regression assigned        of prosody for inferring emotion and dialog acts are com-
a negative weight.                                               mon, here we demonstrate, for the first time, that prosodic
                                                                 information, used by itself, is actually of value for search in
                                                                 audio archives.
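
The two feature selection methods described in Section 2 can be sketched as below. This is an illustration under stated assumptions, not the authors' code: `score` is a hypothetical callable that returns performance on a held-out subset of the training data (higher is better), and the greedy acceptance of the first improving removal is one possible reading of "this was iterated".

```python
def drop_negative(weights):
    """Second method: zero out every dimension with a negative weight."""
    return [w if w >= 0.0 else 0.0 for w in weights]


def prune_leave_one_out(weights, score):
    """First method: iterative leave-one-out pruning.

    Tentatively set one dimension's weight to zero; keep the change
    only if the held-out score improves. Repeat until a full pass
    over the dimensions yields no improvement.
    """
    current = list(weights)
    best = score(current)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            if current[i] == 0.0:
                continue  # dimension already dropped
            trial = list(current)
            trial[i] = 0.0
            trial_score = score(trial)
            if trial_score > best:
                current, best, improved = trial, trial_score, True
    return current
```

Applying `drop_negative` to the output of `prune_leave_one_out` would correspond to the "-p+" models in the tables, assuming the two steps were applied in that order.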
3.    RESULTS AND DISCUSSION
   The tables show the results¹ for the four models which
performed best on the training set and four reference mod-       4.   ACKNOWLEDGMENTS
els: the baseline, where the jump-in points for each query are     We thank the National Science Foundation for support via
randomly selected; a tagset-exploiting model, where              an REU supplement to Award IIS-0914868, and Olac Fuentes.
jump-in points are found by considering tags by other annotators
with regions that overlap the query region; the Euclidean        5.   REFERENCES
distance model; and a model based on uniform-weight dissi-       [1] D. Hakkani-Tur, G. Tur, A. Stolcke, and E. E.
milarity, that is, like distance but using absolute-value in-        Shriberg. Combining words and prosody for
stead of squared differences. We used the tagset-exploiting          information extraction from speech. In Proc.
model as a likely upper bound on performance, as it is akin          Eurospeech, vol. 5, pages 1991–1994, 1999.
to how a second human might themselves perform the search
                                                                 [2] N. G. Ward and A. Vega. A bottom-up exploration of
task. For the best models, performance is far above base-
                                                                     the dimensions of dialog state in spoken interaction. In
line, showing that information retrieval can indeed benefit
                                                                     13th Annual SIGdial Meeting on Discourse and
by using prosodic information.
                                                                     Dialogue, 2012.
   These results are, however, weaker than those that can be
                                                                 [3] N. G. Ward and S. D. Werner. Data collection for the
obtained by using lexical features. Perhaps in this corpus
                                                                     Similar Segments in Social Speech task. University of
topical similarity was more relevant than functional similar-
                                                                     Texas at El Paso, Technical Report, UTEP-CS-13-58,
ity, and perhaps lexical models are better for topic similar-
                                                                     2013.
ity. Thus prosodic models may still be of value, as is, for
languages for which speech recognizers are not available or      [4] N. G. Ward and S. D. Werner. Using dialog-activity
perform poorly. We further conjecture that the prosody is            similarity for spoken information retrieval. In
capturing dimensions of similarity not seen in lexical simi-         Interspeech, 2013.
larity, and therefore that a combined model could do even        [5] N. G. Ward, S. D. Werner, D. G. Novick, T. Kawahara,
                                                                     E. E. Shriberg, L.-P. Morency, and C. Oertel. The
¹ From the point of view of the competition, these results are       similar segments in social speech task. In MediaEval
all unofficial, since the authors, being also the competition        Workshop, 2013.
organizers, had privileged access to the data.