=Paper=
{{Paper
|id=None
|storemode=property
|title=Evaluating Prosody-Based Similarity Models for Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_52.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WernerW13
}}
==Evaluating Prosody-Based Similarity Models for Information Retrieval==
Steven D. Werner, University of Texas at El Paso, stevenwerner@acm.org
Nigel G. Ward, University of Texas at El Paso, nigelward@acm.org
ABSTRACT

Prosody is important in spoken language, and especially in dialog, but its utility for search in dialog archives has remained an open question. Using prosody-based measures of similarity, which also roughly correlate with dialog-activity similarity and topic similarity, we built support for "retrieve more like this" searches. Performance on the Similar Segments in Social Speech Task at MediaEval 2013 was well above baseline, showing the value of prosody for search.

Copyright is held by the authors. MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

1. INTRODUCTION

In most cases people searching in audio are probably not really interested in finding words. What people want is often information of some type, which may be characterized in part by dialog process or activity, for example recommending, answering a question, agreeing, forming a decision, telling life stories, making plans, hearing surprising statements, giving advice, explaining, and so on. In dialog, such activities and topics are often associated with characteristic prosodic features and patterns.

Our basic idea is to use a vector-space model of dialog activity, where each moment in time maps to a point in this space. This representation is obtained by applying Principal Component Analysis to 78 local prosodic features, computed every 10 ms over a 6-second sliding window [2]. This feature set was chosen for simplicity of computation and for providing coverage of most of the prosodic aspects known to be most relevant for dialog. It resembles that used in [2], but with more volume features and fewer pitch features, more speaker features and fewer interlocutor features, and more narrow-window features close to the point of interest and fewer distant-context features. After PCA this gave 78 dimensions, ordered by how much of the variation they explained.

In previous work [4] we found that dialog timepoints which were proximal in this space tended to be similar not only in dialog activity but in topic as well. Here we extend this work to use better similarity models, and report positive results on a standard problem, namely the Similar Segments in Social Speech Task at MediaEval 2013, for which the task definition, data set, and evaluation metrics may be found in [5].

2. THE MODELS

The similar segments task is based on regions, but the dialog-space model is based on timepoints. For simplicity, the middle point of the query region is used as the characteristic point. The most similar (proximal) timepoints, across the entire corpus, are then found and returned, in order, as the ranked list of jump-in points.

We started with a similarity metric using simple Euclidean distance in the vector space, as described in [4]. However, we observed that some of the dimensions seemed especially useful for the similarity computations and/or more revealing of dialog activities. We wanted our models to reflect this, with greater weights for such dimensions. Doing so sacrifices the distance metaphor, but is computationally similar. Specifically, for any two points in a dialog, x and y, we compute a weighted sum of their differences over the dimensions:

    dissimilarity(x, y) = Σ_{i=1}^{78} w_i |x_i − y_i|        (1)

First we tried this with uniform weights, giving the "dissim" results in the tables. We then tried optimized weights, trained using linear regression, where the target was a distance of 0 if x and y were similar, and 1 if they were not similar. Thus, for example, if two selected timepoints x and y were both located in regions that had been tagged as talk about "favorite movies," then x and y were counted as similar. If x and y shared no tags, they were counted as not similar. This is of course not ideal, since a point-pair might be similar even if not belonging to regions that were felt to be worth tagging. Sets of similar and non-similar timepoint-pairs were obtained by random sampling over the training set.

For sampling we experimented with various more restrictive definitions of similar. One type of constraint was to require agreement by at least some number of annotators in order to consider a timepoint pair as similar. For this the label names were ignored (as always), and so the annotators might have considered the points to be similar in entirely different ways. The second type of constraint relied on the utility values ("weights") assigned by the annotators to their tags, higher the more informative and cohesive they thought the tagset was. For example, in one sampling we included only pairs whose connecting tag was rated 3, excluding those rated 0, 1, or 2 [3]. Requiring higher tag weights and more agreement gave higher-quality training data, but at the cost of reducing the quantity of similar point-pairs available to train with.
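The weighted dissimilarity of Equation (1), the regression-based weight training, and the ranking of jump-in points described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the synthetic vectors stand in for the real 78-dimensional PCA features, and the pair sampling and labels are assumed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def dissimilarity(x, y, w):
    """Equation (1): weighted sum of absolute differences over the 78 dimensions."""
    return np.sum(w * np.abs(x - y))

def jump_in_points(query_mid, corpus, w, k=5):
    """Rank all corpus timepoints by dissimilarity to the query region's midpoint,
    returning the indices of the k most similar (proximal) points."""
    d = np.abs(corpus - query_mid) @ w
    return np.argsort(d)[:k]

# Synthetic stand-in data (hypothetical; real features come from PCA of prosody).
rng = np.random.default_rng(0)
n_dims, n_pairs = 78, 500
xs = rng.normal(size=(n_pairs, n_dims))          # sampled timepoints x
ys = rng.normal(size=(n_pairs, n_dims))          # paired timepoints y
targets = rng.integers(0, 2, size=n_pairs)       # 0 = similar (shared tag), 1 = not

# Uniform weights give the "dissim" baseline model.
w_uniform = np.ones(n_dims)

# Optimized weights: regress the 0/1 target on the per-dimension absolute
# differences; the regression coefficients become the weights w_i.
features = np.abs(xs - ys)
w_trained = LinearRegression().fit(features, targets).coef_

# The "+" models: drop (zero out) any negatively-weighted dimension.
w_pruned = np.where(w_trained > 0, w_trained, 0.0)

# Retrieval for one query: midpoint of the query region vs. the whole corpus.
corpus = rng.normal(size=(1000, n_dims))
print(jump_in_points(xs[0], corpus, w_uniform))
```

In the real system the corpus points are the per-10-ms dialog-space vectors, so the ranked indices correspond directly to candidate jump-in times.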
We also experimented with pruning the dimensions, using two feature selection methods. This was prompted by the observation that linear regression consistently gave negative weights to some of the dimensions, for example dimension 67, which, when we listened to it, seemed to encode the difference between calm, indifferent speech and energetic explaining. The first method was to leave a dimension out of the model (set its weight to zero) and, if that improved performance on a held-out subset of the training data, to drop it from the set. This was iterated, typically resulting in dropping about a third of the dimensions. The second approach was to simply drop any dimension to which regression assigned a negative weight.

3. RESULTS AND DISCUSSION

The tables show the results* for the four models which performed best on the training set and four reference models: the baseline, where the jump-in points for each query are randomly selected; a tagset-exploiting model, where jump-in points are found by considering other annotators' tags on regions that overlap the query region; the Euclidean distance model; and a model based on uniform-weight dissimilarity, that is, like distance but using absolute values instead of squared differences. We used the tagset-exploiting model as a likely upper bound on performance, as it is akin to how a second human might perform the search task. For the best models, performance is far above baseline, showing that information retrieval can indeed benefit from prosodic information.

*From the point of view of the competition, these results are all unofficial, since the authors, being also the competition organizers, had privileged access to the data.

Table 1: Performance on the Training Set. all = trained using all training-data similarity sets; good = trained only on point-pairs which were in same-tagged regions according to at least three annotators; p = iterative leave-one-out pruning applied to dimensions; + = only positively-weighted dimensions retained; s.u.r. = speaker utility ratio.

  model     naive prec.  raw recall  raw s.u.r.  norm. s.u.r.  norm. recall    F
  Random        6%          23%        0.25          0.86          0.83       0.86
  Expl.        16%          46%        0.43          1.49          1.67       1.50
  Distance      3%          26%        0.21          0.74          0.96       0.76
  Dissim.       4%          26%        0.22          0.76          0.96       0.78
  all+          6%          31%        0.26          0.89          1.12       0.91
  good+         6%          34%        0.27          0.94          1.25       0.97
  all-p+        7%          32%        0.27          0.92          1.17       0.94
  good-p+       7%          34%        0.28          0.96          1.24       0.98

Table 2: Performance on the Test Set, as above.

  model     naive prec.  raw recall  raw s.u.r.  norm. s.u.r.  norm. recall    F
  Random        7%          11%        0.12          0.43          0.40       0.43
  Expl.         9%          18%        0.29          1.00          0.67       0.95
  Distance      3%          16%        0.12          0.41          0.57       0.42
  Dissim.       6%          22%        0.17          0.58          0.81       0.60
  all+          7%          28%        0.22          0.75          1.03       0.77
  good+         6%          22%        0.17          0.60          0.81       0.61
  all-p+        7%          30%        0.22          0.77          1.08       0.79
  good-p+       7%          26%        0.20          0.69          0.93       0.71

These results are, however, weaker than those that can be obtained by using lexical features. Perhaps in this corpus topical similarity was more relevant than functional similarity, and perhaps lexical models are better for topic similarity. Thus prosodic models may still be of value, as is, for languages for which speech recognizers are not available or perform poorly. We further conjecture that the prosody is capturing dimensions of similarity not seen in lexical similarity, and therefore that a combined model could do even better. Exploring this is a priority for future research.

The effects of using higher-quality training data varied with the test set: on the training set, using the good-quality set gave the best performance, but on the test set the model trained using all the data performed best. Pruning was generally beneficial, with dropping negatively-weighted dimensions being the most useful, with some additional benefit from also selectively dropping dimensions.

Looking at robustness to changes in the data, the picture is clouded by the fact that the test set was harder in terms of recall (because the target regions, like all regions in this set, tended to be shorter and thus harder to find). Nevertheless, on the test set the best model's performance was still far above baseline, showing a degree of generalizability.

Although the potential utility of prosody for search has long been discussed [1], and demonstrations of the relevance of prosody for inferring emotion and dialog acts are common, here we demonstrate, for the first time, that prosodic information, used by itself, is actually of value for search in audio archives.

4. ACKNOWLEDGMENTS

We thank the National Science Foundation for support via an REU supplement to Award IIS-0914868, and Olac Fuentes.

5. REFERENCES

[1] D. Hakkani-Tur, G. Tur, A. Stolcke, and E. E. Shriberg. Combining words and prosody for information extraction from speech. In Proc. Eurospeech, vol. 5, pages 1991–1994, 1999.
[2] N. G. Ward and A. Vega. A bottom-up exploration of the dimensions of dialog state in spoken interaction. In 13th Annual SIGdial Meeting on Discourse and Dialogue, 2012.
[3] N. G. Ward and S. D. Werner. Data collection for the Similar Segments in Social Speech task. University of Texas at El Paso, Technical Report UTEP-CS-13-58, 2013.
[4] N. G. Ward and S. D. Werner. Using dialog-activity similarity for spoken information retrieval. In Interspeech, 2013.
[5] N. G. Ward, S. D. Werner, D. G. Novick, T. Kawahara, E. E. Shriberg, L.-P. Morency, and C. Oertel. The Similar Segments in Social Speech task. In MediaEval Workshop, 2013.