Evaluating Prosody-Based Similarity Models for Information Retrieval

Steven D. Werner
University of Texas at El Paso
stevenwerner@acm.org

Nigel G. Ward
University of Texas at El Paso
nigelward@acm.org

Copyright is held by the authors. MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT

Prosody is important in spoken language, and especially in dialog, but its utility for search in dialog archives has remained an open question. Using prosody-based measures of similarity, which also roughly correlate with dialog-activity similarity and topic similarity, we built support for "retrieve more like this" searches. Performance on the Similar Segments in Social Speech Task at MediaEval 2013 was well above baseline, showing the value of prosody for search.

1. INTRODUCTION

In most cases people searching in audio are probably not really interested in finding words. What people want is often information of some type, which may be characterized in part by dialog process or activity, for example recommending, answering a question, agreeing, forming a decision, telling life stories, making plans, hearing surprising statements, giving advice, explaining, and so on. In dialog, such activities and topics are often associated with characteristic prosodic features and patterns.

Our basic idea is to use a vector-space model of dialog activity, where each moment in time maps to a point in this space. This representation is obtained by applying Principal Component Analysis to 78 local prosodic features, computed every 10 ms over a 6-second sliding window [2]. This feature set was chosen for simplicity of computation and for providing coverage of most of the prosodic aspects known to be most relevant for dialog. It resembles that used in [2], but with more volume features and fewer pitch features, more speaker features and fewer interlocutor features, and more narrow-window features close to the point of interest and fewer distant-context features. After PCA this gave 78 dimensions, ordered by how much of the variation they explained.

In previous work [4] we found that dialog timepoints which were proximal in this space tended to be similar not only in dialog activity but in topic as well. Here we extend this work to use better similarity models, and report positive results on a standard problem, namely the Similar Segments in Social Speech Task at MediaEval 2013, for which the task definition, data set, and evaluation metrics may be found in [5].

2. THE MODELS

The similar segments task is based on regions, but the dialog-space model is based on timepoints. For simplicity, the middle point of the query region is used as the characteristic point. The most similar (proximal) timepoints, across the entire corpus, are then found and returned, in order, as the ranked list of jump-in points.

We started with a similarity metric using simple Euclidean distance in the vector space, as described in [4]. However, we observed that some of the dimensions seemed especially useful for the similarity computations and/or more revealing of dialog activities. We wanted our models to reflect this, with greater weights for such dimensions. Doing so sacrifices the distance metaphor, but is computationally similar. Specifically, for any two points in a dialog, x and y, we compute a weighted sum of their differences on the dimensions:

\[ \mathrm{dissimilarity} = \sum_{i=1}^{78} w_i \, |x_i - y_i| \tag{1} \]

First we tried this with uniform weights, giving the "Dissim." results in the tables. We then tried optimized weights, trained using linear regression, where the target was a distance of 0 if x and y were similar, and 1 if they were not similar. Thus, for example, if two selected timepoints x and y were both located in regions that had been tagged as talk about "favorite movies," then x and y were counted as similar. If x and y shared no tags, they were counted as not similar. This is of course not ideal, since a point-pair might be similar even if not belonging to regions that were felt to be worth tagging. Sets of similar and non-similar timepoint-pairs were obtained by random sampling over the training set.

For sampling we experimented with various more restrictive definitions of similar. One type of constraint was to require agreement by at least some number of annotators in order to consider a timepoint pair as similar. For this the label names were ignored (as always), and so the annotators might have considered the points to be similar in entirely different ways. The second type of constraint relied on the utility values ("weights") assigned by the annotators to their tags, which were higher the more informative and cohesive they thought the tagset was. For example, in one sampling we included only pairs whose connecting tag was rated 3, excluding those rated 0, 1, or 2 [3]. Requiring higher tag weights and more agreement gave higher-quality training data, but at the cost of reducing the quantity of similar point-pairs available to train with.

We also experimented with pruning the dimensions, using two feature selection methods. This was prompted by the observation that linear regression consistently gave negative weights to some of the dimensions, for example dimension 67, which, when we listened to it, seemed to encode the difference between calm, indifferent speech and energetic explaining. The first method was to try leaving a dimension out of the model (setting its weight to zero), and if that improved performance on a held-out subset of the training data, to drop it from the set. This was iterated, typically resulting in dropping about a third of the dimensions. The second approach was to simply drop any dimension to which regression assigned a negative weight.

3. RESULTS AND DISCUSSION

The tables show the results¹ for the four models which performed best on the training set and four reference models: the baseline, where the jump-in points for each query are randomly selected; a tagset-exploiting model, where jump-in points are found by considering tags by other annotators with regions that overlap the query region; the Euclidean distance model; and a model based on uniform-weight dissimilarity, that is, like distance but using absolute-value instead of squared differences. We used the tagset-exploiting model as a likely upper bound on performance, as it is akin to how a second human might themselves perform the search task. For the best models, performance is far above baseline, showing that information retrieval can indeed benefit from using prosodic information.

          naive   raw     raw     norm.   norm.
model     prec.   recall  s.u.r   s.u.r   recall  F
Random    6%      23%     0.25    0.86    0.83    0.86
Expl.     16%     46%     0.43    1.49    1.67    1.50
Distance  3%      26%     0.21    0.74    0.96    0.76
Dissim.   4%      26%     0.22    0.76    0.96    0.78
all+      6%      31%     0.26    0.89    1.12    0.91
good+     6%      34%     0.27    0.94    1.25    0.97
all-p+    7%      32%     0.27    0.92    1.17    0.94
good-p+   7%      34%     0.28    0.96    1.24    0.98

Table 1: Performance on Training Set. all = trained using all training-data similarity sets; good = trained on only point-pairs which were in same-tagged regions according to at least three annotators; p = iterative-leave-one-out pruning applied to dimensions; + = only positively-weighted dimensions retained; s.u.r = speaker utility ratio.

          naive   raw     raw     norm.   norm.
model     prec.   recall  s.u.r   s.u.r   recall  F
Random    7%      11%     0.12    0.43    0.40    0.43
Expl.     9%      18%     0.29    1.00    0.67    0.95
Distance  3%      16%     0.12    0.41    0.57    0.42
Dissim.   6%      22%     0.17    0.58    0.81    0.60
all+      7%      28%     0.22    0.75    1.03    0.77
good+     6%      22%     0.17    0.60    0.81    0.61
all-p+    7%      30%     0.22    0.77    1.08    0.79
good-p+   7%      26%     0.20    0.69    0.93    0.71

Table 2: Performance on the Test Set, as above.

These results are, however, weaker than those that can be obtained by using lexical features. Perhaps in this corpus topical similarity was more relevant than functional similarity, and perhaps lexical models are better for topic similarity. Thus prosodic models may still be of value, as is, for languages for which speech recognizers are not available or perform poorly. We further conjecture that the prosody is capturing dimensions of similarity not seen in lexical similarity, and therefore that a combined model could do even better. Exploring this is a priority for future research.

The effects of using higher-quality training data varied with the data set: on the training set, using the good-quality set gave the best performance, but on the test set the model trained using all the data performed best. Pruning was generally beneficial: dropping dimensions with negative weights was the most useful, with some additional benefit from also selectively dropping dimensions.

Looking at robustness to changes in the data, the picture is clouded by the fact that the test set was harder, in terms of recall (because the target regions, like all regions in this set, tended to be shorter and thus harder to find). Nevertheless, on the training set the best model's performance was still far above baseline, showing a degree of generalizability.

Although the potential utility of prosody for search has long been discussed [1], and demonstrations of the relevance of prosody for inferring emotion and dialog acts are common, here we demonstrate, for the first time, that prosodic information, used by itself, is actually of value for search in audio archives.

4. ACKNOWLEDGMENTS

We thank the National Science Foundation for support via a REU supplement to Award IIS-0914868, and Olac Fuentes.

5. REFERENCES

[1] D. Hakkani-Tur, G. Tur, A. Stolcke, and E. E. Shriberg. Combining words and prosody for information extraction from speech. In Proc. Eurospeech, vol. 5, pages 1991–1994, 1999.
[2] N. G. Ward and A. Vega. A bottom-up exploration of the dimensions of dialog state in spoken interaction. In 13th Annual SIGdial Meeting on Discourse and Dialogue, 2012.
[3] N. G. Ward and S. D. Werner. Data collection for the Similar Segments in Social Speech task. University of Texas at El Paso, Technical Report, UTEP-CS-13-58, 2013.
[4] N. G. Ward and S. D. Werner. Using dialog-activity similarity for spoken information retrieval. In Interspeech, 2013.
[5] N. G. Ward, S. D. Werner, D. G. Novick, T. Kawahara, E. E. Shriberg, L.-P. Morency, and C. Oertel. The similar segments in social speech task. In MediaEval Workshop, 2013.

¹ From the point of view of the competition, these results are all unofficial, since the authors, being also the competition organizers, had privileged access to the data.
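
As an illustration of the models in Section 2, the weighted dissimilarity of Equation 1, the ranking of candidate timepoints, the regression-trained weights, and the negative-weight pruning might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: all function names are hypothetical, ordinary least squares without an intercept stands in for the unspecified "linear regression", pruned dimensions are zeroed without refitting, and the demonstration data are random numbers rather than prosodic features.

```python
import numpy as np

N_DIMS = 78  # number of PCA dimensions reported in the paper


def dissimilarity(x, y, w):
    """Equation 1: weighted sum of absolute per-dimension differences."""
    return float(np.sum(w * np.abs(x - y)))


def rank_jump_in_points(query, corpus, w, top_k=10):
    """Return indices of the top_k corpus timepoints, most similar first.

    query  : (n_dims,) vector for the midpoint of the query region
    corpus : (n_timepoints, n_dims) matrix, one row per candidate timepoint
    """
    scores = np.abs(corpus - query) @ w  # Equation 1 for every row at once
    return np.argsort(scores, kind="stable")[:top_k]


def train_weights(pairs_x, pairs_y, is_similar):
    """Least-squares fit of w; target 0 for similar pairs, 1 otherwise.

    Assumption: no intercept term (the paper does not say either way).
    """
    features = np.abs(pairs_x - pairs_y)
    targets = np.where(is_similar, 0.0, 1.0)
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w


def drop_negative(w):
    """Second pruning method: zero the weight of negatively weighted dims.

    Assumption: remaining weights are kept as they are, not refit.
    """
    return np.where(w < 0, 0.0, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(500, N_DIMS))  # synthetic stand-in data
    uniform = np.ones(N_DIMS)                # the "Dissim." setting
    top = rank_jump_in_points(corpus[42], corpus, uniform, top_k=5)
    print(top)  # the query timepoint itself ranks first, at dissimilarity 0
```

In an actual retrieval setting, timepoints inside the query region itself would presumably be excluded from the candidates, which this sketch does not do.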