=Paper=
{{Paper
|id=None
|storemode=property
|title=Rich Speech Retrieval Using Query Word Filter
|pdfUrl=https://ceur-ws.org/Vol-807/wartena_NOVAY-TUD_RSR_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WartenaL11
}}
==Rich Speech Retrieval Using Query Word Filter==
Christian Wartena* (Univ. of Applied Sciences and Arts Hannover, Hannover, Germany; christian.wartena@fh-hannover.de)
Martha Larson (Delft University of Technology, Delft, the Netherlands; m.a.larson@tudelft.nl)

* At the time the work presented here was done the author was affiliated with Novay, Enschede (The Netherlands) and Delft University of Technology.

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1–2, 2011, Pisa, Italy.

ABSTRACT
Rich Speech Retrieval performance improves when general query-language words are filtered and both speech recognition transcripts and metadata are indexed via BM25F(ields).

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms: Algorithms, Experimentation

Keywords: Spoken content retrieval, Query word classification

1. INTRODUCTION
Our Rich Speech Retrieval (RSR) approach filters words in the query into two categories and treats each separately. RSR is a known-item task that involves returning a ranked list of jump-in points in response to a user query describing a segment of video in which someone is speaking. The queries are given in two formulations: a long form consisting of a natural language description of what the known item is about (ca. one sentence in length) and a short form consisting of a keyword version of the query as it might be issued to a general-purpose search engine. The video corpus used contains Creative Commons content collected from blip.tv, and the spoken channel is a mixture of planned and spontaneous speech. Although visual features might prove helpful for some RSR queries, here we investigate only the use of ASR-transcripts and metadata. Note that although the known-items targeted in the RSR task correspond to particular speech acts, we did not investigate this aspect here. More details on the RSR task are available in [4].

We conjecture that users' queries are a mixture of two distinct types of language: general query language and primary language. General query language is language the users always use when formulating queries for videos during a search session with a general search engine (e.g., video, episode, show). Our conjecture is based on informal observation of user query behavior. It is supported by a user study of podcast search behavior [1] during which subjects reported adding general words such as 'podcast', 'audio' or 'mp3' to queries when looking for podcasts using a general search engine. Primary language is query language that echoes the words of the person who is speaking in the relevant video segment. We assume that automatic speech recognition (ASR) transcripts will help us match primary language in queries with jump-in points, but that general query language found in ASR-transcripts is less likely to be specifically relevant to the user's information need. We describe each of our algorithms, report results and end with conclusion and outlook.

2. EXPERIMENTAL FRAMEWORK
In this section, we describe our approaches to RSR. For all runs, we produce our ranked list of jump-in points using a standard IR algorithm to retrieve video fragments that have been defined on the basis of the ASR-transcripts. We return the start point of each fragment as the jump-in point. Fragments are defined as a sequence of sentences containing about 40 non-stop-words. Sentences are derived on the basis of punctuation (full stop = sentence end), which is hypothesized by the recognizer and included in the output of the ASR system. If a sentence is less than 40 words in length, subsequent sentences are added until it approximately meets this target. Mark Hepple's [2] part-of-speech (POS) tagger is used to tag and lemmatize all words. We remove all closed-class words (i.e., prepositions, articles, auxiliaries, particles, etc.). To compensate for POS tagging errors, we additionally remove English and Dutch stop words (standard Lucene search engine stopword lists). Word and sentence segmentation, POS tagging and term selection are implemented as a UIMA (http://uima.apache.org) analysis pipeline.

We carry out ranking using BM25 [5]. Since fragments may overlap, we calculate idf (Eq. 1) on the basis of the sentence, the basic organizational unit of the speech channel,

  idf(t) = log((N - df_t + 0.5) / (df_t + 0.5)).    (1)

Here, N is the total number of fragments, and df_t is the number of fragments in which term t occurs. The weight of each term in each fragment-document is given by w(d, t),

  w(d, t) = idf(t) * ((k + 1) * f_dt) / (f_dt + k * (1 - b + b * l_d / avgdl)),    (2)

where f_dt is the number of occurrences of term t in document d, l_d is the length of d, and avgdl is the average document length. In our experiments, we set k = 2 and b = 0.75. The retrieval status value (RSV) of a document for a query consisting of more than one word is defined as

  w(d, Q) = Σ_{t ∈ Q} w(d, t).    (3)

Note that each query word contributes once to the sum, i.e., repetition of query words is ignored.

We create an initial ranking by ordering all fragments by their RSV values (Eq. 3). In order to generate the final results list, we remove all fragments with a starting time within a window of 600 seconds of a higher ranked fragment. The approaches used by our runs are shown in Table 1.
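The ranking of Eqs. 1–3 can be sketched as a short, runnable example. This is an illustrative toy, not the authors' implementation: fragments are represented as token lists, the invented fragment and query contents are placeholders, and for simplicity idf is computed over the same fragment list rather than over sentences; k = 2 and b = 0.75 as in the paper.

```python
import math

K, B = 2.0, 0.75  # BM25 parameters used in the paper

def idf(term, fragments):
    # Eq. 1: N fragments in total, df_t fragments containing the term.
    n = len(fragments)
    df = sum(1 for frag in fragments if term in frag)
    return math.log((n - df + 0.5) / (df + 0.5))

def term_weight(term, frag, fragments, avgdl):
    # Eq. 2: BM25 weight of one term in one fragment-document.
    f_dt = frag.count(term)
    l_d = len(frag)
    return idf(term, fragments) * (K + 1) * f_dt / (
        f_dt + K * (1 - B + B * l_d / avgdl))

def rsv(query, frag, fragments):
    # Eq. 3: sum over *distinct* query terms; repeated query words count once.
    avgdl = sum(len(f) for f in fragments) / len(fragments)
    return sum(term_weight(t, frag, fragments, avgdl) for t in set(query))

# Invented example data (not from the paper's corpus).
fragments = [
    ["speech", "retrieval", "podcast", "search"],
    ["video", "segment", "jump", "point"],
    ["speech", "act", "genre", "tagging"],
]
query = ["jump", "point", "point"]  # the repeated word is ignored (Eq. 3)
ranking = sorted(range(len(fragments)),
                 key=lambda i: rsv(query, fragments[i], fragments),
                 reverse=True)
# fragments[ranking[0]] is the top-ranked fragment; its start time would be
# returned as the jump-in point.
```

The set() in rsv is what implements the note under Eq. 3: scoring the query ["jump", "point", "point"] gives exactly the same RSV as ["jump", "point"].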
Table 1: Description of RSR runs
  Run ID | Query        | Fields         | Filtering
  1      | full         | ASR            | no
  2      | full + short | ASR            | no
  4      | full + short | ASR + metadata | no
  5a     | full + short | ASR + metadata | dx(q) > 200
  5b     | full + short | ASR + metadata | weighted

In runs 4 and 5 we use metadata (descriptions, title and tags) along with the ASR-transcripts. These runs make use of the BM25 extension known as BM25F(ields) [6],

  w(d, Q) = Σ_{t ∈ Q, f ∈ F} w_f * w(d_f, t).    (4)

Here, F is a set of fields, d_f is the part of document d labeled as field f, and w_f is the weight for field f. In our experiments we use w_f = 1 for the ASR field and w_f = 0.5 for all other fields. Tests on the development set showed that results are not particularly sensitive to the exact value; we used 0.5 since it gave the best results.

In runs 5, we applied a query word filter built using a corpus of 3,400 requests for video made by users on Yahoo! Answers, cf. [3]. In run 5a, we removed from the queries the most frequent words occurring in the corpus (83 terms with frequency over 200 were removed). In run 5b, terms frequent across requests in the corpus were given lower weights. We implemented this downweighting by replacing Eq. 1 by

  idf'(t) = α * log((N - df_t + 0.5) / (df_t + 0.5)) + (1 - α) * log((N_req - req_ft + 0.5) / (req_ft + 0.5)),    (5)

where N_req is the number of requests in the corpus and req_ft is the number of requests in which term t occurs. In the reported runs we set α = 0.5.

3. RESULTS AND CONCLUSION
Our results are reported in Table 2, given in terms of the mean Generalized Reciprocal Rank (mGRR) [4] with tolerance windows of 10, 30 and 60 seconds.

Table 2: Results reported in terms of mGRR
  Run ID | 10   | 30   | 60
  1      | 0.24 | 0.33 | 0.38
  2      | 0.22 | 0.34 | 0.40
  4      | 0.23 | 0.34 | 0.39
  5a     | 0.24 | 0.36 | 0.41
  5b     | 0.28 | 0.39 | 0.45

In general, larger tolerance windows correspond to larger scores. However, whether adding the short query improves performance (cf. run 1 vs. 2) varies depending on the tolerance window used. Note that the statistical significance of this difference remains to be checked.

We can see that filtering or downweighting general query-language words (e.g., video and tv) can indeed improve results. Downweighting has a larger impact, suggesting that general query-language words should not be treated by extending a conventional stop word list for application in video retrieval. No appreciable difference was observed between using ASR transcripts alone and using both ASR transcripts and metadata in the conventional case in which query words are all treated the same (cf. run 2 vs. run 4). Apparently, separate treatment for different types of query words is particularly important to fully exploit the contribution of metadata (cf. run 2 vs. run 5b). In the experiments, we find that adding the query-language downweighting slightly improves the results of very many queries, as long as they already performed reasonably well without downweighting. However, a number of queries fail completely. An investigation of these cases carried out by hand revealed that failure was in most cases due to vocabulary mismatch between query and target item, suggesting that performance would benefit from the use of conventional techniques for query expansion.

Future work will focus on developing more sophisticated models for general-language query words. Additionally, we will attempt to model query words that are 'primary', i.e., more likely to occur in spontaneously produced and/or direct speech and less likely to occur in the descriptive or indirect descriptions of the video in the metadata.

Acknowledgments
The research leading to these results has received funding from the European Commission's 7th Framework Programme (FP7) under grant agreement no. 216444 (EU PetaMedia Network of Excellence).

4. REFERENCES
[1] J. Besser, M. Larson, and K. Hofmann. Podcast search: User goals and retrieval technologies. Online Information Review, 34(3), 2010.
[2] M. Hepple. Independence and commitment: Assumptions for rapid training and execution of rule-based POS taggers. In ACL, 2000.
[3] C. Kofler, M. Larson, and A. Hanjalic. To seek, perchance to fail: Expressions of user needs in internet video search. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, ECIR'11, pages 611–616, Berlin, Heidelberg, 2011. Springer-Verlag.
[4] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy, September 1–2, 2011.
[5] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 1st edition, July 2008.
[6] S. E. Robertson, H. Zaragoza, and M. J. Taylor. Simple BM25 extension to multiple weighted fields. In D. A. Grossman, L. Gravano, C. Zhai, O. Herzog, and D. A. Evans, editors, CIKM, pages 42–49. ACM, 2004.
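The run-5b downweighting of Eq. 5 can be sketched as follows. This is an illustrative toy, not the authors' code: the tiny request corpus below is an invented stand-in for the 3,400 Yahoo! Answers video requests, and α = 0.5 as in the reported runs.

```python
import math

ALPHA = 0.5  # interpolation weight used in the reported runs

def mixed_idf(term, fragments, requests, alpha=ALPHA):
    # Eq. 5: interpolate the collection idf (Eq. 1) with an idf computed
    # over the request corpus. Terms frequent across requests (general
    # query language such as 'video') get a lower weight; terms rare in
    # requests keep most of their weight.
    n, n_req = len(fragments), len(requests)
    df = sum(1 for frag in fragments if term in frag)
    req_f = sum(1 for req in requests if term in req)
    return (alpha * math.log((n - df + 0.5) / (df + 0.5))
            + (1 - alpha) * math.log((n_req - req_f + 0.5) / (req_f + 0.5)))

# Invented example data (not from the paper's collections).
fragments = [["speech", "retrieval"], ["video", "segment"], ["video", "speech"]]
requests = [["video", "episode"], ["video", "show"],
            ["podcast", "audio"], ["video", "clip"]]

# 'video' occurs in 3 of 4 requests, so its weight drops below its plain idf;
# 'retrieval' never occurs in a request, so its weight rises above it.
```

Run 5a corresponds to the harder variant of the same signal: instead of interpolating, terms whose request-corpus frequency exceeds a threshold are simply dropped from the query.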