=Paper=
{{Paper
|id=None
|storemode=property
|title=Rich Speech Retrieval Using Query Word Filter
|pdfUrl=https://ceur-ws.org/Vol-807/wartena_NOVAY-TUD_RSR_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WartenaL11
}}
==Rich Speech Retrieval Using Query Word Filter==
Christian Wartena* (Univ. of Applied Sciences and Arts Hannover, Hannover, Germany; christian.wartena@fh-hannover.de)
Martha Larson (Delft University of Technology, Delft, the Netherlands; m.a.larson@tudelft.nl)

* At the time the work presented here was done the author was affiliated with Novay, Enschede (The Netherlands) and Delft University of Technology.

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1–2, 2011, Pisa, Italy.

ABSTRACT
Rich Speech Retrieval performance improves when general query-language words are filtered and both speech recognition transcripts and metadata are indexed via BM25F(ields).

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms: Algorithms, Experimentation

Keywords: Spoken content retrieval, Query word classification

1. INTRODUCTION
Our Rich Speech Retrieval (RSR) approach filters words in the query into two categories and treats each separately. RSR is a known-item task that involves returning a ranked list of jump-in points in response to a user query describing a segment of video in which someone is speaking. The queries are given in two formulations: a long form consisting of a natural language description of what the known item is about (ca. one sentence in length) and a short form consisting of a keyword version of the query as it might be issued to a general-purpose search engine. The video corpus used contains Creative Commons content collected from blip.tv, and the spoken channel is a mixture of planned and spontaneous speech. Although visual features might prove helpful for some RSR queries, here we investigate only the use of ASR-transcripts and metadata. Note that although the known-items targeted in the RSR task correspond to particular speech acts, we did not investigate this aspect here. More details on the RSR task are available in [4].

We conjecture that users' queries are a mixture of two distinct types of language: general query language and primary language. General query language is language the users always use when formulating queries for videos during a search session with a general search engine (e.g., video, episode, show). Our conjecture is based on informal observation of user query behavior. It is supported by a user study of podcast search behavior [1] during which subjects reported adding general words such as 'podcast', 'audio' or 'mp3' to queries when looking for podcasts using a general search engine. Primary language is query language that echoes the words of the person who is speaking in the relevant video segment. We assume that automatic speech recognition (ASR) transcripts will help us match primary language in queries with jump-in points, but that general query language found in ASR-transcripts is less likely to be specifically relevant to the user's information need. We describe each of our algorithms, report results and end with conclusion and outlook.

2. EXPERIMENTAL FRAMEWORK
In this section, we describe our approaches to RSR. For all runs, we produce our ranked list of jump-in points using a standard IR algorithm to retrieve video fragments that have been defined on the basis of the ASR-transcripts. We return the start point of each fragment as the jump-in point. Fragments are defined as a sequence of sentences containing about 40 non-stop-words. Sentences are derived on the basis of punctuation (full stop = sentence end), which is hypothesized by the recognizer and included in the output of the ASR system. If a sentence is less than 40 words in length, subsequent sentences are added until it approximately meets this target. Mark Hepple's [2] part-of-speech (POS) tagger is used to tag and lemmatize all words. We remove all closed-class words (i.e., prepositions, articles, auxiliaries, particles, etc.). To compensate for POS tagging errors, we additionally remove English and Dutch stop words (standard Lucene search engine stopword lists). Word and sentence segmentation, POS tagging and term selection are implemented as a UIMA (http://uima.apache.org) analysis pipeline.

We carry out ranking using BM25 [5]. Since fragments may overlap, we calculate idf (Eq. 1) on the basis of the sentence, the basic organizational unit of the speech channel,

  idf(t) = log((N - df_t + 0.5) / (df_t + 0.5)).    (1)

Here, N is the total number of fragments, and df_t is the number of fragments in which term t occurs. The weight of each term in each fragment-document is given by w(d, t),

  w(d, t) = idf(t) * ((k + 1) * f_dt) / (f_dt + k * (1 - b + b * l_d / avgdl)),    (2)

where f_dt is the number of occurrences of term t in document d, l_d is the length of d, and avgdl is the average document length. In our experiments, we set k = 2 and b = 0.75. The retrieval status value (RSV) of a document for a query consisting of more than one word is defined as

  w(d, Q) = Σ_{t ∈ Q} w(d, t).    (3)

Note that each query word contributes once to the sum, i.e., repetition of query words is ignored.

We create an initial ranking by ordering all fragments by their RSV values (Eq. 3). In order to generate the final results list, we remove all fragments with a starting time within a window of 600 seconds of a higher ranked fragment. The approaches used by our runs are shown in Table 1.
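The ranking of Eqs. 1–3 can be sketched as a short, runnable example. This is an illustrative toy, not the authors' implementation: fragments are represented as token lists, the invented fragment and query contents are placeholders, and for simplicity idf is computed over the same fragment list rather than over sentences; k = 2 and b = 0.75 as in the paper.

```python
import math

K, B = 2.0, 0.75  # BM25 parameters used in the paper

def idf(term, fragments):
    # Eq. 1: N fragments in total, df_t fragments containing the term.
    n = len(fragments)
    df = sum(1 for frag in fragments if term in frag)
    return math.log((n - df + 0.5) / (df + 0.5))

def term_weight(term, frag, fragments, avgdl):
    # Eq. 2: BM25 weight of one term in one fragment-document.
    f_dt = frag.count(term)
    l_d = len(frag)
    return idf(term, fragments) * (K + 1) * f_dt / (
        f_dt + K * (1 - B + B * l_d / avgdl))

def rsv(query, frag, fragments):
    # Eq. 3: sum over *distinct* query terms; repeated query words count once.
    avgdl = sum(len(f) for f in fragments) / len(fragments)
    return sum(term_weight(t, frag, fragments, avgdl) for t in set(query))

# Invented example data (not from the paper's corpus).
fragments = [
    ["speech", "retrieval", "podcast", "search"],
    ["video", "segment", "jump", "point"],
    ["speech", "act", "genre", "tagging"],
]
query = ["jump", "point", "point"]  # the repeated word is ignored (Eq. 3)
ranking = sorted(range(len(fragments)),
                 key=lambda i: rsv(query, fragments[i], fragments),
                 reverse=True)
# fragments[ranking[0]] is the top-ranked fragment; its start time would be
# returned as the jump-in point.
```

The set() in rsv is what implements the note under Eq. 3: scoring the query ["jump", "point", "point"] gives exactly the same RSV as ["jump", "point"].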
Table 1: Description of RSR runs
  Run ID | Query        | Fields         | Filtering
  1      | full         | ASR            | no
  2      | full + short | ASR            | no
  4      | full + short | ASR + metadata | no
  5a     | full + short | ASR + metadata | dx(q) > 200
  5b     | full + short | ASR + metadata | weighted

In runs 4 and 5 we use metadata (descriptions, title and tags) along with the ASR-transcripts. These runs make use of the BM25 extension known as BM25F(ields) [6],

  w(d, Q) = Σ_{t ∈ Q, f ∈ F} w_f * w(d_f, t).    (4)

Here, F is a set of fields, d_f is the part of document d labeled as field f, and w_f is the weight for field f. In our experiments we use w_f = 1 for the ASR field and w_f = 0.5 for all other fields. Tests on the development set showed that results are not particularly sensitive to the exact value; we used 0.5 since it gave the best results.

In runs 5, we applied a query word filter built using a corpus of 3,400 requests for video made by users on Yahoo! Answers, cf. [3]. In run 5a, we removed from the queries the most frequent words occurring in the corpus (83 terms with frequency over 200 were removed). In run 5b, terms frequent across requests in the corpus were given lower weights. We implemented this downweighting by replacing Eq. 1 by

  idf'(t) = α * log((N - df_t + 0.5) / (df_t + 0.5)) + (1 - α) * log((N_req - req_ft + 0.5) / (req_ft + 0.5)),    (5)

where N_req is the number of requests in the corpus and req_ft is the number of requests in which term t occurs. In the reported runs we set α = 0.5.

3. RESULTS AND CONCLUSION
Our results are reported in Table 2, given in terms of the mean Generalized Reciprocal Rank (mGRR) [4] with tolerance windows of 10, 30 and 60 seconds.

Table 2: Results reported in terms of mGRR
  Run ID | 10   | 30   | 60
  1      | 0.24 | 0.33 | 0.38
  2      | 0.22 | 0.34 | 0.40
  4      | 0.23 | 0.34 | 0.39
  5a     | 0.24 | 0.36 | 0.41
  5b     | 0.28 | 0.39 | 0.45

In general, larger tolerance windows correspond to larger scores. However, whether adding the short query improves performance (cf. run 1 vs. 2) varies depending on the tolerance window used. Note that the statistical significance of this difference remains to be checked.

We can see that filtering or downweighting general query-language words (e.g., video and tv) can indeed improve results. Downweighting has a larger impact, suggesting that general query-language words should not be treated by extending a conventional stop word list for application in video retrieval. No appreciable difference was observed between using ASR transcripts alone and using both ASR transcripts and metadata in the conventional case in which query words are all treated the same (cf. run 2 vs. run 4). Apparently, separate treatment for different types of query words is particularly important to fully exploit the contribution of metadata (cf. run 2 vs. run 5b). In the experiments, we find that adding the query-language downweighting slightly improves the results of very many queries, as long as they already performed reasonably well without downweighting. However, a number of queries fail completely. An investigation of these cases carried out by hand revealed that failure was in most cases due to vocabulary mismatch between query and target item, suggesting that performance would benefit from the use of conventional techniques for query expansion.

Future work will focus on developing more sophisticated models for general-language query words. Additionally, we will attempt to model query words that are 'primary', i.e., more likely to occur in spontaneously produced and/or direct speech and less likely to occur in the descriptive or indirect descriptions of the video in the metadata.

Acknowledgments
The research leading to these results has received funding from the European Commission's 7th Framework Programme (FP7) under grant agreement no. 216444 (EU PetaMedia Network of Excellence).

4. REFERENCES
[1] J. Besser, M. Larson, and K. Hofmann. Podcast search: User goals and retrieval technologies. Online Information Review, 34(3), 2010.
[2] M. Hepple. Independence and commitment: Assumptions for rapid training and execution of rule-based POS taggers. In ACL, 2000.
[3] C. Kofler, M. Larson, and A. Hanjalic. To seek, perchance to fail: Expressions of user needs in internet video search. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, ECIR'11, pages 611–616, Berlin, Heidelberg, 2011. Springer-Verlag.
[4] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy, September 1–2, 2011.
[5] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 1st edition, July 2008.
[6] S. E. Robertson, H. Zaragoza, and M. J. Taylor. Simple BM25 extension to multiple weighted fields. In D. A. Grossman, L. Gravano, C. Zhai, O. Herzog, and D. A. Evans, editors, CIKM, pages 42–49. ACM, 2004.
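The run-5b downweighting of Eq. 5 can be sketched as follows. This is an illustrative toy, not the authors' code: the tiny request corpus below is an invented stand-in for the 3,400 Yahoo! Answers video requests, and α = 0.5 as in the reported runs.

```python
import math

ALPHA = 0.5  # interpolation weight used in the reported runs

def mixed_idf(term, fragments, requests, alpha=ALPHA):
    # Eq. 5: interpolate the collection idf (Eq. 1) with an idf computed
    # over the request corpus. Terms frequent across requests (general
    # query language such as 'video') get a lower weight; terms rare in
    # requests keep most of their weight.
    n, n_req = len(fragments), len(requests)
    df = sum(1 for frag in fragments if term in frag)
    req_f = sum(1 for req in requests if term in req)
    return (alpha * math.log((n - df + 0.5) / (df + 0.5))
            + (1 - alpha) * math.log((n_req - req_f + 0.5) / (req_f + 0.5)))

# Invented example data (not from the paper's collections).
fragments = [["speech", "retrieval"], ["video", "segment"], ["video", "speech"]]
requests = [["video", "episode"], ["video", "show"],
            ["podcast", "audio"], ["video", "clip"]]

# 'video' occurs in 3 of 4 requests, so its weight drops below its plain idf;
# 'retrieval' never occurs in a request, so its weight rises above it.
```

Run 5a corresponds to the harder variant of the same signal: instead of interpolating, terms whose request-corpus frequency exceeds a threshold are simply dropped from the query.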