=Paper=
{{Paper
|id=None
|storemode=property
|title=A Two-step Approach to Video Retrieval based on ASR transcriptions
|pdfUrl=https://ceur-ws.org/Vol-807/Schmidt_CUT_RSR_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SchmidtKHW11
}}
==A Two-step Approach to Video Retrieval based on ASR transcriptions==
Ken Schmidt, Thomas Körner, Stephan Heinich, Thomas Wilhelm

Chemnitz University of Technology, Department of Computer Science, Straße der Nationen 62, 09111 Chemnitz, Germany

{sken, koert, heist, wilt}@hrz.tu-chemnitz.de

===ABSTRACT===
In this paper, we describe our experiments for the Rich Speech Retrieval Task at the MediaEval Benchmark Initiative 2011. We start with a brief overview of the framework we used and its structure. Our experiments indicate that a two-step retrieval approach and the application of a spell checker can improve the quality of retrieval results in the given scenario. Finally, we discuss other techniques that may further improve the quality of the results.

'''Categories and Subject Descriptors:''' H.3.1 [Content Analysis and Indexing] and H.3.3 [Information Search and Retrieval]

'''General Terms:''' Measurement, Experimentation, Languages

'''Keywords:''' Information retrieval, automatic speech recognition, multimedia retrieval

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

===1. INTRODUCTION===
The aim of the Rich Speech Retrieval Task at MediaEval 2011 [1] was to provide jump-in points for videos given a user query. We had 350 hours of internet video material from blip.tv shows, mostly in English. Automatic speech recognition (ASR) transcripts in two versions, one from 2010 and another from 2011, served as basic metadata. We used the transcripts from 2010 in our experiments. Additional metadata, such as tags and shot segmentation, was supplied in the form of XML documents. Along with the shot segmentation, key frames of every scene were also provided.

===2. SYSTEM DESCRIPTION===
For our participation at MediaEval 2011, we based our experiments on the information retrieval framework Xtrieval (eXtensible reTRIeval and EVALuation) [2]. Xtrieval has been developed at Chemnitz University of Technology since 2005. The idea behind this framework was to create a flexible and adjustable framework with state-of-the-art retrieval techniques. Xtrieval provides several Java-based object-oriented APIs for different retrieval tasks. There are four components: indexing, retrieval, evaluation, and the user interface. The first three are the main components; we did not use the UI component.

The Xtrieval framework itself works in the following way. The textual data from various sources is captured by the indexer. The indexing process is based on a data collection concept that abstracts the actual collection under investigation. The indexing itself is done by the Indexer class. The resulting index is used for searching. The topics (or queries), index, and retrieval parts are necessary to run an evaluation experiment. Using the experiment data structure, we can evaluate various indexing and search approaches.

Xtrieval is capable of providing different retrieval APIs, such as Apache Lucene (http://lucene.apache.org/), Terrier (http://terrier.org/), and Lemur (http://www.lemurproject.org/), through a common programming interface. We used Lucene as the retrieval core for the present experiments. There are many possibilities to tune components and parameters in Xtrieval. Here, we opted to weight a part of the query to emphasize its impact on the result.

We were required to submit one specifically configured run and up to four arbitrary runs. The respective data was extracted from the XML documents using XPath. Before indexing, we applied some standard token filters, such as lowercase transformation, Porter stemming, and a standard list of stopwords for English (http://members.unine.ch/jacques.savoy/clef/index.html).
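The paper does not show the corresponding Xtrieval configuration. Purely as an illustration, the following minimal sketch expresses such a token filter chain with current Lucene classes (Lucene being the retrieval core the authors used); the class name and the way the stopword set is supplied are our assumptions.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Hypothetical analyzer mirroring the preprocessing described above:
 * lowercase transformation, English stopword removal, Porter stemming.
 * The stopword set would be loaded from the cited CLEF resource.
 */
public class AsrAnalyzer extends Analyzer {

    private final CharArraySet stopwords;

    public AsrAnalyzer(CharArraySet stopwords) {
        this.stopwords = stopwords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);   // lowercase
        stream = new StopFilter(stream, stopwords);         // stopwords
        stream = new PorterStemFilter(stream);              // stemming
        return new TokenStreamComponents(source, stream);
    }
}
```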
We built several indexes using the development data set. Various system configurations were tested to investigate the specific characteristics of the data and the respective retrieval problem.

The initial idea of preferring a two-step approach over standard retrieval is based on the following thoughts. If the search terms do not all appear together in a single segment, it is hard to determine the most relevant segment. We supposed that the results can be improved by first identifying a possibly relevant document (based on its full transcript) and determining the most relevant segment in a second step. Thus, we created two basic indexes. In the first, we indexed the ASR transcripts and used the documents as identifiers to create a (preliminary) ranking. For the second index, we treated the speech segments as documents in order to create a ranking on segment level.

In the first step of retrieval, we identify the most relevant document, weight it, and add its ID to the query. This modified query is submitted to the second index in order to identify the most relevant segment related to the first document. Here, a weight of 0.0 for the document ID in the modified query means no effect (or modification), i.e. the second step defaults to standard retrieval. In contrast, assigning a weight of 1.0 to the ID results in retrieving only segments of documents that were already identified before.

Because the documents that were retrieved from the first index were not always relevant, we decided to give them only a partial influence on the results of the search in the second index. Obviously, using different weights produced different results, which are reported in more detail in the next section. We decided to use this two-step retrieval approach because we observed better results than with a standard search when experimenting with the development set.
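Since the paper does not show Xtrieval's internals, the following is a minimal sketch of the two-step scheme under two assumptions of ours: current Lucene APIs, and a stored videoId field linking each segment to its source video. The weighted document ID is modeled as a boosted SHOULD clause.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

/** Sketch: rank full transcripts first, then bias the segment search
 *  towards the top-ranked document via a weighted ID clause. */
public class TwoStepRetrieval {

    // docSearcher: index of full ASR transcripts (one entry per video)
    // segSearcher: index of ASR speech segments (one entry per segment)
    public static ScoreDoc bestSegment(IndexSearcher docSearcher,
                                       IndexSearcher segSearcher,
                                       Query userQuery,
                                       float docWeight) throws Exception {
        // Step 1: identify the most relevant document.
        ScoreDoc[] top = docSearcher.search(userQuery, 1).scoreDocs;
        if (top.length == 0) {
            return null;
        }
        String videoId = docSearcher.doc(top[0].doc).get("videoId");

        // Step 2: add the weighted document ID to the query. A boost of
        // 0.0 contributes nothing, i.e. plain segment retrieval; larger
        // boosts pull that document's segments up the ranking. (Strictly
        // excluding other documents, the 1.0 case in the text, would
        // correspond to a MUST clause instead.)
        Query idClause = new BoostQuery(
                new TermQuery(new Term("videoId", videoId)), docWeight);
        Query modified = new BooleanQuery.Builder()
                .add(userQuery, Occur.SHOULD)
                .add(idClause, Occur.SHOULD)
                .build();

        ScoreDoc[] segments = segSearcher.search(modified, 1).scoreDocs;
        return segments.length > 0 ? segments[0] : null;
    }
}
```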
===3. CONFIGURATIONS AND RESULTS===
In this section we report the results that were obtained by varying the parameters of our system configuration. We provide a brief discussion of our results for the official evaluation, along with some observations on additional experiments.

===3.1 Submitted Runs===
We submitted five different configurations of our experiments. For the required run (run1) we indexed the complete ASR information from the XML transcripts of 2010 in the first index. Our second index contained only the segments of speech detected by ASR; we did this for all our runs. In our retrieval experiments we used the title and the short title from the queries. The field act was discarded, because it significantly decreased the result quality in preliminary experiments. The starting point of the identified speech segment was returned as the required jump-in point.

For our arbitrarily configurable runs, we also included the metadata information in the first index. This increased the result quality on the development data. During our experiments we observed that some of the queries contained spelling errors. Since these mistakes were not intended, they reflected a realistic use case. Thus, we decided to filter the queries with a spell checker. We used the Google spell check API (http://code.google.com/p/google-api-spelling-java/), which returned a weighted list of suggested words. The first word from this list was added to the search query. Applying this technique resulted in a slight improvement of quality.
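The Google spell check API has since been discontinued. Purely as an illustration of the same filtering step, the sketch below substitutes Lucene's own SpellChecker, building its dictionary from a field of the transcript index; appending only the first suggestion follows the text above, while the field handling and the term splitting are our assumptions.

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;

/** Query filtering with a spell checker, as in the FC/FCFC runs:
 *  append the first suggestion for every unknown query term. */
public class QueryFilter {

    // Build the spelling dictionary from a field of the transcript index.
    public static SpellChecker build(Directory spellDir, Directory indexDir,
                                     String field) throws Exception {
        SpellChecker checker = new SpellChecker(spellDir);
        try (DirectoryReader reader = DirectoryReader.open(indexDir)) {
            checker.indexDictionary(new LuceneDictionary(reader, field),
                                    new IndexWriterConfig(), true);
        }
        return checker;
    }

    // Expand the raw query with the top suggestion for each term the
    // dictionary does not know (e.g. a misspelled name).
    public static String filter(SpellChecker checker, String rawQuery)
            throws Exception {
        StringBuilder filtered = new StringBuilder(rawQuery);
        for (String term : rawQuery.toLowerCase().split("\\s+")) {
            if (!checker.exist(term)) {
                String[] suggestions = checker.suggestSimilar(term, 1);
                if (suggestions.length > 0) {
                    filtered.append(' ').append(suggestions[0]);
                }
            }
        }
        return filtered.toString();
    }
}
```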
The weighting of the first identified document was varied from 0.1 to 0.9. We recognized that, in the case of a non-relevant document at the top of the ranking, giving a high weight to this document ID in the modified query significantly decreased the result quality. A reason could be that higher weights pushed alternative results down the result list. Our results are listed in detail in the following table.

{| class="wikitable"
|+ Table 1: Experiments and results for the test and development set
! experiment id !! mGAP* (window 10s) !! mGAP* (window 30s) !! mGAP* (window 60s)
|-
! colspan="4" | test set
|-
| run1 0.1 TT ASR || 0.1432 || 0.2420 || 0.3051
|-
| run2 0.1 TT || 0.1432 || 0.2419 || 0.3051
|-
| run3 0.1 FCT || 0.1432 || 0.2420 || 0.3052
|-
| run4 0.6 FCFC || 0.1164 || 0.2039 || 0.2557
|-
| run5 0.8 FCFC || 0.1140 || 0.2004 || 0.2511
|-
! colspan="4" | development set
|-
| run1 0.1 TT ASR || 0.2325 || 0.2886 || 0.3228
|-
| run2 0.1 TT || 0.2926 || 0.3585 || 0.3946
|-
| run3 0.1 FCT || 0.2925 || 0.3620 || 0.4017
|-
| run4 0.6 FCFC || 0.2863 || 0.3609 || 0.4071
|-
| run5 0.8 FCFC || 0.2863 || 0.3609 || 0.4071
|}

*mGAP: mean generalized average precision

Run1 is the required and restricted experiment; hence, it is based on the 2010 ASR transcripts only. Our second run (and also the third, fourth, and fifth) includes additional metadata in the first index. In run2 we searched with the raw (original) query; the weighting for the first document was 0.1. For the third experiment (run3) we filtered the query using the Google spell checker in the first search step (FC); the weighting remained at 0.1. For the last two experiments, run4 and run5, we increased the weighting to 0.6 and 0.8, respectively. Additionally, we used the spell checker twice (FCFC), i.e. for both the document and the segment search step.

Our experiments with the development data achieved much higher mGAP values than the test set results. This may be due to the smaller size of the development set.

===3.2 Additional Experiments===
In addition to the submitted runs, we followed some other approaches. We also tried searching in a single index that contained all available information, but there were no improvements. We further decided to explore the possibilities of text recognition in the key frame pictures. However, the recognition results were very poor because of the moderate image quality; only in some cases could we extract words that were ready to use.

We see some other problems with short queries. Queries that contain, for example, the name "Hillary Clinton" and only a few general terms deliver bad results. This may be due to the fact that the term "Clinton" appears more often alone or in conjunction with "Bill" and refers to the former president of the USA.

===4. CONCLUSIONS AND FUTURE WORK===
Our tests showed that a two-step retrieval approach works well for the present scenario, but only if the first identified document is not overrated, so that alternative solutions are not excluded. A spell checker works well in cases of misspelled names. If there are several variants of spelling a name, like Denis (or also Dennis), the spell checker adds the most common notation. Suppose the ASR system returns a misspelled name (Denis vs. Dennis): the user knows the person's name is spelled with two "n", but it was transcribed with only one. The spell checker then adds "Denis" to the query, and the document can be found by the user.

Future work could focus on improving the quality of the first identified document. Another possibility is to take the various languages of the documents into account. We disregarded this because there are only few videos in languages other than English. Furthermore, there are prospects in working with the shot segmentation: combining the shot time with the speech segment time to improve the jump-in point. Another disregarded option is to make use of the provided tags. It might be possible to categorize the query in such a way that the search returns only videos with tags fitting the query.

===5. ACKNOWLEDGMENTS===
This publication was prepared as a part of the research initiative sachsMedia (http://sachsmedia.tv), which is funded by the German Federal Ministry of Education and Research under grant reference number 03IP608. The authors take sole responsibility for the contents of this publication.

===6. REFERENCES===
[1] Martha Larson, Maria Eskevich, Roeland Ordelman, Christoph Kofler, Sebastian Schmiedeke, Gareth J. F. Jones: Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. MediaEval 2011 Workshop.

[2] Jens Kürsten, Thomas Wilhelm, Maximilian Eibl: Extensible Retrieval and Evaluation Framework: Xtrieval. In: Proceedings of the LWA – Workshop FGIR, p. 107-110.