=Paper=
{{Paper
|id=None
|storemode=property
|title=A Two-step Approach to Video Retrieval based on ASR transcriptions
|pdfUrl=https://ceur-ws.org/Vol-807/Schmidt_CUT_RSR_me11wn.pdf
|volume=Vol-807
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SchmidtKHW11
}}
==A Two-step Approach to Video Retrieval based on ASR transcriptions==
Ken Schmidt, Thomas Körner, Stephan Heinich, Thomas Wilhelm

Chemnitz University of Technology, Department of Computer Science, Straße der Nationen 62, 09111 Chemnitz, Germany

{sken, koert, heist, wilt}@hrz.tu-chemnitz.de

===ABSTRACT===
In this paper, we describe our experiments for the Rich Speech Retrieval Task at the MediaEval Benchmark Initiative 2011. We start with a brief overview of the framework we used and its structure. Our experiments indicate that a two-step retrieval approach and the application of a spell checker can improve the quality of retrieval results in the given scenario. Finally, we discuss other techniques that may further improve the quality of the results.

'''Categories and Subject Descriptors:''' H.3.1 [Content Analysis and Indexing] and H.3.3 [Information Search and Retrieval]

'''General Terms:''' Measurement, Experimentation, Languages

'''Keywords:''' Information retrieval, automatic speech recognition, multimedia retrieval

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

===1. INTRODUCTION===
The aim of the Rich Speech Retrieval Task at MediaEval 2011 [1] was to provide jump-in points for videos given a user query. We had 350 hours of internet video material from blip.tv shows, mostly in English. Automatic speech recognition (ASR) transcripts in two versions, one from 2010 and another from 2011, served as basic metadata. We used the transcripts from 2010 in our experiments. Additional metadata, such as tags and shot segmentation, was supplied in the form of XML documents. Along with the shot segmentation, key frames of every scene were also provided.

===2. SYSTEM DESCRIPTION===
For our participation at MediaEval 2011, we based our experiments on the information retrieval framework Xtrieval (eXtensible reTRIeval and EVALuation) [2]. Xtrieval has been developed at Chemnitz University of Technology since 2005. The idea behind this framework was to create a flexible and adjustable framework with state-of-the-art retrieval techniques. Xtrieval provides several Java-based object-oriented APIs for different retrieval tasks. There are four components: indexing, retrieval, evaluation, and the user interface. The first three are the main components; we did not use the UI component.

The Xtrieval framework itself works in the following way. The textual data from various sources is captured by the indexer. The indexing process is based on a data collection concept that abstracts the actual collection under investigation. The indexing itself is done by the Indexer class. The resulting index is used for searching. The topics (or queries), index, and retrieval parts are necessary to run an evaluation experiment. Using the experiment data structure, we can evaluate various indexing and search approaches.

Xtrieval is capable of providing different retrieval APIs, such as Apache Lucene (http://lucene.apache.org/), Terrier (http://terrier.org/), and Lemur (http://www.lemurproject.org/), through a common programming interface. We used Lucene as the retrieval core for the present experiments. There are many possibilities to tune components and parameters in Xtrieval. Here, we opted to weight a part of the query to emphasize its impact on the result.

We were required to submit one specifically configured run and up to four arbitrary runs. The respective data was extracted from the XML documents using XPath. Before indexing, we applied some standard token filters, such as lowercase transformation, Porter stemming, and a standard list of stopwords for English (http://members.unine.ch/jacques.savoy/clef/index.html).
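The paper does not show the corresponding Xtrieval configuration. Purely as an illustration, the following minimal sketch expresses such a token filter chain with current Lucene classes (Lucene being the retrieval core the authors used); the class name and the way the stopword set is supplied are our assumptions.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Hypothetical analyzer mirroring the preprocessing described above:
 * lowercase transformation, English stopword removal, Porter stemming.
 * The stopword set would be loaded from the cited CLEF resource.
 */
public class AsrAnalyzer extends Analyzer {

    private final CharArraySet stopwords;

    public AsrAnalyzer(CharArraySet stopwords) {
        this.stopwords = stopwords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);   // lowercase
        stream = new StopFilter(stream, stopwords);         // stopwords
        stream = new PorterStemFilter(stream);              // stemming
        return new TokenStreamComponents(source, stream);
    }
}
```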
We built several indexes using the development data set. Various system configurations were tested to investigate the specific characteristics of the data and the respective retrieval problem.

The initial idea of preferring a two-step approach over standard retrieval is based on the following thoughts. If the search terms do not all appear together in a single segment, it is hard to determine the most relevant segment. We supposed that the results can be improved by first identifying a possibly relevant document (based on its full transcript) and determining the most relevant segment in a second step. Thus, we created two basic indexes. In the first, we indexed the ASR transcripts and used the documents as identifiers to create a (preliminary) ranking. For the second index, we treated the speech segments as documents in order to create a ranking on segment level.

In the first step of retrieval, we identify the most relevant document, weight it, and add its ID to the query. This modified query is submitted to the second index in order to identify the most relevant segment related to the first document. Here, a weight of 0.0 for the document ID in the modified query means no effect (or modification), i.e. the second step defaults to standard retrieval. In contrast, assigning a weight of 1.0 to the ID results in retrieving only segments of documents that were already identified before.

Because the documents that were retrieved from the first index were not always relevant, we decided to give them only a partial influence on the results of the search in the second index. Obviously, using different weights produced different results, which are reported in more detail in the next section. We decided to use this two-step retrieval approach because we observed better results than with a standard search when experimenting with the development set.
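Since the paper does not show Xtrieval's internals, the following is a minimal sketch of the two-step scheme under two assumptions of ours: current Lucene APIs, and a stored videoId field linking each segment to its source video. The weighted document ID is modeled as a boosted SHOULD clause.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

/** Sketch: rank full transcripts first, then bias the segment search
 *  towards the top-ranked document via a weighted ID clause. */
public class TwoStepRetrieval {

    // docSearcher: index of full ASR transcripts (one entry per video)
    // segSearcher: index of ASR speech segments (one entry per segment)
    public static ScoreDoc bestSegment(IndexSearcher docSearcher,
                                       IndexSearcher segSearcher,
                                       Query userQuery,
                                       float docWeight) throws Exception {
        // Step 1: identify the most relevant document.
        ScoreDoc[] top = docSearcher.search(userQuery, 1).scoreDocs;
        if (top.length == 0) {
            return null;
        }
        String videoId = docSearcher.doc(top[0].doc).get("videoId");

        // Step 2: add the weighted document ID to the query. A boost of
        // 0.0 contributes nothing, i.e. plain segment retrieval; larger
        // boosts pull that document's segments up the ranking. (Strictly
        // excluding other documents, the 1.0 case in the text, would
        // correspond to a MUST clause instead.)
        Query idClause = new BoostQuery(
                new TermQuery(new Term("videoId", videoId)), docWeight);
        Query modified = new BooleanQuery.Builder()
                .add(userQuery, Occur.SHOULD)
                .add(idClause, Occur.SHOULD)
                .build();

        ScoreDoc[] segments = segSearcher.search(modified, 1).scoreDocs;
        return segments.length > 0 ? segments[0] : null;
    }
}
```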
===3. CONFIGURATIONS AND RESULTS===
In this section we report the results that were obtained by varying the parameters of our system configuration. We provide a brief discussion of our results for the official evaluation, along with some observations on additional experiments.

===3.1 Submitted Runs===
We submitted five different configurations of our experiments. For the required run (run1) we indexed the complete ASR information from the XML transcripts of 2010 in the first index. Our second index contained only the segments of speech detected by ASR; we did this for all our runs. In our retrieval experiments we used the title and the short title from the queries. The field act was discarded, because it significantly decreased the result quality in preliminary experiments. The starting point of the identified speech segment was returned as the required jump-in point.

For our arbitrarily configurable runs, we also included the metadata information in the first index. This increased the result quality on the development data. During our experiments we observed that some of the queries contained spelling errors. Since these mistakes were not intended, they reflected a realistic use case. Thus, we decided to filter the queries with a spell checker. We used the Google spell check API (http://code.google.com/p/google-api-spelling-java/), which returned a weighted list of suggested words. The first word from this list was added to the search query. Applying this technique resulted in a slight improvement of quality.
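The Google spell check API has since been discontinued. Purely as an illustration of the same filtering step, the sketch below substitutes Lucene's own SpellChecker, building its dictionary from a field of the transcript index; appending only the first suggestion follows the text above, while the field handling and the term splitting are our assumptions.

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;

/** Query filtering with a spell checker, as in the FC/FCFC runs:
 *  append the first suggestion for every unknown query term. */
public class QueryFilter {

    // Build the spelling dictionary from a field of the transcript index.
    public static SpellChecker build(Directory spellDir, Directory indexDir,
                                     String field) throws Exception {
        SpellChecker checker = new SpellChecker(spellDir);
        try (DirectoryReader reader = DirectoryReader.open(indexDir)) {
            checker.indexDictionary(new LuceneDictionary(reader, field),
                                    new IndexWriterConfig(), true);
        }
        return checker;
    }

    // Expand the raw query with the top suggestion for each term the
    // dictionary does not know (e.g. a misspelled name).
    public static String filter(SpellChecker checker, String rawQuery)
            throws Exception {
        StringBuilder filtered = new StringBuilder(rawQuery);
        for (String term : rawQuery.toLowerCase().split("\\s+")) {
            if (!checker.exist(term)) {
                String[] suggestions = checker.suggestSimilar(term, 1);
                if (suggestions.length > 0) {
                    filtered.append(' ').append(suggestions[0]);
                }
            }
        }
        return filtered.toString();
    }
}
```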
The weighting of the first identified document was varied from 0.1 to 0.9. We recognized that, in the case of a non-relevant document at the top of the ranking, giving a high weight to this document ID in the modified query significantly decreased the result quality. A reason could be that higher weights pushed alternative results down the result list. Our results are listed in detail in the following table.

{| class="wikitable"
|+ Table 1: Experiments and results for the test and development set
! experiment id !! mGAP* (window 10s) !! mGAP* (window 30s) !! mGAP* (window 60s)
|-
! colspan="4" | test set
|-
| run1 0.1 TT ASR || 0.1432 || 0.2420 || 0.3051
|-
| run2 0.1 TT || 0.1432 || 0.2419 || 0.3051
|-
| run3 0.1 FCT || 0.1432 || 0.2420 || 0.3052
|-
| run4 0.6 FCFC || 0.1164 || 0.2039 || 0.2557
|-
| run5 0.8 FCFC || 0.1140 || 0.2004 || 0.2511
|-
! colspan="4" | development set
|-
| run1 0.1 TT ASR || 0.2325 || 0.2886 || 0.3228
|-
| run2 0.1 TT || 0.2926 || 0.3585 || 0.3946
|-
| run3 0.1 FCT || 0.2925 || 0.3620 || 0.4017
|-
| run4 0.6 FCFC || 0.2863 || 0.3609 || 0.4071
|-
| run5 0.8 FCFC || 0.2863 || 0.3609 || 0.4071
|}

*mGAP: mean generalized average precision

Run1 is the required and restricted experiment; hence, it is based on the 2010 ASR transcripts only. Our second run (and also the third, fourth, and fifth) includes additional metadata in the first index. In run2 we searched with the raw (original) query; the weighting for the first document was 0.1. For the third experiment (run3) we filtered the query using the Google spell checker in the first search step (FC); the weighting remained at 0.1. For the last two experiments, run4 and run5, we increased the weighting to 0.6 and 0.8, respectively. Additionally, we used the spell checker twice (FCFC), i.e. for both the document and the segment search step.

Our experiments with the development data achieved much higher mGAP values than the test set results. This may be due to the smaller size of the development set.

===3.2 Additional Experiments===
In addition to the submitted runs, we followed some other approaches. We also tried searching in a single index that contained all available information, but there were no improvements. We further decided to explore the possibilities of text recognition in the key frame pictures. However, the recognition results were very poor because of the moderate image quality; only in some cases could we extract words that were ready to use.

We see some other problems with short queries. Queries that contain, for example, the name "Hillary Clinton" and only a few general terms deliver bad results. This may be due to the fact that the term "Clinton" appears more often alone or in conjunction with "Bill" and refers to the former president of the USA.

===4. CONCLUSIONS AND FUTURE WORK===
Our tests showed that a two-step retrieval approach works well for the present scenario, but only if the first identified document is not overrated, so that alternative solutions are not excluded. A spell checker works well in cases of misspelled names. If there are several variants of spelling a name, like Denis (or also Dennis), the spell checker adds the most common notation. Suppose the ASR system returns a misspelled name (Denis vs. Dennis): the user knows the person's name is spelled with two "n", but it was transcribed with only one. The spell checker then adds "Denis" to the query, and the document can be found by the user.

Future work could focus on improving the quality of the first identified document. Another possibility is to take the various languages of the documents into account. We disregarded this because there are only few videos in languages other than English. Furthermore, there are prospects in working with the shot segmentation: combining the shot time with the speech segment time to improve the jump-in point. Another disregarded option is to make use of the provided tags. It might be possible to categorize the query in such a way that the search returns only videos with tags fitting the query.

===5. ACKNOWLEDGMENTS===
This publication was prepared as a part of the research initiative sachsMedia (http://sachsmedia.tv), which is funded by the German Federal Ministry of Education and Research under grant reference number 03IP608. The authors take sole responsibility for the contents of this publication.

===6. REFERENCES===
[1] Martha Larson, Maria Eskevich, Roeland Ordelman, Christoph Kofler, Sebastian Schmiedeke, Gareth J. F. Jones: Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. MediaEval 2011 Workshop.

[2] Jens Kürsten, Thomas Wilhelm, Maximilian Eibl: Extensible Retrieval and Evaluation Framework: Xtrieval. In: Proceedings of the LWA – Workshop FGIR, p. 107-110.