LACS System Analysis on Retrieval Models for the MediaEval 2014 Search and Hyperlinking Task

Justin Chiu
Language Technologies Institute
School of Computer Science
Carnegie Mellon University, Pittsburgh, USA
Jchiu1@andrew.cmu.edu

Alexander Rudnicky
Language Technologies Institute
School of Computer Science
Carnegie Mellon University, Pittsburgh, USA
Alex.Rudnicky@cs.cmu.edu

ABSTRACT
We describe the LACS submission to the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Our experiments investigate how different retrieval models interact with word stemming and stopword removal. On the development data, we segment the subtitles and Automatic Speech Recognition (ASR) transcripts into fixed-length time units and examine the effect of different retrieval models. We find that stemming provides consistent improvement, while stopword removal is more sensitive to the choice of retrieval model on the subtitles. Neither manipulation yields stable improvement on the ASR transcripts. Our experiments on the test data focus on the subtitles, where the performance gap between retrieval models is much smaller than on the development data. We achieved 0.477 MAP on the test data.

1. INTRODUCTION
The amount and variety of multimedia data available online are rapidly increasing. As a result, techniques for identifying content relevant to a query need to improve in order to process large multimedia collections effectively. Existing work exploits multiple modalities for multimedia retrieval [7]; the ASR transcript is one such modality, which makes the problem similar to the Speech Retrieval framework. We believe, however, that there is more to be discovered on the Speech Retrieval side, especially the interaction between retrieval models and the quality of ASR transcripts.

Established retrieval models are commonly used for text retrieval, and applying them to ASR transcripts is a standard approach in Speech Retrieval. However, there are fundamental differences between text documents and spoken documents, and different retrieval models may have characteristics that are beneficial, or harmful, for retrieval performance. Specifically, we examine word stemming and stopword removal, two techniques that have been shown to help in text retrieval. Can these techniques also help in speech retrieval? This question is the basis for our experiments.

We carried out two sets of experiments on the development data to examine the difference between subtitles and ASR transcripts. Each set investigates the effectiveness of different retrieval models and processing techniques. Due to time constraints, we only submitted runs on the subtitle test data. We find that the performance gap observed on the development data does not show up on the test data.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

2. EXPERIMENTAL SETUP
The MediaEval 2014 Search and Hyperlinking task [4] uses television broadcast data provided by the BBC, together with subtitles. We also tested on the ASR transcription provided by LIMSI [6] to investigate how retrieval models and techniques interact with a different type of data. In all of the following experiments, the transcription is first segmented into smaller units of fixed length (60 seconds) following the method presented in [5]. We tested a stopword list from the Indri toolkit that contains 418 common English words, and we used the Krovetz word stemming algorithm [8]. Finally, we tested three different retrieval algorithms: a unigram language-modeling algorithm (LM) [3], the Okapi retrieval algorithm (Okapi) [9], and a dot-product function using TF-IDF weighting (TF-IDF) [10].

3. EXPERIMENTS ON DEV DATA
We first present our results on the development (dev) data, reporting Mean Reciprocal Rank (MRR). The dev experiment is known-item retrieval. The parameters for the Okapi retrieval model are k1 = 1.2, b = 0.75, and k3 = 7, and the µ for LM is 2500.

Table 1. MRR on subtitles for dev data

            LM      Okapi   TF-IDF
Baseline    0.265   0.279   0.296
Stopword    0.278   0.285   0.300
Stemming    0.295   0.344   0.355
Both        0.310   0.341   0.368

Table 2. MRR on ASR transcript (LIMSI) for dev data

            LM      Okapi   TF-IDF
Baseline    0.187   0.180   0.173
Stopword    0.167   0.175   0.160
Stemming    0.158   0.162   0.183
Both        0.157   0.177   0.183

From Tables 1 and 2 we can observe the interaction between the different processing techniques and retrieval models. Stemming and stopword removal provide consistent improvement on the subtitles; on the ASR transcript, their effect is unstable. Aside from the difference due to recognition errors, one possible contributing factor is vocabulary size: the subtitle vocabulary contains 251,506 words, while the ASR transcription's contains 83,094, one third of the subtitle vocabulary. The smaller vocabulary, combined with stemming or stopword removal, can remove many words from the transcript and thereby harm retrieval.

Another phenomenon we observed is a significant performance gap between retrieval models: TF-IDF outperforms the LM and Okapi retrieval models, which was unexpected. Since the dev task is known-item retrieval (for each query, there is only one matching speech segment), we suspect the dev data may have some bias in favor of the TF-IDF retrieval model. Another possible factor in TF-IDF's superior performance is smoothing: both the LM and Okapi retrieval models rely on smoothing parameters, whereas TF-IDF uses none. If the data contain a good number of exact matches between queries and documents, TF-IDF may outperform the other retrieval models precisely because of this absence of smoothing.
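For reference, the three scoring functions compared above can be written as follows. These are standard textbook formulations, not taken from the original submission; the exact parameterizations in the Lemur/Indri implementations may differ slightly. Here c(w, X) is the count of word w in query or document X, |D| the document length, P(w|C) the collection language model, N the number of documents, df_w the document frequency of w, and avgdl the average document length.

```latex
% Unigram language model with Dirichlet smoothing (parameter \mu):
\log P(Q \mid D) = \sum_{w \in Q} c(w,Q)\,
  \log \frac{c(w,D) + \mu\, P(w \mid C)}{|D| + \mu}

% Okapi BM25 (parameters k_1, b, k_3):
\mathrm{BM25}(Q,D) = \sum_{w \in Q}
  \log \frac{N - df_w + 0.5}{df_w + 0.5}
  \cdot \frac{(k_1 + 1)\, c(w,D)}
             {k_1\left((1-b) + b\,\frac{|D|}{\mathrm{avgdl}}\right) + c(w,D)}
  \cdot \frac{(k_3 + 1)\, c(w,Q)}{k_3 + c(w,Q)}

% TF-IDF dot product (a simple variant; Lemur's TF-IDF model
% uses a BM25-style term-frequency component):
\mathrm{TFIDF}(Q,D) = \sum_{w \in Q} c(w,Q)\, c(w,D)\, \mathrm{idf}(w)^2,
\quad \mathrm{idf}(w) = \log \frac{N}{df_w}
```

The µ = 2500 and k1 = 1.2, b = 0.75, k3 = 7 settings above correspond to the \mu, k_1, b, and k_3 parameters in these formulas.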
4. EXPERIMENTS ON TEST DATA
The experiments on the test data are an ad-hoc retrieval task, which is no longer restricted to one result per query. Due to time constraints, we only submitted systems based on the subtitle data. Our submissions use both word stemming and stopword removal, as this setup gave the most promising results on the dev data. Results on the test data are shown in Table 3.

Table 3. Results on test data

        LM      Okapi   TF-IDF
MAP     0.470   0.473   0.477
P@5     0.767   0.720   0.747
P@10    0.677   0.683   0.673
P@20    0.560   0.575   0.578

The performance gap between retrieval models is much smaller compared to the dev data, yet the trend is the same: TF-IDF gives the best performance. We suspect that the absence of smoothing again contributes to, and can explain, the superior performance of TF-IDF. In a regular retrieval task, TF-IDF is not expected to outperform Okapi and LM consistently.

While running the experiments on the test data, we noticed a difference between the dev and test queries: the dev queries contain more words than the test queries. Originally we thought this might be a factor affecting the performance of the different retrieval models, but it does not appear to be an issue. Still, we suggest that the characteristics of the queries in the dev and test data be made more consistent, so that the datasets are better matched.
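The evaluation metrics reported above (MRR on the dev data; MAP and P@n on the test data) can be sketched as follows. This is an illustrative implementation assuming binary relevance judgments and hypothetical document-ID lists, not the official task scoring tool.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    # 1/rank of the first relevant result, or 0 if none is retrieved.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_n(ranked_ids, relevant_ids, n):
    # Fraction of the top-n results that are relevant (P@n).
    return sum(1 for d in ranked_ids[:n] if d in relevant_ids) / n

def average_precision(ranked_ids, relevant_ids):
    # Mean of precision@k over the ranks k where a relevant item appears.
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_over_queries(metric, runs):
    # runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    # Averaging reciprocal_rank gives MRR; averaging
    # average_precision gives MAP.
    return sum(metric(r, rel) for r, rel in runs) / len(runs)
```

For known-item retrieval, as on the dev data, each `relevant_ids` set contains exactly one segment, in which case average precision and reciprocal rank coincide.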
5. ANALYSIS
We find that the TF-IDF retrieval model is the best of the three models tested, and we believe this is because it does not smooth. Generally speaking, however, smoothing can provide significant improvement on standard retrieval tasks. We conducted experiments with the LM retrieval model without smoothing; the resulting MAP on the dev data is less than 0.05. We can therefore only assume that the TF-IDF retrieval model happens to handle absent query words in a way that suits our data. A possible reason for the performance gap on the dev data is query length: TF-IDF, which relies on exact word matching, is stronger than the other approaches there, while the test data has much shorter queries, so the gap is not as large as on the dev data.

Research in the Spoken Term Detection community suggests using context [1] or retrieval system fusion [2] to improve retrieval performance. We did not complete our experiments on using context, but we tried fusion approaches with our three retrieval models. The fused system usually performs between the two systems being fused. We conjecture that the three retrieval models used in this work are generally similar to each other, and fusion does not help due to this lack of complementarity.

6. CONCLUSION
We examined how different retrieval models interact with text processing techniques such as word stemming and stopword removal on subtitles and ASR transcripts, two different forms of the dev data. We find that stemming and stopword removal provide consistent improvement on the subtitle data, but not on the ASR transcript, where these steps mostly harm performance, except for stemming with the TF-IDF retrieval model. The results on the test data show that the differences between retrieval methods are less significant when the retrieval task contains more possible targets. TF-IDF still has the best performance, which we believe is due to its absence of smoothing.

7. REFERENCES
[1] J. Chiu and A. Rudnicky. Using Conversational Word Burst in Spoken Term Detection. In Proc. of Interspeech 2013, Lyon, France, 2013.
[2] J. Chiu, Y. Wang, J. Trmal, D. Povey, G. Chen, and A. Rudnicky. Combination of FST and CN Search in Spoken Term Detection. In Proc. of Interspeech 2014, Singapore, 2014.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. 1991.
[4] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking Task at MediaEval 2014. In Proc. of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, 2014.
[5] M. Eskevich and G. J. F. Jones. Time-based Segmentation and Use of Jump-in Points in DCU Search Runs at the Search and Hyperlinking Task at MediaEval 2013. In Proc. of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, 2013.
[6] J. L. Gauvain, L. Lamel, and G. Adda. The LIMSI broadcast news transcription system. Speech Communication 37, pages 89-108, 2002.
[7] L. Jiang, T. Mitamura, S.-I. Yu, and A. G. Hauptmann. Zero-Example Event Search using MultiModal Pseudo Relevance Feedback. In Proc. of the International Conference on Multimedia Retrieval, page 297. ACM, Glasgow, UK, 2014.
[8] R. Krovetz. Viewing morphology as an inference process. In Proc. of SIGIR'93, pages 191-202, Pittsburgh, USA, 1993.
[9] S. Walker, S. E. Robertson, M. Boughanem, G. J. F. Jones, and K. Sparck Jones. Okapi at TREC-6: Automatic ad hoc, VLC, routing, filtering and QSDR. In Proc. of the Text REtrieval Conference (TREC-6), pages 125-136, 1998.
[10] C. Zhai. Notes on the Lemur TFIDF model. http://www.cs.cmu.edu/~lemur/tfidf.ps