LACS System Analysis on Retrieval Models for the MediaEval 2014 Search and Hyperlinking Task

Justin Chiu
Language Technologies Institute
School of Computer Science
Carnegie Mellon University, Pittsburgh, USA
Jchiu1@andrew.cmu.edu

Alexander Rudnicky
Language Technologies Institute
School of Computer Science
Carnegie Mellon University, Pittsburgh, USA
Alex.Rudnicky@cs.cmu.edu

ABSTRACT
We describe the LACS submission to the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Our experiments investigate how different retrieval models interact with word stemming and stopword removal. On the development data, we segment the subtitles and Automatic Speech Recognition (ASR) transcripts into fixed-length time units and examine the effect of different retrieval models. We find that stemming provides consistent improvement, while stopword removal is more sensitive to the choice of retrieval model on the subtitles. Neither manipulation yields stable improvement on the ASR transcripts. Our experiments on the test data focus on the subtitles, where the performance gap between retrieval models is much smaller than on the development data. We achieved 0.477 MAP on the test data.

1. INTRODUCTION
The amount and variety of multimedia data available online are rapidly increasing. As a result, techniques for identifying content relevant to a query need to improve in order to process large multimedia collections effectively. Existing work exploits multiple modalities for multimedia retrieval [7]; the ASR transcript is one such modality, which makes the problem similar to the Speech Retrieval framework. We believe, however, that there is more to be discovered on the Speech Retrieval side, especially the interaction between retrieval models and the quality of ASR transcripts.

Established retrieval models are commonly used for text retrieval, and applying them to ASR transcripts is a standard approach in Speech Retrieval. However, there are fundamental differences between text documents and spoken documents, and different retrieval models may have characteristics that are beneficial, or harmful, for retrieval performance. Specifically, we examine word stemming and stopword removal, two techniques that have been shown to help in text retrieval. Can these techniques also help in speech retrieval? This question is the basis for our experiments.

We carried out two sets of experiments on the development data to examine the difference between subtitles and ASR transcripts. Each set investigates the effectiveness of different retrieval models and processing techniques. Due to time constraints, we only submitted runs on the subtitle test data. We find that the performance gap observed on the development data does not show up on the test data.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

2. EXPERIMENTAL SETUP
The MediaEval 2014 Search and Hyperlinking task [4] uses television broadcast data provided by the BBC, together with subtitles. We also tested on the ASR transcription provided by LIMSI [6] to investigate how retrieval models and techniques interact with a different type of data. In all of the following experiments, the transcription is first segmented into smaller units of fixed length (60 seconds) following the method presented in [5]. We tested a stopword list from the Indri toolkit that contains 418 common English words, and we used the Krovetz word stemming algorithm [8]. Finally, we tested three different retrieval algorithms: a unigram language-modeling algorithm (LM) [3], the Okapi retrieval algorithm (Okapi) [9], and a dot-product function using TF-IDF weighting (TF-IDF) [10].

3. EXPERIMENTS ON DEV DATA
We first present our results on the development (dev) data, reporting Mean Reciprocal Rank (MRR). The dev experiment is known-item retrieval. The parameters for the Okapi retrieval model are k1 = 1.2, b = 0.75, and k3 = 7, and the µ for LM is 2500.

Table 1. MRR on subtitles for dev data

            LM      Okapi   TF-IDF
Baseline    0.265   0.279   0.296
Stopword    0.278   0.285   0.300
Stemming    0.295   0.344   0.355
Both        0.310   0.341   0.368

Table 2. MRR on ASR transcript (LIMSI) for dev data

            LM      Okapi   TF-IDF
Baseline    0.187   0.180   0.173
Stopword    0.167   0.175   0.160
Stemming    0.158   0.162   0.183
Both        0.157   0.177   0.183

From Tables 1 and 2 we can observe the interaction between the different processing techniques and retrieval models. Stemming and stopword removal provide consistent improvement on the subtitles; on the ASR transcript, their effect is unstable. Aside from the difference due to recognition errors, one possible contributing factor is vocabulary size: the subtitle vocabulary contains 251,506 words, while the ASR transcription's contains 83,094, one third of the subtitle vocabulary. The smaller vocabulary, combined with stemming or stopword removal, can remove many words from the transcript and thereby harm retrieval.

Another phenomenon we observed is a significant performance gap between retrieval models: TF-IDF outperforms the LM and Okapi retrieval models, which was unexpected. Since the dev task is known-item retrieval (for each query, there is only one matching speech segment), we suspect the dev data may have some bias in favor of the TF-IDF retrieval model. Another possible factor in TF-IDF's superior performance is smoothing: both the LM and Okapi retrieval models rely on smoothing parameters, whereas TF-IDF uses none. If the data contain a good number of exact matches between queries and documents, TF-IDF may outperform the other retrieval models precisely because of this absence of smoothing.
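For reference, the three scoring functions compared above can be written as follows. These are standard textbook formulations, not taken from the original submission; the exact parameterizations in the Lemur/Indri implementations may differ slightly. Here c(w, X) is the count of word w in query or document X, |D| the document length, P(w|C) the collection language model, N the number of documents, df_w the document frequency of w, and avgdl the average document length.

```latex
% Unigram language model with Dirichlet smoothing (parameter \mu):
\log P(Q \mid D) = \sum_{w \in Q} c(w,Q)\,
  \log \frac{c(w,D) + \mu\, P(w \mid C)}{|D| + \mu}

% Okapi BM25 (parameters k_1, b, k_3):
\mathrm{BM25}(Q,D) = \sum_{w \in Q}
  \log \frac{N - df_w + 0.5}{df_w + 0.5}
  \cdot \frac{(k_1 + 1)\, c(w,D)}
             {k_1\left((1-b) + b\,\frac{|D|}{\mathrm{avgdl}}\right) + c(w,D)}
  \cdot \frac{(k_3 + 1)\, c(w,Q)}{k_3 + c(w,Q)}

% TF-IDF dot product (a simple variant; Lemur's TF-IDF model
% uses a BM25-style term-frequency component):
\mathrm{TFIDF}(Q,D) = \sum_{w \in Q} c(w,Q)\, c(w,D)\, \mathrm{idf}(w)^2,
\quad \mathrm{idf}(w) = \log \frac{N}{df_w}
```

The µ = 2500 and k1 = 1.2, b = 0.75, k3 = 7 settings above correspond to the \mu, k_1, b, and k_3 parameters in these formulas.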
4. EXPERIMENTS ON TEST DATA
The experiments on the test data are an ad-hoc retrieval task, which is no longer restricted to one result per query. Due to time constraints, we only submitted systems based on the subtitle data. Our submissions use both word stemming and stopword removal, as this setup gave the most promising results on the dev data. Results on the test data are shown in Table 3.

Table 3. Results on test data

        LM      Okapi   TF-IDF
MAP     0.470   0.473   0.477
P@5     0.767   0.720   0.747
P@10    0.677   0.683   0.673
P@20    0.560   0.575   0.578

The performance gap between retrieval models is much smaller compared to the dev data, yet the trend is the same: TF-IDF gives the best performance. We suspect that the absence of smoothing again contributes to, and can explain, the superior performance of TF-IDF. In a regular retrieval task, TF-IDF is not expected to outperform Okapi and LM consistently.

While running the experiments on the test data, we noticed a difference between the dev and test queries: the dev queries contain more words than the test queries. Originally we thought this might be a factor affecting the performance of the different retrieval models, but it does not appear to be an issue. Still, we suggest that the characteristics of the queries in the dev and test data be made more consistent, so that the datasets are better matched.
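The evaluation metrics reported above (MRR on the dev data; MAP and P@n on the test data) can be sketched as follows. This is an illustrative implementation assuming binary relevance judgments and hypothetical document-ID lists, not the official task scoring tool.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    # 1/rank of the first relevant result, or 0 if none is retrieved.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_n(ranked_ids, relevant_ids, n):
    # Fraction of the top-n results that are relevant (P@n).
    return sum(1 for d in ranked_ids[:n] if d in relevant_ids) / n

def average_precision(ranked_ids, relevant_ids):
    # Mean of precision@k over the ranks k where a relevant item appears.
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_over_queries(metric, runs):
    # runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    # Averaging reciprocal_rank gives MRR; averaging
    # average_precision gives MAP.
    return sum(metric(r, rel) for r, rel in runs) / len(runs)
```

For known-item retrieval, as on the dev data, each `relevant_ids` set contains exactly one segment, in which case average precision and reciprocal rank coincide.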
5. ANALYSIS
We find that the TF-IDF retrieval model is the best of the three models tested, and we believe this is because it does not smooth. Generally speaking, however, smoothing can provide significant improvement on standard retrieval tasks. We conducted experiments with the LM retrieval model without smoothing; the resulting MAP on the dev data is less than 0.05. We can therefore only assume that the TF-IDF retrieval model happens to handle absent query words in a way that suits our data. A possible reason for the performance gap on the dev data is query length: TF-IDF, which relies on exact word matching, is stronger than the other approaches there, while the test data has much shorter queries, so the gap is not as large as on the dev data.

Research in the Spoken Term Detection community suggests using context [1] or retrieval system fusion [2] to improve retrieval performance. We did not complete our experiments on using context, but we tried fusion approaches with our three retrieval models. The fused system usually performs between the two systems being fused. We conjecture that the three retrieval models used in this work are generally similar to each other, and fusion does not help due to this lack of complementarity.

6. CONCLUSION
We examined how different retrieval models interact with text processing techniques such as word stemming and stopword removal on subtitles and ASR transcripts, two different forms of the dev data. We find that stemming and stopword removal provide consistent improvement on the subtitle data, but not on the ASR transcript, where these steps mostly harm performance, except for stemming with the TF-IDF retrieval model. The results on the test data show that the differences between retrieval methods are less significant when the retrieval task contains more possible targets. TF-IDF still has the best performance, which we believe is due to its absence of smoothing.

7. REFERENCES
[1] J. Chiu and A. Rudnicky. Using Conversational Word Burst in Spoken Term Detection. In Proc. of Interspeech 2013, Lyon, France, 2013.
[2] J. Chiu, Y. Wang, J. Trmal, D. Povey, G. Chen, and A. Rudnicky. Combination of FST and CN Search in Spoken Term Detection. In Proc. of Interspeech 2014, Singapore, 2014.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. 1991.
[4] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking Task at MediaEval 2014. In Proc. of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, 2014.
[5] M. Eskevich and G. J. F. Jones. Time-based Segmentation and Use of Jump-in Points in DCU Search Runs at the Search and Hyperlinking Task at MediaEval 2013. In Proc. of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, 2013.
[6] J. L. Gauvain, L. Lamel, and G. Adda. The LIMSI broadcast news transcription system. Speech Communication 37, pages 89-108, 2002.
[7] L. Jiang, T. Mitamura, S.-I. Yu, and A. G. Hauptmann. Zero-Example Event Search using MultiModal Pseudo Relevance Feedback. In Proc. of the International Conference on Multimedia Retrieval, page 297. ACM, Glasgow, UK, 2014.
[8] R. Krovetz. Viewing morphology as an inference process. In Proc. of SIGIR'93, pages 191-202, Pittsburgh, USA, 1993.
[9] S. Walker, S. E. Robertson, M. Boughanem, G. J. F. Jones, and K. Sparck Jones. Okapi at TREC-6: Automatic ad hoc, VLC, routing, filtering and QSDR. In Proc. of the Text REtrieval Conference (TREC-6), pages 125-136, 1998.
[10] C. Zhai. Notes on the Lemur TFIDF model. http://www.cs.cmu.edu/~lemur/tfidf.ps