Time-based Segmentation and Use of Jump-in Points in DCU Search Runs at the Search and Hyperlinking Task at MediaEval 2013

Maria Eskevich and Gareth J.F. Jones
CNGL Centre for Global Intelligent Content, School of Computing,
Dublin City University, Dublin, Ireland
meskevich@computing.dcu.ie, gjones@computing.dcu.ie

ABSTRACT
We describe the runs for our participation in the Search sub-task of the Search and Hyperlinking Task at MediaEval 2013. Our experiments investigate the effect of using information about speech segment boundaries and pauses on the effectiveness of retrieving jump-in points within the retrieved segments. We segment all three available types of transcripts (automatic ones provided by LIMSI/Vocapia and LIUM, and manual subtitles provided by the BBC) into fixed-length time units, and present the resulting runs using both the original segment starts and the potential jump-in points. Our method for adjustment of the jump-in points achieves higher scores for all LIMSI/Vocapia, LIUM, and subtitle-based runs.

1. INTRODUCTION
The constant growth in the size and variability of stored digital multimedia content requires the development of techniques that not only identify files containing relevant content, but also bring the user as close as possible to the beginning of the relevant passage within each file, to maximize the efficiency of information access. This starting point, referred to as the jump-in point, cannot simply be tied to the locations where the words of interest are spoken, since the user may need to listen to the whole utterance in which the words were used, or to a slightly larger passage, in order to grasp the context. We therefore assume that jump-in points should occur at the beginning of speech segments or utterances, and may be signalled by a pause in the speech signal. This idea underlies our experimental setup. We create one retrieval run for each fixed-length segmentation unit, but present it in two ways for comparison: with the initial boundaries of the segments, and with adjusted jump-in points.

2. DATASET AND EVALUATION METRICS
The Search and Hyperlinking Task at MediaEval 2013 uses television broadcast data provided by the BBC, enhanced with additional content such as automatic speech recognition (ASR) transcripts [2]. The collection consists of circa 1260 hours of data representing 6 weeks of broadcast content, including news programs, talk shows, episodes of TV series, etc. The 50 test set queries for the known-item retrieval task were created during user studies at the BBC [1].
The task was evaluated using three metrics: mean reciprocal rank (MRR), which scores the rank of the retrieved segment containing relevant content; mean generalized average precision (mGAP), which combines the rank of the relevant segment with the distance to the ideal jump-in point at the start of the relevant content within the segment [6]; and mean average segment precision (MASP), which combines the rank of the relevant segment with the (ir)relevant length of the segment [3].
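To make the evaluation concrete, the following minimal Python sketch computes a windowed MRR and a simplified mGAP for a known-item task. It is illustrative only: the function names, the representation of a run as a ranked list of jump-in times, and the linear distance penalty in mean_gap are our assumptions here, not the exact formulations used in the official scoring (see [6] and [3] for the definitions); MASP is omitted.

```python
from typing import List

def windowed_mrr(runs: List[List[float]], targets: List[float],
                 window: float = 60.0) -> float:
    """Mean reciprocal rank: for each query, credit 1/rank for the first
    result whose jump-in point lies within `window` seconds of the known
    relevant start time (one known item per query)."""
    total = 0.0
    for starts, target in zip(runs, targets):
        for rank, start in enumerate(starts, 1):
            if abs(start - target) <= window:
                total += 1.0 / rank
                break  # only the first hit contributes
    return total / len(targets)

def mean_gap(runs: List[List[float]], targets: List[float],
             window: float = 60.0) -> float:
    """Simplified mGAP: the reciprocal-rank credit decays linearly with the
    distance between the returned jump-in point and the ideal one, reaching
    zero at `window` seconds. (The penalty shape is an assumption.)"""
    total = 0.0
    for starts, target in zip(runs, targets):
        for rank, start in enumerate(starts, 1):
            dist = abs(start - target)
            if dist <= window:
                total += (1.0 - dist / window) / rank
                break
    return total / len(targets)

# Example: one query whose relevant content starts at 125 s; the second
# result points to 118 s, i.e. 7 s early, so it scores under both metrics.
print(windowed_mrr([[300.0, 118.0]], [125.0]))  # 0.5
print(mean_gap([[300.0, 118.0]], [125.0]))      # ~0.44
```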
3. RETRIEVAL FRAMEWORK
As the files in the collection vary in style and length, we decided to segment all the content into fixed-length units. For these experiments we chose three values for the segment length: 60, 90, and 180 seconds. These time units were the same for all types of transcripts. However, the transcripts do not always cover the spoken content in the same way: the ASR systems might recognise some noise as words, while the humans who create the manual subtitles might consider certain parts of the video not relevant for transcription. This explains the differences in the number of documents across transcript types and time units given in Table 1.

Table 1: Number of documents

                   Window size (seconds)
Transcript type      60       90      180
LIMSI            96 418   64 403   31 907
LIUM             95 091   63 308   31 210
Subtitles        82 220   54 698   26 742

At the segmentation stage we stored the information about the potential jump-in points within each segment in a separate file. The LIMSI/Vocapia transcript contains speech segment boundaries predicted by their system [4]; the LIUM transcript [7] has only time stamps for the words in the transcript; and the manual subtitles have time stamps assigned at the utterance level. Thus, for the official submission to the task, we used as potential jump-in points the speech segments in the case of the LIMSI transcript, pauses longer than 0.5 seconds between words in the case of the LIUM transcript, and utterances in the case of the manual subtitles. Additionally, we created an unofficial run that uses the pauses in the LIMSI transcript, in order to allow a better comparison with the other types of transcript.
We do not have access to details of the ASR transcription systems. However, we can distinguish them by the size of the vocabulary they used for this collection, which contains 36,815, 57,259, and 98,332 entries for LIMSI/Vocapia, LIUM, and the subtitles respectively.
For indexing and retrieval experiments we used the open-source Terrier Information Retrieval platform (http://www.terrier.org) [5] with a standard language modelling method, with the default lambda value of 0.15.
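As an illustration of the pipeline described in this section, the sketch below segments a word/time-stamped transcript (as in the LIUM case) into fixed-length units, records pauses longer than 0.5 seconds as potential jump-in points, and adjusts a retrieved segment's start accordingly. All names (Segment, segment_transcript, adjusted_start) are hypothetical, and the adjustment rule shown, snapping the fixed boundary to the nearest stored jump-in point, is our assumption, since the exact rule is not spelled out above; for LIMSI the stored points would instead be speech segment boundaries, and for subtitles utterance starts.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MIN_PAUSE = 0.5   # pauses longer than this (seconds) become jump-in points

@dataclass
class Segment:
    start: float                                          # fixed unit boundaries
    end: float
    words: List[str] = field(default_factory=list)
    jump_ins: List[float] = field(default_factory=list)   # kept in a separate file in our runs

def segment_transcript(words: List[Tuple[str, float]],
                       unit: float = 60.0) -> List[Segment]:
    """Split (word, start_time) pairs into fixed-length units of `unit`
    seconds (60, 90 or 180 in the experiments), recording where speech
    resumes after a pause. Pauses are approximated from consecutive word
    start times, since the LIUM transcript carries word time stamps only."""
    segments: List[Segment] = []
    seg = Segment(0.0, unit)
    prev = None
    for token, start in words:
        while start >= seg.end:          # advance to the unit containing this word
            segments.append(seg)
            seg = Segment(seg.end, seg.end + unit)
        if prev is not None and start - prev > MIN_PAUSE:
            seg.jump_ins.append(start)   # speech resumes here after a pause
        seg.words.append(token)
        prev = start
    segments.append(seg)
    # Units with no recognised speech are dropped, which is one reason the
    # document counts in Table 1 differ across transcript types.
    return [s for s in segments if s.words]

def adjusted_start(seg: Segment) -> float:
    """Jump-in point reported for a retrieved segment: the stored point
    nearest to the fixed boundary, falling back to the boundary itself."""
    if not seg.jump_ins:
        return seg.start
    return min(seg.jump_ins, key=lambda t: abs(t - seg.start))
```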
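The retrieval side can be sketched in a similar spirit. A lambda of 0.15 matches a linear interpolation of document and collection term statistics (Hiemstra-style smoothing, where we assume lambda weights the document model); the snippet below is a self-contained illustration of such a scoring function, not Terrier's implementation, and lm_score and its data layout are hypothetical.

```python
import math
from collections import Counter
from typing import List

LAMBDA = 0.15  # weight on the document model; the default value used in our runs

def lm_score(query: List[str], doc: Counter, coll: Counter,
             coll_len: int, lam: float = LAMBDA) -> float:
    """Query-likelihood score log P(q|d) with linear smoothing:
    P(t|d') = lam * tf/|d| + (1 - lam) * cf/|C|.
    Query terms absent from the whole collection are skipped."""
    doc_len = sum(doc.values())
    score = 0.0
    for t in query:
        p_coll = coll[t] / coll_len
        if p_coll == 0.0:
            continue                      # out-of-vocabulary term
        p_doc = doc[t] / doc_len if doc_len else 0.0
        score += math.log(lam * p_doc + (1.0 - lam) * p_coll)
    return score

# Toy usage: rank two tiny "segments" for a one-word query.
coll = Counter({"news": 50, "football": 5, "weather": 20})
seg_a = Counter({"football": 3, "news": 1})
seg_b = Counter({"weather": 4})
q = ["football"]
print(lm_score(q, seg_a, coll, sum(coll.values())) >
      lm_score(q, seg_b, coll, sum(coll.values())))  # True
```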
4. RESULTS, CONCLUSIONS AND FURTHER WORK
Table 2 shows the results obtained for all three types of transcript and the different segmentation unit sizes. The lines for the same transcript represent the same retrieval run, with the second line (and the third one for LIMSI) representing the result list enhanced with jump-in point information. Adding this information increases MRR and mGAP in every case; only some of the MASP scores, mainly those for the 60-second units, decrease slightly. In the case of the LIMSI transcript there is no consistent indication of whether the use of speech segments or of pauses is preferable. This may be caused by the fact that these potential jump-in points sometimes coincide. For the shorter segments the use of speech segments gives better mGAP scores, meaning that the speech-segment-based jump-in point brings the user closer to the beginning of the relevant content.

Table 2: Metric results (evaluation window = 60 seconds; the columns under each metric give the segment unit size in seconds)

Transcript  Speech   Pause        MRR                 mGAP                 MASP
type        segment           60    90    180     60    90    180     60    90    180
LIMSI          –       –    0.241 0.266 0.185   0.132 0.133 0.089   0.142 0.138 0.010
LIMSI          +       –    0.250 0.295 0.240   0.151 0.164 0.132   0.132 0.146 0.124
LIMSI*         –       +    0.258 0.305 0.240   0.150 0.153 0.135   0.139 0.145 0.124
LIUM           –       –    0.265 0.298 0.205   0.124 0.152 0.094   0.140 0.169 0.103
LIUM           –       +    0.284 0.317 0.254   0.146 0.163 0.114   0.138 0.173 0.126
Subtitles      –       –    0.343 0.369 0.217   0.209 0.191 0.092   0.223 0.231 0.093
Subtitles      –       +    0.365 0.376 0.280   0.211 0.221 0.154   0.212 0.220 0.116

(* marks the unofficial LIMSI run that uses pauses instead of speech segments.)

Table 3 shows the MRR results for varying window sizes used in the metric calculation. The LIUM and subtitle-based runs show the same trend of improvement in score from the use of pause information in calculating the jump-in point. These results allow us to argue that the simple time-based segmentation results can be improved by using the pause information contained in all types of transcripts.

Table 3: MRR results with varying evaluation window size (60, 30, 10 seconds)

Transcript  Speech   Pause   Unit size = 60 s    Unit size = 90 s    Unit size = 180 s
type        segment           60    30    10      60    30    10      60    30    10
LIMSI          –       –    0.241 0.169 0.090   0.266 0.195 0.059   0.185 0.107 0.041
LIMSI          +       –    0.250 0.223 0.091   0.295 0.226 0.110   0.240 0.178 0.080
LIMSI*         –       +    0.258 0.194 0.109   0.305 0.226 0.090   0.240 0.175 0.081
LIUM           –       –    0.265 0.157 0.071   0.298 0.204 0.080   0.205 0.116 0.041
LIUM           –       +    0.284 0.182 0.106   0.317 0.213 0.110   0.254 0.146 0.081
Subtitles      –       –    0.343 0.273 0.144   0.369 0.255 0.096   0.217 0.113 0.042
Subtitles      –       +    0.365 0.300 0.155   0.376 0.292 0.141   0.280 0.193 0.120

5. ACKNOWLEDGMENTS
This work was supported by Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (CNGL) project at DCU.

6. REFERENCES
[1] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection: what and how to measure? In WWW (Companion Volume), pages 457–460, 2013.
[2] M. Eskevich, G. J. F. Jones, S. Chen, R. Aly, and R. Ordelman. The Search and Hyperlinking Task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] M. Eskevich, W. Magdy, and G. J. F. Jones. New metrics for meaningful evaluation of informally structured speech retrieval. In Proceedings of ECIR 2012, pages 170–181, 2012.
[4] L. Lamel and J.-L. Gauvain. Speech processing for audio indexing. In Advances in Natural Language Processing (LNCS 5221), pages 4–15. Springer, 2008.
[5] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006.
[6] P. Pecina, P. Hoffmannová, G. J. F. Jones, Y. Zhang, and D. W. Oard. Overview of the CLEF 2007 cross-language speech retrieval track. In Proceedings of CLEF 2007, pages 674–686. Springer, 2007.
[7] A. Rousseau, F. Bougares, P. Deléglise, H. Schwenk, and Y. Estève. LIUM's systems for the IWSLT 2011 Speech Translation Tasks. In Proceedings of IWSLT 2011, San Francisco, USA, 2011.