=Paper=
{{Paper
|id=Vol-1263/paper31
|storemode=property
|title=DCU Search Runs at MediaEval 2014 Search and Hyperlinking
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_31.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/RaccaEJ14
}}
==DCU Search Runs at MediaEval 2014 Search and Hyperlinking==
David N. Racca, Maria Eskevich, Gareth J.F. Jones

CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Ireland

{dracca, meskevich, gjones}@computing.dcu.ie

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

===ABSTRACT===
We describe Dublin City University (DCU)’s participation in the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Exploratory experiments were carried out to investigate the utility of prosodic prominence features in the task of retrieving relevant video segments from a collection of BBC videos. Normalised acoustic correlates of loudness, pitch, and duration were incorporated in a standard TF-IDF weighting scheme to increase weights for terms that were prominent in speech. Prosodic models outperformed a text-based TF-IDF baseline on the training set but failed to surpass the baseline on the test set.

===1. INTRODUCTION===
Increasing amounts of multimedia content are being produced and stored on a daily basis. In order to make this data useful, computer applications are required that facilitate search, browsing, and navigation through these large data collections. The MediaEval Search and Hyperlinking task seeks to contribute to addressing this problem.

In contrast with previous years, where a known-item task was examined, this year an ad-hoc search task was introduced. The retrieval collection consists of an extension of last year’s collection, comprising 4021 hours of BBC TV broadcast content split into training and test sets of 1335 and 2686 hours respectively. For every video file in the collection, the organisers provided human-generated subtitles, three different automatic speech recognition (ASR) transcripts (LIMSI/Vocapia, LIUM, and NST-Sheffield), prosodic features, shot boundaries, visual concept detection output, and additional metadata associated with each TV show. The training set includes 50 text queries while the test set comprises 30 queries. More details about the data collection and task evaluation metrics can be found in [4].

Previous research has demonstrated that prosodic information is useful for a wide range of speech processing tasks [6], including speech search tasks. In [2], Crestani suggests that there might be a direct relationship between acoustic stress of terms and their TF-IDF score in the OGI Stories Corpus, while Chen reports improvements on a spoken document retrieval task by using energy and durational features [1]. In [5], Guinaudeau and Hirschberg improve a topic tracking system by incorporating intensity and pitch values into the retrieval weighting scheme.

This paper describes an implementation of an approach that incorporates loudness, duration, and pitch into TF-IDF weights in order to examine their potential to improve retrieval effectiveness of video segments.

===2. FEATURE PROCESSING===
Following Guinaudeau’s method [5], loudness and pitch correlates were extracted from the speech signal, normalised, and aligned to each word occurrence in the transcripts. To perform this alignment, word timestamps were used in the case of the LIMSI/Vocapia and NST-Sheffield transcripts. For subtitles, word timestamps had to be approximated from each segment’s starting and ending timestamps. This was done by dividing the length of a segment by the number of words it contains to obtain the average word duration for that segment. Starting times and durations of words were then approximated by considering the starting time of a segment plus multiples of its average word duration. In the case of the LIUM transcript, the duration of words was approximated for the test set by the average word duration of all words in the training set.

After the alignment was performed, the minimum, maximum, mean, and standard deviation of loudness and pitch were computed for each word. These four statistics were normalised so that they could be compared against those of other words spoken in different acoustic conditions. The final objective was to calculate an acoustic score for each spoken word that represents how salient the word is relative to its surrounding context. With this in mind, two different definitions of surrounding context for a word were considered:

* Context given by the words that belong to the same speech segment predicted by the ASR (seg).
* Full length of the document, that is, all the words spoken in the video (doc).

Finally, two normalisation functions were explored for normalising a feature <math>f_i</math> over a context <math>C</math>:

# Range: <math>(f_i - \min_C)/(\max_C - \min_C)</math>.
# Z-score: <math>(f_i - \mu_C)/\sigma_C</math>.

===3. RETRIEVAL FRAMEWORK===
Text transcripts were segmented into fixed-time adjacent (non-overlapping) segments of 90 seconds duration. Before indexing, stop words from the standard Terrier list [7] were removed and Porter stemming was applied. Segments were then indexed using a modified version of Terrier-3.5 that associated acoustic features with term occurrences in the inverted index. Note that due to stemming, multiple words can be mapped to the same stem. In these cases, acoustic feature vectors associated with each non-stemmed word occurrence in a segment were treated as belonging to the same stem and thus were linked with this term in the inverted index.

Retrieval was performed using Terrier’s standard implementation of the vector space model (VSM) with a modified TF-IDF weighting function that takes into account the acoustic features from the inverted index when computing term weights. The weight of a term <math>t</math> in a segment was computed using Guinaudeau’s harmonic mean [5]:

<math>w(t) = \frac{\theta_{ir} \cdot idf_t \cdot tf_t + \theta_{ac} \cdot ac_t}{\theta_{ir} + \theta_{ac}}</math>

Different definitions were explored for the acoustic score (<math>ac_t</math>). In all cases, <math>ac_t</math> was intended to represent the level of salience of <math>t</math> relative to its surrounding context. In particular, simple multiplications of the maximum loudness and maximum pitch (G-lp), the pitch range considering the maximum and minimum pitch (G-pr), and the maximum duration (G-dur) with which <math>t</math> was pronounced in the segment were used as definitions for <math>ac_t</math>. Values for the free parameters <math>\theta_{ir}</math> and <math>\theta_{ac}</math> were selected to optimise mean reciprocal rank (MRR), mean generalised average precision (mGAP), and mean average segment precision (MASP) [3] on the training set for individual ASR transcripts and normalisation types. Specifically, the runs G-lp, G-pr, and G-dur were optimised for the LIUM, NST-Sheffield, and LIMSI transcripts respectively.

===4. RESULTS AND CONCLUSIONS===
Table 1 shows details of the submitted runs and presents a summary of evaluation results over the test set for every type of transcript. Runs that made use of prosodic information for computing term weights, namely G-pr, G-lp, and G-dur, clearly underperformed the baseline TF-IDF system in general. Given that the baseline system can be beaten on the training set, results over the test set suggest that models optimised on MRR, mGAP, and MASP for a known-item task evaluation scheme, as is the case for the training set, fail to generalise to an ad-hoc retrieval task performed over the test set.

{| class="wikitable"
|+ Table 1: Evaluation results over the test set. Overlap MAP [MAP], Binned MAP [MAP-bin], and Tolerance to Irrelevance MAP [MAP-tol] are shown in the Results columns. Precision values at different cut-offs are based on a “Tolerance to Irrelevance” definition of relevance.
! rowspan="2" | Transcript Type
! colspan="2" | Normalisation Type
! colspan="3" | Run Parameters
! colspan="6" | Results
|-
! Function !! Context !! Weighting Scheme !! θ<sub>ir</sub> !! θ<sub>ac</sub> !! MAP !! MAP-bin !! MAP-tol !! P@5 !! P@10 !! P@20
|-
| rowspan="4" | Subtitles
| - || - || TF-IDF || 1 || 0 || 0.639 || 0.394 || 0.293 || 0.727 || 0.627 || 0.478
|-
| range || seg || G-pr || 2 || 3 || 0.599 || 0.371 || 0.278 || 0.727 || 0.583 || 0.452
|-
| z-score || doc || G-lp || 3 || 1 || 0.533 || 0.336 || 0.265 || 0.633 || 0.570 || 0.433
|-
| - || - || G-dur || 1 || 3 || 0.345 || 0.206 || 0.154 || 0.367 || 0.327 || 0.280
|-
| rowspan="4" | NST-Sheffield
| - || - || TF-IDF || 1 || 0 || 0.440 || 0.295 || 0.215 || 0.600 || 0.513 || 0.393
|-
| range || seg || G-pr || 2 || 3 || 0.434 || 0.294 || 0.214 || 0.613 || 0.510 || 0.393
|-
| z-score || doc || G-lp || 3 || 1 || 0.435 || 0.294 || 0.218 || 0.587 || 0.503 || 0.393
|-
| - || - || G-dur || 1 || 3 || 0.404 || 0.272 || 0.199 || 0.567 || 0.457 || 0.363
|-
| rowspan="4" | LIMSI
| - || - || TF-IDF || 1 || 0 || 0.525 || 0.339 || 0.242 || 0.620 || 0.543 || 0.430
|-
| range || seg || G-pr || 2 || 3 || 0.508 || 0.331 || 0.239 || 0.607 || 0.543 || 0.423
|-
| z-score || doc || G-lp || 3 || 1 || 0.428 || 0.283 || 0.201 || 0.460 || 0.457 || 0.370
|-
| - || - || G-dur || 1 || 3 || 0.505 || 0.330 || 0.237 || 0.607 || 0.537 || 0.413
|-
| rowspan="4" | LIUM
| - || - || TF-IDF || 1 || 0 || 0.451 || 0.300 || 0.233 || 0.693 || 0.573 || 0.430
|-
| range || seg || G-pr || 2 || 3 || 0.444 || 0.293 || 0.222 || 0.653 || 0.527 || 0.418
|-
| z-score || doc || G-lp || 3 || 1 || 0.436 || 0.291 || 0.215 || 0.633 || 0.547 || 0.412
|-
| - || - || G-dur || 1 || 3 || 0.358 || 0.240 || 0.186 || 0.540 || 0.453 || 0.357
|}

It is important to note that text queries for the training and test sets were produced with different objectives in mind. This could be another reason why the models presented in this work seem to have overfitted the training set.

In future work, an error analysis will be carried out in order to identify queries for which prosodic-based models could have outperformed the baseline.

===5. ACKNOWLEDGMENTS===
This work was supported by Science Foundation Ireland (Grant 12/CE/I2267) as part of the Centre for Global Intelligent Content CNGL II project at DCU.

===6. REFERENCES===
[1] B. Chen, H.-M. Wang, and L.-S. Lee. Improved spoken document retrieval by exploring extra acoustic and linguistic cues. In Proceedings Interspeech’01, pages 299–302, Aalborg, Denmark, 2001.

[2] F. Crestani. Towards the use of prosodic information for spoken document retrieval. In Proceedings ACM SIGIR’01, pages 420–421, New Orleans, LA, USA, 2001.

[3] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, 2013.

[4] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2014. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, 2014.

[5] C. Guinaudeau and J. Hirschberg. Accounting for prosodic information to improve ASR-based topic tracking for TV broadcast news. In Proceedings Interspeech’11, pages 1401–1404, Florence, Italy, 2011.

[6] J. Hirschberg. Communication and prosody: Functional aspects of prosody. Speech Communication, 36(1):31–43, 2002.

[7] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: a search engine for advanced retrieval on the web. Novatica/UPGRADE Special Issue on Next Generation Web Search, pages 49–56, 2007.
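
The approximation of word timestamps for subtitles described in the feature processing section can be illustrated with a minimal sketch. It assumes that every word in a subtitle segment receives the same duration, equal to the segment length divided by its number of words. The function name `approximate_word_times` and its argument layout are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple


def approximate_word_times(
    words: List[str], seg_start: float, seg_end: float
) -> List[Tuple[str, float, float]]:
    """Return (word, start_time, duration) for each word of one subtitle segment.

    Hypothetical helper: assumes all words in the segment are equally long.
    """
    if not words:
        return []
    # Average word duration = segment length / number of words in the segment.
    avg_duration = (seg_end - seg_start) / len(words)
    # The k-th word starts at the segment start plus k average word durations.
    return [
        (word, seg_start + k * avg_duration, avg_duration)
        for k, word in enumerate(words)
    ]


if __name__ == "__main__":
    for entry in approximate_word_times(["good", "evening", "and", "welcome"], 10.0, 12.0):
        print(entry)
```

For a four-word segment running from 10.0 s to 12.0 s, each word is assigned 0.5 s, so the words start at 10.0, 10.5, 11.0, and 11.5 s.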
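
The two normalisation functions from the feature processing section (range and z-score over a context C) can be sketched as follows. The context is represented here simply as the list of feature values of all words in the same ASR segment (seg) or the same video (doc). Two details are assumptions, since the paper does not state them: the population standard deviation is used for the z-score, and a constant context maps every value to 0.0 to avoid division by zero. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def range_normalise(context: List[float]) -> List[float]:
    """Range normalisation: (f_i - min_C) / (max_C - min_C)."""
    lo, hi = min(context), max(context)
    if hi == lo:
        # Assumption: a constant context carries no salience information.
        return [0.0 for _ in context]
    return [(f - lo) / (hi - lo) for f in context]


def zscore_normalise(context: List[float]) -> List[float]:
    """Z-score normalisation: (f_i - mu_C) / sigma_C."""
    mu = mean(context)
    sigma = pstdev(context)  # Assumption: population standard deviation.
    if sigma == 0:
        return [0.0 for _ in context]
    return [(f - mu) / sigma for f in context]


if __name__ == "__main__":
    max_pitch_per_word = [180.0, 220.0, 260.0]  # made-up values for illustration
    print(range_normalise(max_pitch_per_word))
    print(zscore_normalise(max_pitch_per_word))
```

Switching between the seg and doc settings then only changes which words are collected into `context` before the function is called.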
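
The fixed-time segmentation used in the retrieval framework can be sketched in a few lines. This is an illustration only: it assumes that a word is assigned to a 90-second window according to its start time, which the paper does not specify, and the function name `segment_transcript` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def segment_transcript(
    words: List[Tuple[str, float]], window: float = 90.0
) -> Dict[int, List[str]]:
    """Group (word, start_time) pairs into adjacent, non-overlapping windows.

    Assumption: a word belongs to the window that contains its start time.
    """
    segments: Dict[int, List[str]] = defaultdict(list)
    for word, start in words:
        # Window index 0 covers [0, 90), index 1 covers [90, 180), and so on.
        segments[int(start // window)].append(word)
    return dict(segments)


if __name__ == "__main__":
    transcript = [("hello", 0.0), ("world", 89.9), ("next", 90.0), ("later", 200.0)]
    print(segment_transcript(transcript))
```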
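
The modified term weight from the retrieval framework section combines the TF-IDF score and the acoustic score through the parameters θir and θac. The sketch below implements that weighted combination together with the three acoustic score definitions. Several points are assumptions: the inputs are already normalised and already aggregated over all occurrences of the term in the segment, the pitch range for G-pr is taken as maximum minus minimum pitch, and all function and argument names are hypothetical rather than part of Terrier or the authors' code.

```python
def acoustic_score(
    scheme: str,
    max_loudness: float,
    max_pitch: float,
    min_pitch: float,
    max_duration: float,
) -> float:
    """Acoustic score ac_t of a term under one of the three run definitions."""
    if scheme == "G-lp":
        # Product of maximum loudness and maximum pitch.
        return max_loudness * max_pitch
    if scheme == "G-pr":
        # Assumption: pitch range is the difference between max and min pitch.
        return max_pitch - min_pitch
    if scheme == "G-dur":
        # Maximum duration with which the term was pronounced.
        return max_duration
    raise ValueError(f"unknown scheme: {scheme}")


def term_weight(
    tf: float, idf: float, ac: float, theta_ir: float, theta_ac: float
) -> float:
    """w(t) = (theta_ir * idf_t * tf_t + theta_ac * ac_t) / (theta_ir + theta_ac)."""
    return (theta_ir * idf * tf + theta_ac * ac) / (theta_ir + theta_ac)


if __name__ == "__main__":
    ac = acoustic_score("G-pr", max_loudness=0.8, max_pitch=0.9,
                        min_pitch=0.4, max_duration=0.3)
    # theta_ir=1, theta_ac=0 reduces to plain TF-IDF (the baseline setting in Table 1).
    print(term_weight(tf=2, idf=1.5, ac=ac, theta_ir=1, theta_ac=0))
    # theta_ir=2, theta_ac=3 is the G-pr setting reported in Table 1.
    print(term_weight(tf=2, idf=1.5, ac=ac, theta_ir=2, theta_ac=3))
```

With θac set to 0 the acoustic score has no effect, which is why the baseline rows in Table 1 use θir = 1 and θac = 0.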