
DCU Search Runs at MediaEval 2014 Search and Hyperlinking

David N. Racca, Maria Eskevich, Gareth J.F. Jones
CNGL Centre for Global Intelligent Content, School of Computing, Dublin City University, Ireland

2014


We describe Dublin City University (DCU)'s participation in the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Exploratory experiments were carried out to investigate the utility of prosodic prominence features in the task of retrieving relevant video segments from a collection of BBC videos. Normalised acoustic correlates of loudness, pitch, and duration were incorporated in a standard TF-IDF weighting scheme to increase weights for terms that were prominent in speech. Prosodic models outperformed a text-based TF-IDF baseline on the training set but failed to surpass the baseline on the test set.

1. INTRODUCTION

Increasing amounts of multimedia content are being produced and stored on a daily basis. In order to make this data useful, computer applications are required that facilitate search, browsing, and navigation through these large data collections. The MediaEval Search and Hyperlinking task seeks to contribute to addressing this problem.

In contrast with previous years, where a known-item task was examined, this year an ad-hoc search task was introduced. The retrieval collection consists of an extension of last year's collection, comprising 4021 hours of BBC TV broadcast content split into training and test sets of 1335 and 2686 hours respectively. For every video file in the collection, the organizers provided human-generated subtitles, three different automatic speech recognition (ASR) transcripts (LIMSI/Vocapia, LIUM, and NST-Sheffield), prosodic features, shot boundaries, visual concept detection output, and additional metadata associated with each TV show. The training set includes 50 text queries while the test set comprises 30 queries. More details about the data collection and task evaluation metrics can be found in [4].

Previous research has demonstrated that prosodic information is useful for a wide range of speech processing tasks [6], including speech search tasks. In [2], Crestani suggests that there might be a direct relationship between acoustic stress of terms and their TF-IDF score in the OGI Stories Corpus, while Chen reports improvements on a spoken document retrieval task by using energy and durational features [1]. In [5], Guinaudeau and Hirschberg improve a topic tracking system by incorporating intensity and pitch values into the retrieval weighting scheme. This paper describes an implementation of an approach that incorporates loudness, duration, and pitch into TF-IDF weights in order to examine their potential to improve retrieval effectiveness of video segments.

2. FEATURE PROCESSING

Following Guinaudeau's method [5], loudness and pitch correlates were extracted from the speech signal, normalised, and aligned to each word occurrence in the transcripts. To perform this alignment, word timestamps were used in the case of the LIMSI/Vocapia and NST-Sheffield transcripts. For subtitles, word timestamps had to be approximated from each segment's starting and ending timestamps. This was done by dividing the length of a segment by the number of words it contains to obtain the average word duration for that segment. The starting time and duration of each word were then approximated by taking the starting time of the segment plus multiples of its average word duration. In the case of the LIUM transcript, the duration of words was approximated for the test set by the average word duration of all words in the training set.
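The subtitle approximation described above can be sketched as follows. This is a minimal illustration rather than the code used for the submitted runs: the function name `approximate_word_times` and the tuple layout are assumptions, and every word in a segment is assumed to take an equal share of the segment's length.

```python
def approximate_word_times(words, seg_start, seg_end):
    """Spread the words of one subtitle segment evenly over its time span.

    Hypothetical helper: returns a list of (word, start, duration) tuples,
    with times in the same unit as seg_start and seg_end.
    """
    if not words:
        return []
    # Average word duration: segment length divided by the number of words.
    avg = (seg_end - seg_start) / len(words)
    # The i-th word is assumed to start i average durations into the segment.
    return [(word, seg_start + i * avg, avg) for i, word in enumerate(words)]
```

For a four-word segment running from 10.0 s to 12.0 s, each word is assigned 0.5 s, and the last word is placed at 11.5 s.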

After the alignment was performed, the minimum, maximum, mean, and standard deviation of loudness and pitch were computed for each word. These four statistics were normalised so that they could be compared against those of other words spoken in different acoustic conditions. The final objective was to calculate an acoustic score for each spoken word that represents how salient the word is relative to its surrounding context. With this in mind, two different definitions of surrounding context for a word were considered:

Context given by the words that belong to the same speech segment predicted by the ASR (seg).

Full length of the document, that is, all the words spoken in the video (doc).

Finally, two normalisation functions were explored for normalising a feature $f_i$ over a context $C$:

1. Range: $(f_i - \min_C) / (\max_C - \min_C)$.

2. Z-score: $(f_i - \mu_C) / \sigma_C$.
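As a rough sketch of the range and z-score normalisations, the functions below normalise a list of feature values drawn from one context (all words of a segment for seg, or all words of a video for doc). The function names, the use of the population standard deviation, and the choice to return zeros for a constant context are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def range_norm(values):
    """Range normalisation: (f - min_C) / (max_C - min_C) for each f in C."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant context carries no salience information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_norm(values):
    """Z-score normalisation: (f - mean_C) / std_C for each f in C."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Calling `range_norm` on the maximum-loudness values of all words in one ASR segment gives the seg variant, while passing the values of every word in the video gives the doc variant.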

3. RETRIEVAL FRAMEWORK

Text transcripts were segmented into fixed-time adjacent (non-overlapping) segments of 90 seconds duration. Before indexing, stop words from the standard Terrier list [7] were removed and Porter stemming applied. Segments were then indexed using a modified version of Terrier-3.5 that associated acoustic features with term occurrences in the inverted index. Note that due to stemming, multiple words can be mapped to the same stem. In these cases, acoustic feature vectors associated with each non-stemmed word occurrence in a segment were treated as belonging to the same stem and thus were linked with this term in the inverted index.

[Table residue: headers Subtitles, NST-Sheffield; Normalisation Type (Function: range, z-score; Context: seg, doc). Cell values not recoverable.]
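The segmentation and stem-level grouping described in this section can be illustrated with the sketch below. It does not reproduce the modified Terrier index: the function names, the `(word, start_time)` input layout, and passing the stemmer in as a callable are all assumptions made to keep the example self-contained.

```python
from collections import defaultdict


def segment_words(timed_words, window=90.0):
    """Assign (word, start_time) pairs to adjacent, non-overlapping windows.

    Returns a dict mapping a window index to the list of words starting in it.
    """
    segments = defaultdict(list)
    for word, start in timed_words:
        segments[int(start // window)].append(word)
    return dict(segments)


def group_features_by_stem(occurrences, stem):
    """Link the feature vectors of all word occurrences that share a stem.

    `occurrences` is a list of (word, feature_vector) pairs from one segment;
    `stem` is any callable mapping a word to its stem (e.g. a Porter stemmer).
    """
    by_stem = defaultdict(list)
    for word, features in occurrences:
        by_stem[stem(word)].append(features)
    return dict(by_stem)
```

With a 90-second window, a word starting at 89.9 s falls in window 0 and a word starting at 90.0 s falls in window 1.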

Retrieval was performed using Terrier's standard implementation of the vector space model (VSM) with a modified TF-IDF weighting function that takes into account the acoustic features from the inverted index when computing term weights. The weight of a term $t$ in a segment was computed using Guinaudeau's harmonic mean [5]:

$$w(t) = \frac{\theta_{ir} \cdot idf_t \cdot tf_t + \theta_{ac} \cdot ac_t}{\theta_{ir} + \theta_{ac}}$$

Different definitions were explored for the acoustic score ($ac_t$). In all cases, $ac_t$ was intended to represent the level of salience of $t$ relative to its surrounding context. In particular, the product of the maximum loudness and maximum pitch (G-lp), the pitch range given by the maximum and minimum pitch (G-pr), and the maximum duration (G-dur) with which $t$ was pronounced in the segment were used as definitions for $ac_t$. Values for the free parameters $\theta_{ir}$ and $\theta_{ac}$ were selected to optimise mean reciprocal rank (MRR), mean generalised average precision (mGAP), and mean average segment precision (MASP) [3] on the training set for each ASR transcript and normalisation type. Specifically, the runs G-lp, G-pr, and G-dur were optimised for the LIUM, NST-Sheffield, and LIMSI transcripts respectively.
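A minimal sketch of the term weighting is given below. It assumes the weight is the combination of the TF-IDF score and the acoustic score weighted by the two free parameters and divided by their sum, which is how the formula reads in this section; the parameter names `theta_ir` and `theta_ac`, the function names, and taking the pitch range as maximum minus minimum are assumptions, not details confirmed by the paper.

```python
def acoustic_score(kind, max_loudness=0.0, max_pitch=0.0,
                   min_pitch=0.0, max_duration=0.0):
    """Acoustic score of a term in a segment, from normalised features."""
    if kind == "G-lp":
        # Product of maximum loudness and maximum pitch.
        return max_loudness * max_pitch
    if kind == "G-pr":
        # Pitch range, assumed here to be maximum minus minimum pitch.
        return max_pitch - min_pitch
    if kind == "G-dur":
        # Maximum duration with which the term was pronounced.
        return max_duration
    raise ValueError(f"unknown acoustic score definition: {kind}")


def term_weight(tf, idf, ac, theta_ir, theta_ac):
    """Combine the TF-IDF score and the acoustic score of a term."""
    return (theta_ir * idf * tf + theta_ac * ac) / (theta_ir + theta_ac)
```

Setting `theta_ac` to 0 reduces the weight to plain TF-IDF, which corresponds to the text-only baseline under this reading of the formula.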

4. RESULTS AND CONCLUSIONS

[...] training and test sets were produced with different objectives in mind. This could be another reason why the models presented in this work seem to have overfitted the training set.

In future work, an error analysis will be carried out in order to identify queries for which prosodic-based models could have outperformed the baseline.

5. ACKNOWLEDGMENTS

This work was supported by Science Foundation Ireland (Grant 12/CE/I2267) as part of the Centre for Global Intelligent Content CNGL II project at DCU.

6. REFERENCES

[1] B. Chen, H.-M. Wang, and L.-S. Lee. Improved spoken document retrieval by exploring extra acoustic and linguistic cues. In Proceedings Interspeech'01, pages 299-302, Aalborg, Denmark, 2001.

[2] F. Crestani. Towards the use of prosodic information for spoken document retrieval. In Proceedings ACM SIGIR'01, pages 420-421, New Orleans, LA, USA, 2001.

[3] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, 2013.

[4] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2014. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, 2014.

[5] C. Guinaudeau and J. Hirschberg. Accounting for prosodic information to improve ASR-based topic tracking for TV broadcast news. In Proceedings Interspeech'11, pages 1401-1404, Florence, Italy, 2011.

[6] J. Hirschberg. Communication and prosody: Functional aspects of prosody. Speech Communication, 36(1):31-43, 2002.

[7] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: a search engine for advanced retrieval on the web. Novatica/UPGRADE Special Issue on Next Generation Web Search, pages 49-56, 2007.