<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CUNI at MediaEval 2014 Search and Hyperlinking Task: Visual and Prosodic Features in Hyperlinking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Petra Galuščáková, Pavel Pecina</string-name>
          <email>{galuscakova,pecina}@ufal.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Kruliš, Jakub Lokoč</string-name>
          <email>{krulis,lokoc}@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics, Department of Software Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>In this report, we present our experiments performed for the Hyperlinking part of the Search and Hyperlinking Task in the MediaEval Benchmark 2014. Our system successfully combines features from multiple modalities (textual, visual, and prosodic) and confirms the positive effect of our previously proposed segmentation method based on Decision Trees.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The main aim of the Hyperlinking sub-task is to find
segments similar to a given (query) segment in a collection of
audio-visual recordings. The created hyperlinks enable users to
browse the collection, thus improving its exploratory search
capability and adding entertainment value [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The data consists of 1335 hours of BBC broadcast
recordings available for training and 2686 hours available for
testing. In our experiments, we exploit subtitles, automatic
speech recognition transcripts by LIMSI [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], LIUM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and
NST-Sheffield [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], visual features (shots and keyframes),
and prosodic features extracted with openSMILE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], all available for the task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SEARCH SYSTEM</title>
      <p>
        Our search system for the Hyperlinking sub-task is
identical to the system used in the Search sub-task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We apply
the same retrieval model with the same settings and
segmentation methods: the fixed-length segmentation and the
segmentation employing Decision Trees (DT). The segment
length used in the Hyperlinking was tuned on the
training data and set to 50 seconds. Similarly to the Search
sub-task, we also exploit metadata by appending the metadata
of each recording to the text (subtitles/transcripts) of each
of its segments, and we post-filter retrieved segments
which partially overlap with another, higher-ranked segment.
In addition, we also remove all retrieved segments which
partially overlap with the query segment.
      </p>
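      <p>
        The two post-filtering steps can be illustrated by the following
minimal Python sketch. This is our own reconstruction, not the actual
system code: the tuple-based segment representation and the helper names
are illustrative assumptions.
      </p>
      <preformat>
# Minimal sketch of the overlap post-filtering (illustrative assumption,
# not the actual system code). A result is a (video_id, start, end) tuple;
# 'results' is assumed to be sorted by decreasing retrieval score.

def overlaps(a, b):
    """True if two segments come from the same video and share time."""
    return a[0] == b[0] and a[1] &lt; b[2] and b[1] &lt; a[2]

def post_filter(results, query_segment):
    """Drop results overlapping the query segment or a higher-ranked result."""
    kept = []
    for seg in results:
        if overlaps(seg, query_segment):
            continue
        if any(overlaps(seg, better) for better in kept):
            continue
        kept.append(seg)
    return kept
      </preformat>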
    </sec>
    <sec id="sec-3">
      <title>3. HYPERLINKING</title>
      <p>In the Hyperlinking sub-task, we first transformed the
query segment into a textual query consisting of all the
words of the subtitles lying within the segment boundary.
Then, we extended the segment boundary by including the
context surrounding the query segment. The optimal length
of the surrounding context was tuned on the training data;
we used a 200-second-long passage before and after each
segment.</p>
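      <p>
        The following sketch illustrates this query construction step. It is a
minimal reconstruction under assumptions of ours: the (start, end, text)
subtitle layout and the function name are illustrative, not the actual
implementation.
      </p>
      <preformat>
# Illustrative sketch of textual query construction (assumed data layout).
# Subtitles are (start, end, text) tuples with times in seconds.

CONTEXT = 200.0  # tuned on the training data: 200 s before and after

def build_query(subtitles, seg_start, seg_end, context=CONTEXT):
    """Concatenate subtitle words lying within the extended boundary."""
    lo, hi = seg_start - context, seg_end + context
    words = []
    for start, end, text in subtitles:
        # keep subtitle lines overlapping the extended interval
        if start &lt; hi and lo &lt; end:
            words.extend(text.split())
    return " ".join(words)
      </preformat>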
    </sec>
    <sec id="sec-4">
      <title>3.1 Visual Similarity</title>
      <p>
        The visual modality was employed in the following way.
First, we calculated the distance between each keyframe in the
collection and each query segment keyframe using the
Signature Quadratic Form Distance [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ] and Feature
Signatures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (the parameter of the method was tuned on the
training data). Then, we calculated the VisualSimilarity
of each query/segment pair as the maximal similarity
(1 − distance) between the keyframes in the query and the keyframes
in the segment. The calculated VisualSimilarity was used
to modify the final score of the segment in the retrieval for a
particular query segment as follows, where the Weight parameter
was tuned on the training data and Score(segment, query)
is the output of the retrieval on the subtitles/transcripts:
FinalScore(segment, query) = Score(segment, query)
+ Weight × VisualSimilarity(segment, query).
      </p>
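      <p>
        A minimal sketch of this scoring, assuming a distance function sqfd
that implements the Signature Quadratic Form Distance over feature
signatures (not reproduced here):
      </p>
      <preformat>
# Sketch of the visual score combination (illustrative; 'sqfd' stands in
# for the Signature Quadratic Form Distance over feature signatures).

def visual_similarity(query_keyframes, segment_keyframes, sqfd):
    """Maximal (1 - distance) over all query/segment keyframe pairs."""
    return max(1.0 - sqfd(q, s)
               for q in query_keyframes
               for s in segment_keyframes)

def final_score(text_score, vis_sim, weight):
    """FinalScore = Score + Weight * VisualSimilarity."""
    return text_score + weight * vis_sim
      </preformat>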
    </sec>
    <sec id="sec-5">
      <title>3.2 Prosodic Similarity</title>
      <p>The eight prosodic features provided in the data (energy,
loudness, voice probability, pitch, pitch direction, direction
score, voice quality, and harmonics-to-noise ratio) were used
to construct an 8-dimensional prosodic vector for every 10 ms of
the recordings. We took the overlapping sequences of 10 vectors
appearing up to 1 second from the beginning of the query
segment and found the most similar sequence of vectors
in each segment.</p>
      <p>The similarity between two vector sequences was calculated as
the sum of the differences between the corresponding vectors of
the sequences. These differences were calculated as the sum
of the absolute values of the differences between the
corresponding items of the prosodic vectors. To ensure that all
prosodic features have equal weights, each
item of the prosodic vector was normalized to
values between 0 and 1. Due to the computational
costs, we only took into account the vector sequences lying
at most 1 second from the beginning of the segment. The
final score of each segment was calculated in the same way
as the final score for the visual similarity; the Weight for
the prosodic similarity was tuned on the training set.</p>
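      <p>
        The matching can be sketched as follows. This is our reconstruction
under stated assumptions: vectors come every 10 ms (so a 1-second window
holds 100 of them), components are already normalized to [0, 1], and the
negated distance serves as the similarity that enters the final score.
      </p>
      <preformat>
# Sketch of the prosodic sequence matching (illustrative reconstruction).
# 'query_vectors' and 'segment_vectors' hold one 8-dimensional vector per
# 10 ms, with every component already normalized to the range [0, 1].

SEQ_LEN = 10   # sequences of 10 vectors (100 ms)
WINDOW = 100   # only sequences starting within 1 s of the beginning

def seq_distance(a, b):
    """Sum of L1 distances between corresponding vectors of two sequences."""
    return sum(abs(x - y) for u, v in zip(a, b) for x, y in zip(u, v))

def prosodic_similarity(query_vectors, segment_vectors):
    """Negated distance of the best-matching pair of initial sequences."""
    best = float("inf")
    for i in range(min(WINDOW, len(query_vectors) - SEQ_LEN + 1)):
        q = query_vectors[i:i + SEQ_LEN]
        for j in range(min(WINDOW, len(segment_vectors) - SEQ_LEN + 1)):
            s = segment_vectors[j:j + SEQ_LEN]
            best = min(best, seq_distance(q, s))
    return -best  # smaller distance means higher similarity
      </preformat>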
    </sec>
    <sec id="sec-6">
      <title>4. RESULTS</title>
      <p>
        The results of the Hyperlinking sub-task are displayed in
Table 1. We report the following evaluation measures: Mean
Average Precision (MAP), Precision at 5 (P5), Precision at
10 (P10), Precision at 20 (P20), Binned Relevance
(MAP-bin), and Tolerance to Irrelevance (MAP-tol) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Not surprisingly, the highest scores in MAP, MAP-bin, and the
precision-based measures are reached when
overlapping segments are preserved in the results [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Unlike in the Search sub-task, the segmentation employing
Decision Trees outperforms the fixed-length
segmentation on most of the measures. There is a consistent
improvement when the visual weights are used,
and a small but promising improvement in the MAP, P20, and
MAP-tol measures when the prosodic features are
used. The concatenation of the context and metadata
also proves to be beneficial: the improvement appears on all
transcripts, and the MAP score rises more than five-fold on the
LIUM transcripts when metadata and context are used.
      </p>
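      <p>
        For reference, the standard definitions of the precision-based and
MAP measures are sketched below; the segment-oriented adaptations MAP-bin
and MAP-tol are defined in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and are not reproduced here.
      </p>
      <preformat>
# Standard P@k and MAP over binary relevance judgements (best rank first);
# the segment-based variants MAP-bin and MAP-tol follow [1].

def precision_at_k(rels, k):
    """Fraction of relevant items among the top k results."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """Mean of P@rank taken at the rank of each relevant result."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    # divide by the number of relevant items (assumed all present in 'rels')
    return score / max(1, sum(rels))

def mean_average_precision(rankings):
    return sum(map(average_precision, rankings)) / len(rankings)
      </preformat>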
    </sec>
    <sec id="sec-7">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This research is supported by the Czech Science
Foundation (grant number P103/12/G084), the Charles University
Grant Agency GA UK (grant number 920913), and the SVV
project number 260 104.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Adapting Binary Information Retrieval Evaluation Metrics for Segment-based Retrieval Tasks</article-title>
          . CoRR, abs/1312.1913,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J. F.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Linking Inside a Video Collection: What and How to Measure?</article-title>
          <source>In Proc. of WWW</source>
          , pages
          <fpage>457</fpage>
          -
          <lpage>460</lpage>
          , Rio de Janeiro, Brazil,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Beecks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Uysal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Seidl</surname>
          </string-name>
          .
          <article-title>Signature Quadratic Form Distance</article-title>
          .
          <source>In Proc. of CIVR</source>
          , pages
          <fpage>438</fpage>
          -
          <lpage>445</lpage>
          , Xi'an, China,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>The Search and Hyperlinking Task at MediaEval 2014</article-title>
          .
          <source>In Proc. of MediaEval</source>
          , Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Recent Developments in openSMILE, the Munich Open-source Multimedia Feature Extractor</article-title>
          .
          <source>In Proc. of ACMMM</source>
          , pages
          <fpage>835</fpage>
          -
          <lpage>838</lpage>
          , Barcelona, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Pecina</surname>
          </string-name>
          .
          <article-title>CUNI at MediaEval 2014 Search and Hyperlinking Task: Search Task Experiments</article-title>
          .
          <source>In Proc. of MediaEval</source>
          , Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kruliš</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lokoč</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Skopal</surname>
          </string-name>
          .
          <article-title>Efficient Extraction of Feature Signatures Using Multi-GPU Architecture</article-title>
          .
          <source>In MMM (2)</source>
          , volume
          <volume>7733</volume>
          <source>of LNCS</source>
          , pages
          <fpage>446</fpage>
          -
          <lpage>456</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kruliš</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Skopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lokoč</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Beecks</surname>
          </string-name>
          .
          <article-title>Combining CPU and GPU Architectures for Fast Similarity Search</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          ,
          <volume>30</volume>
          (
          <issue>3-4</issue>
          ):
          <fpage>179</fpage>
          -
          <lpage>207</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          .
          <article-title>Speech Processing for Audio Indexing</article-title>
          .
          <source>In Proc. of GoTAL</source>
          , pages
          <fpage>4</fpage>
          -
          <lpage>15</lpage>
          , Gothenburg, Sweden,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lanchantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quinnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Seigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Swietojanski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Woodland</surname>
          </string-name>
          .
          <article-title>Automatic Transcription of Multi-genre Media Archives</article-title>
          .
          <source>In Proc. of SLAM Workshop</source>
          , pages
          <fpage>26</fpage>
          -
          <lpage>31</lpage>
          , Marseille, France,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deléglise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Estève</surname>
          </string-name>
          .
          <article-title>Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks</article-title>
          .
          <source>In Proc. of LREC</source>
          , pages
          <fpage>3935</fpage>
          -
          <lpage>3939</lpage>
          , Reykjavik, Iceland,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>