<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Person Discovery in Broadcast TV at MediaEval 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hervé Bredin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claude Barras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camille Guinaudeau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay</institution>
          ,
          <addr-line>F-91405 Orsay</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2016 benchmarking initiative. Participants are asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people is not known a priori and their names have to be discovered in an unsupervised way from the media content, using text overlay or speech transcripts for the primary runs. The task is evaluated using information retrieval metrics, based on an a posteriori collaborative annotation of the test corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>TV archives maintained by national institutions such as
the French INA, the Netherlands Institute for Sound &amp;
Vision, or the British Broadcasting Corporation are rapidly
growing in size. The need for applications that make these
archives searchable has led researchers to devote concerted
effort to developing technologies that create indexes.</p>
      <p>Indexes that represent the location and identity of people
in the archive are indispensable for searching archives.
Human nature leads people to be very interested in other
people. However, when the content is created or broadcast,
it is not always possible to predict which people will be the
most important to find in the future, and biometric models
may not yet be available at indexing time. The goal of this
task is thus to address the challenge of indexing people in
the archive under real-world conditions, i.e. when there is
no pre-set list of people to index.</p>
      <p>
        Started in 2011, the REPERE challenge aimed at
supporting research on multimodal person recognition [
        <xref ref-type="bibr" rid="ref16 ref3">3, 16</xref>
        ].
Its main goal was to answer the two questions "who speaks
when?" and "who appears when?" using any available source
of information (including pre-existing biometric models and
person names extracted from text overlay and speech
transcripts). Thanks to this challenge and the associated
multimodal corpus [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], significant progress was achieved in
both supervised and unsupervised multimodal person
recognition [
        <xref ref-type="bibr" rid="ref1 ref11 ref12 ref17 ref2 ref20 ref21 ref22 ref24 ref4 ref5 ref6 ref7">1, 2, 4, 5, 6, 7, 11, 12, 17, 20, 21, 22, 24</xref>
        ]. After the end
of the REPERE challenge in 2014, the first edition of the
"Multimodal Person Discovery in Broadcast TV" task was
organized in 2015 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This year's task is a follow-up of last
year's edition.
      </p>
    </sec>
    <sec id="sec-2">
      <title>DEFINITION OF THE TASK</title>
      <p>Participants are provided with a collection of TV
broadcast recordings pre-segmented into shots. Each shot s ∈ S
has to be automatically tagged with the names of people
both speaking and appearing at the same time during the
shot.</p>
      <p>As last year, the list of persons is not provided a priori,
and person biometric models (neither voice nor face) cannot
be trained on external data in the primary runs. The only
way to identify a person is by finding their name n ∈ 𝒩 in
the audio (e.g., using speech transcription, ASR) or visual
(e.g., using optical character recognition, OCR) streams
and associating it with the correct person. This makes the
task completely unsupervised (i.e. using algorithms not
relying on pre-existing labels or biometric models). The main
novelty of this year's task is that participants may use their
contrastive runs to try brave new ideas that may rely on any
external data, including textual metadata provided with the
test set.</p>
      <p>Because person names are detected and transcribed
automatically, they may contain transcription errors to a certain
extent (more on that later in Section 5). In the following, we
denote by 𝒩 the set of all possible person names in the
universe, correctly formatted as firstname_lastname, while
N is the set of hypothesized names.</p>
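      <p>Since hypothesized names must follow the firstname_lastname
convention, submissions typically include a normalization step. The
sketch below is a minimal Python example of such a step; the exact
normalization rules are not prescribed by the task, so this helper
(including its handling of accents) is only an assumption:</p>
      <preformat>
import re
import unicodedata

def normalize_name(raw: str) -> str:
    """Hypothetical helper: format a detected name as firstname_lastname."""
    # Strip accents so that accented characters map to plain ASCII.
    ascii_name = unicodedata.normalize("NFKD", raw)
    ascii_name = ascii_name.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then collapse every non-alphanumeric run into one underscore.
    tokens = re.split(r"[^a-z0-9]+", ascii_name.lower())
    return "_".join(t for t in tokens if t)

assert normalize_name("Hervé Bredin") == "herve_bredin"
      </preformat>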
      <p>[Figure 1: example of task input and output over four shots
involving two persons, Mr A and Mrs B. Legend: speech transcript,
text overlay, speaking face, evidence; shots #1 to #4.]</p>
    </sec>
    <sec id="sec-3">
      <title>DATASETS</title>
      <p>The 2015 test corpus serves as development set for this
year's task. It contains 106 hours of video, corresponding to
172 editions of the evening news broadcast "Le 20 heures" of the
French public channel "France 2", from January 1st, 2007 to
June 30th, 2007. This development set is associated with a
posteriori annotations based on last year's participant
submissions.</p>
      <p>
        The test set is divided into three datasets: INA, DW and
3-24. The INA dataset contains a full week of broadcast
for 3 TV channels and 3 radio channels in French. Only a
subset (made of 2 TV channels, for a total duration
of 90 hours) needs to be processed, but participants
may process the rest of it if they think it might lead to
improved results. Moreover, this dataset comes with
manual metadata provided by INA as CSV files.
The DW dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is composed of videos downloaded from
the Deutsche Welle website, in English and German, for a total
duration of 50 hours. This dataset is also associated with
metadata that can be used in contrastive runs. The last
dataset contains 13 hours of broadcast from the Catalan
TV news channel 3/24.
      </p>
      <p>As the test set comes completely free of any annotation,
it will be annotated a posteriori based on participants'
submissions. In order to ease this annotation process,
participants are asked to justify their assertions. To this end,
each hypothesized name n ∈ N has to be backed up by a
carefully selected and unique shot proving that the
person actually holds this name n: we call this an evidence.
In real-world conditions, this evidence would help a human
annotator double-check the automatically-generated index,
even for people they did not know beforehand.</p>
      <p>Two types of evidence are allowed: an image evidence is a
time in a video when a person is visible while his/her name
is written on screen; an audio evidence is a time when the
name of a person is pronounced, provided that this person is
visible within a [time - 5s; time + 5s] neighborhood. For instance,
in Figure 1, shot #1 contains an image evidence for Mr A
(because his name and his face are visible simultaneously on
screen) while shot #3 contains an audio evidence for Mrs B
(because her name is pronounced less than 5 seconds before
or after her face is visible on screen).</p>
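      <p>A minimal Python sketch of the audio-evidence rule defined
above. The interval-based input format is an assumption made for
illustration; the 5-second margin is the one from the definition:</p>
      <preformat>
def is_audio_evidence(name_time, face_intervals, margin=5.0):
    """Return True if a name pronounced at name_time (in seconds) counts
    as audio evidence, i.e. the person's face is visible somewhere within
    the [name_time - margin, name_time + margin] neighborhood.

    face_intervals: hypothetical list of (start, end) times, in seconds,
    during which the person's face is visible on screen.
    """
    return any(start - margin &lt;= name_time &lt;= end + margin
               for start, end in face_intervals)

# Shot #3 in Figure 1: Mrs B's name is pronounced shortly before her face
# appears, so the following (illustrative) timings qualify as evidence.
assert is_audio_evidence(42.0, [(45.0, 49.0)])
      </preformat>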
    </sec>
    <sec id="sec-4">
      <title>BASELINE AND METADATA</title>
      <p>This task targets researchers from several communities,
including multimedia, computer vision, speech and natural
language processing. Though the task is multimodal by
design and necessitates expertise in various domains, the
technological barrier to entry is lowered by the provision of a
baseline system, partially available as open-source software.</p>
      <p>
        For instance, a researcher from the speech processing
community can focus their research efforts on improving speaker
diarization and automatic speech transcription, while still
being able to rely on the provided face detection and tracking
results to participate in the task. Figure 2 summarizes the
available modules.
      </p>
      <p>
        Face tracking-by-detection is applied within each shot
using a detector based on histograms of oriented gradients [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and the correlation tracker proposed by Danelljan et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Each face track is then described by its average FaceNet
embedding and compared with all the others using the Euclidean
distance [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Finally, average-link hierarchical
agglomerative clustering is applied. Source code for this module is
available in pyannote-video (http://pyannote.github.io).
      </p>
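      <p>A minimal sketch of this clustering step, assuming
precomputed 128-dimensional average FaceNet embeddings and using
SciPy in place of the actual pyannote-video implementation (the
distance threshold is a tunable assumption):</p>
      <preformat>
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_face_tracks(track_embeddings, threshold=1.0):
    """Group face tracks by identity with average-link agglomerative
    clustering over Euclidean distances between average embeddings."""
    # Build the average-link dendrogram from the embedding matrix ...
    Z = linkage(track_embeddings, method="average", metric="euclidean")
    # ... then cut it at the chosen distance threshold to obtain labels.
    return fcluster(Z, t=threshold, criterion="distance")

# Usage with random placeholder embeddings, one row per face track:
labels = cluster_face_tracks(np.random.rand(10, 128))
      </preformat>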
      <p>
        Optical character recognition followed by name detection
is contributed by IDIAP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and UPC. UPC detection was
performed using LOOV [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Text results were then filtered
using first and last names gathered from the Internet and a
hand-crafted list of negative words. Due to the large
diversity of the test corpus, optical character recognition results
are much noisier than the ones provided in 2015.
      </p>
      <p>
        Three variants of the name propagation technique
proposed in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] are provided. Baseline 1 tags each speaker
cluster with the most co-occurring written name. Baseline 2 tags
each face cluster with the most co-occurring written name.
Baseline 3 is the temporal intersection of both. These
fusion techniques are available as open-source software
(http://github.com/MediaEvalPersonDiscoveryTask).
      </p>
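      <p>The sketch below illustrates Baselines 1 and 2, which share the
same logic over different clusterings. The pair-based input format is
a hypothetical simplification of the actual baseline code:</p>
      <preformat>
from collections import Counter

def tag_clusters(cluster_of, cooccurrences):
    """Tag each (speaker or face) cluster with the written name that
    co-occurs with it most often.

    cluster_of: dict mapping a track id to its cluster id.
    cooccurrences: list of (track_id, written_name) pairs, one per
        co-occurrence of a written name with a track.
    """
    counts = {}  # cluster id -> Counter of co-occurring written names
    for track, name in cooccurrences:
        counts.setdefault(cluster_of[track], Counter())[name] += 1
    return {cluster: counter.most_common(1)[0][0]
            for cluster, counter in counts.items()}
      </preformat>
      <p>Baseline 3 then keeps, for each shot, only the names on which
the speaker-based and face-based taggings temporally agree.</p>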
    </sec>
    <sec id="sec-5">
      <title>EVALUATION METRIC</title>
      <p>Because of the limited resources dedicated to collaborative
annotation, the test set cannot be fully annotated. Therefore,
the task is evaluated indirectly as an information retrieval
task, using the following principle.</p>
      <p>For each query q ∈ Q ⊂ 𝒩 (firstname_lastname),
returned shots are first sorted by the edit distance between the
hypothesized person name and the query q, and then by
confidence scores. Average precision AP(q) is then computed
classically based on the list of relevant shots (according to
the groundtruth) and the sorted list of shots. Finally, Mean
Average Precision is computed as follows:</p>
      <p>MAP = (1 / |Q|) ∑_{q ∈ Q} AP(q)</p>
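      <p>A Python sketch of this scoring procedure. The per-query
submission format (shot, hypothesized name, confidence) and the use
of the editdistance package are assumptions for illustration; the
official scorer may differ in details such as tie-breaking:</p>
      <preformat>
import editdistance  # assumption: any Levenshtein implementation works

def average_precision(ranked_shots, relevant):
    """Classical average precision of a ranked shot list for one query."""
    hits, score = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(queries, submissions, groundtruth):
    """MAP = (1/|Q|) * sum of AP(q) over all queries q."""
    total = 0.0
    for q in queries:
        # Sort by edit distance to the query, then by decreasing confidence.
        ranked = sorted(submissions[q],
                        key=lambda x: (editdistance.eval(x[1], q), -x[2]))
        total += average_precision([shot for shot, _, _ in ranked],
                                   groundtruth[q])
    return total / len(queries)
      </preformat>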
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>This work was supported by the French National Agency
for Research under grants ANR-12-CHRI-0006-01 and
ANR-14-CE24-0024. The open-source CAMOMILE collaborative
annotation platform (http://github.com/camomile-project) was
used extensively throughout the
progress of the task: from the run submission script to the
automated leaderboard, including a posteriori collaborative
annotation of the test corpus. The task builds on Johann
Poignant's involvement in the 2015 task organization. Xavier
Trimolet helped design and develop the 2016 annotation
interface. We also thank INA, LIUM, UPC and IDIAP for
providing datasets and baseline modules.</p>
    </sec>
    <sec id="sec-7">
      <title>Video processing</title>
      <p>
        Face tracking-by-detection is applied within each shot
using a detector based on histogram of oriented gradients [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
1http://pyannote.github.io
2http://github.com/MediaEvalPersonDiscoveryTask
3http://github.com/camomile-project
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bechet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Linares,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , G. Senay, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirilly</surname>
          </string-name>
          .
          <article-title>Multimodal Understanding for Person Recognition in Video Broadcasts</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          , G. Damnati,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Senay</surname>
          </string-name>
          .
          <article-title>Unsupervised Face Identification in TV Content using Audio-Visual Sources</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          .
          <article-title>The First Official REPERE Evaluation</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          , V.
          <string-name>
            <surname>-B. Le</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Rosset</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Person Instance Graphs for Named Speaker Identification in TV Broadcast</article-title>
          . In Odyssey,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          .
          <article-title>Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast</article-title>
          . In INTERSPEECH,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          , G. Fortier,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tapaswi</surname>
          </string-name>
          , V.
          <string-name>
            <surname>-B. Le</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Barras</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Rosset</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mignon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Verbeek</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Besacier</surname>
            , G. Quenot,
            <given-names>H. K.</given-names>
          </string-name>
          <string-name>
            <surname>Ekenel</surname>
            , and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Stiefelhagen</surname>
          </string-name>
          . QCompere at REPERE 2013. In SLAM-INTERSPEECH,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          , V.
          <string-name>
            <surname>-B. Le</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast</article-title>
          .
          <source>In IJMIR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <surname>J.-M. Odobez</surname>
          </string-name>
          .
          <article-title>Video text recognition using sequential monte carlo and error voting methods</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>26</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1386</fpage>
          -
          <lpage>1403</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <article-title>Histograms of Oriented Gradients for Human Detection</article-title>
          .
          <source>In IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>886</fpage>
          -
          <lpage>893</lpage>
          , June
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Danelljan</surname>
          </string-name>
          , G. Hager, F. Shahbaz
          <string-name>
            <surname>Khan</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Felsberg</surname>
          </string-name>
          .
          <article-title>Accurate Scale Estimation for Robust Visual Tracking</article-title>
          .
          <source>In Proceedings of the British Machine Vision Conference</source>
          . BMVA Press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bechet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ayache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Delteil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Linares,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , G. Senay, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirilly</surname>
          </string-name>
          .
          <article-title>PERCOLI: a person identification system for the 2013 REPERE challenge</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          , G. Dupuy,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lailler</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Odobez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Meignier</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Deleglise</surname>
          </string-name>
          .
          <article-title>Comparison of Two Methods for Unsupervised Person Identification in TV Shows</article-title>
          . In CBMI,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Giraudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mapelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Quintard</surname>
          </string-name>
          .
          <article-title>The REPERE Corpus : a Multimodal Corpus for Person Recognition</article-title>
          .
          <source>In LREC</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Grivolla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Badia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cabulea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Herder</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Odobez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Preuss</surname>
            , and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Marin</surname>
          </string-name>
          .
          <article-title>EUMSSI: a Platform for Multimodal Analysis and Recommendation using UIMA</article-title>
          .
          <source>In International Conference on Computational Linguistics (Coling)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deleglise</surname>
          </string-name>
          , G. Boulianne,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          .
          <article-title>CRIM and LIUM approaches for multi-genre broadcast media transcription</article-title>
          .
          <source>In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          , pages
          <fpage>681</fpage>
          -
          <lpage>686</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Quintard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giraudel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <article-title>A presentation of the REPERE challenge</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised Speaker Identification in TV Broadcast Based on Written Names</article-title>
          . IEEE/ACM ASLP,
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          .
          <source>In ICME</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal Person Discovery in Broadcast TV at MediaEval 2015</article-title>
          .
          <source>In MediaEval</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Towards a better integration of written names for unsupervised speakers identification in videos</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification using overlaid texts in TV broadcast</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Naming multi-modal clusters to identify persons in TV broadcast</article-title>
          .
          <source>MTAP</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          , G. Dupuy,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          , E. Khoury,
          <string-name>
            <given-names>T.</given-names>
            <surname>Merlin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          .
          <article-title>An open-source state-of-the-art toolbox for broadcast news diarization</article-title>
          . In Interspeech, Lyon (France),
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          Aug.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          .
          <article-title>Scene understanding for identifying persons in TV shows: beyond face authentication</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Philbin</surname>
          </string-name>
          .
          <article-title>FaceNet: a Unified Embedding for Face Recognition and Clustering</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>