Multimodal Person Discovery in Broadcast TV at MediaEval 2016

Hervé Bredin, Claude Barras, Camille Guinaudeau
LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay, F-91405 Orsay, France.
firstname.lastname@limsi.fr

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2016 benchmarking initiative. Participants are asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people is not known a priori and their names have to be discovered in an unsupervised way from the media content, using text overlay or speech transcripts for the primary runs. The task is evaluated using information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.

1. MOTIVATION
TV archives maintained by national institutions such as the French INA, the Netherlands Institute for Sound & Vision, or the British Broadcasting Corporation are rapidly growing in size. The need for applications that make these archives searchable has led researchers to devote concerted effort to developing technologies that create indexes.

Indexes that represent the location and identity of people in the archive are indispensable for searching archives. Human nature leads people to be very interested in other people. However, when the content is created or broadcast, it is not always possible to predict which people will be the most important to find in the future, and biometric models may not yet be available at indexing time. The goal of this task is thus to address the challenge of indexing people in the archive under real-world conditions, i.e. when there is no pre-set list of people to index.

Started in 2011, the REPERE challenge aimed at supporting research on multimodal person recognition [3, 16]. Its main goal was to answer the two questions "who speaks when?" and "who appears when?" using any available source of information (including pre-existing biometric models and person names extracted from text overlay and speech transcripts). Thanks to this challenge and the associated multimodal corpus [13], significant progress was achieved in both supervised and unsupervised multimodal person recognition [1, 2, 4, 5, 6, 7, 11, 12, 17, 20, 21, 22, 24]. After the end of the REPERE challenge in 2014, the first edition of the "Multimodal Person Discovery in Broadcast TV" task was organized in 2015 [19]. This year's task is a follow-up of last year's edition.

2. DEFINITION OF THE TASK
Participants are provided with a collection of TV broadcast recordings pre-segmented into shots. Each shot s ∈ S has to be automatically tagged with the names of people both speaking and appearing at the same time during the shot.

Figure 1: For each shot, participants have to return the names of every speaking face. Each name has to be backed up by an evidence.

As last year, the list of persons is not provided a priori, and person biometric models (neither voice nor face) cannot be trained on external data in the primary runs. The only way to identify a person is by finding their name n ∈ N in the audio stream (e.g., using automatic speech transcription, ASR) or in the visual stream (e.g., using optical character recognition, OCR) and associating it with the correct person. This makes the task completely unsupervised (i.e. using algorithms that do not rely on pre-existing labels or biometric models). The main novelty of this year's task is that participants may use their contrastive runs to try brave new ideas that may rely on any external data, including the textual metadata provided with the test set.

Because person names are detected and transcribed automatically, they may contain transcription errors to a certain extent (more on that in Section 5). In the following, we denote by 𝒩 the set of all possible person names in the universe, correctly formatted as firstname_lastname, while N is the set of hypothesized names.
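To make the expected naming convention concrete, the following is a minimal sketch of how a raw name detected by OCR or ASR could be mapped to the firstname_lastname format; the function name and the exact normalization rules (lowercasing, accent stripping) are assumptions for illustration, not the official task preprocessing.

```python
import re
import unicodedata

def normalize_person_name(raw_name):
    """Map a raw detected name to the firstname_lastname convention.
    The rules below (accent stripping, lowercasing) are illustrative
    assumptions, not the official task preprocessing."""
    # Strip diacritics so that e.g. "Hervé" becomes "Herve".
    ascii_name = unicodedata.normalize("NFKD", raw_name).encode("ascii", "ignore").decode()
    # Lowercase, split on whitespace and join tokens with underscores.
    tokens = [t for t in re.split(r"\s+", ascii_name.strip().lower()) if t]
    return "_".join(tokens)

print(normalize_person_name("Hervé Bredin"))  # -> herve_bredin
```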
3. DATASETS
The 2015 test corpus serves as development set for this year's task. It contains 106 hours of video, corresponding to 172 editions of the evening broadcast news "Le 20 heures" of the French public channel "France 2", from January 1st, 2007 to June 30th, 2007. This development set is associated with a posteriori annotations based on last year's participants' submissions.

The test set is divided into three datasets: INA, DW and 3/24. The INA dataset contains a full week of broadcast for 3 TV channels and 3 radio channels in French. Only a subset (made of 2 TV channels for a total duration of 90 hours) needs to be processed. However, participants can process the rest of it if they think it might lead to improved results. Moreover, this dataset is associated with manual metadata provided by INA in the shape of CSV files. The DW dataset [14] is composed of videos downloaded from the Deutsche Welle website, in English and German, for a total duration of 50 hours. This dataset is also associated with metadata that can be used in contrastive runs. The last dataset contains 13 hours of broadcast from the 3/24 Catalan TV news channel.

As the test set comes completely free of any annotation, it will be annotated a posteriori based on participants' submissions. In order to ease this annotation process, participants are asked to justify their assertions. To this end, each hypothesized name n ∈ N has to be backed up by a carefully selected and unique shot proving that the person actually holds this name n: we call this an evidence. In real-world conditions, this evidence would help a human annotator double-check the automatically generated index, even for people they did not know beforehand.

Two types of evidence are allowed: an image evidence is a time in a video when a person is visible while his/her name is written on screen; an audio evidence is a time when the name of a person is pronounced, provided that this person is visible in a [time−5s, time+5s] neighborhood. For instance, in Figure 1, shot #1 contains an image evidence for Mr A (because his name and his face are visible simultaneously on screen) while shot #3 contains an audio evidence for Mrs B (because her name is pronounced less than 5 seconds before or after her face is visible on screen).
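As an illustration of the audio-evidence rule above, here is a minimal sketch of the ±5-second check; the data structures (a pronunciation timestamp in seconds and a list of (start, end) intervals during which the face is visible) are assumptions, and the official annotation tooling may implement this differently.

```python
def is_audio_evidence(name_time, visible_intervals, window=5.0):
    """Return True if the person whose name is pronounced at `name_time`
    (in seconds) is visible somewhere in [name_time - window, name_time + window].
    `visible_intervals` is an assumed list of (start, end) pairs in seconds."""
    lo, hi = name_time - window, name_time + window
    return any(start <= hi and end >= lo for start, end in visible_intervals)

# Toy example in the spirit of Figure 1: "Mrs B" is pronounced at t=120s
# and her face is visible from 123s to 130s, i.e. within 5 seconds.
print(is_audio_evidence(120.0, [(123.0, 130.0)]))  # True
```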
4. BASELINE AND METADATA
This task targets researchers from several communities including multimedia, computer vision, speech and natural language processing. Though the task is multimodal by design and necessitates expertise in various domains, the technological barrier to entry is lowered by the provision of a baseline system, partially available as open-source software. For instance, a researcher from the speech processing community can focus their research efforts on improving speaker diarization and automatic speech transcription, while still being able to rely on the provided face detection and tracking results to participate in the task. Figure 2 summarizes the available modules.

Figure 2: Multimodal baseline pipeline.

4.1 Video processing
Face tracking-by-detection is applied within each shot using a detector based on histograms of oriented gradients [9] and the correlation tracker proposed by Danelljan et al. [10]. Each face track is then described by its average FaceNet embedding [25] and compared with all the others using the Euclidean distance. Finally, average-link hierarchical agglomerative clustering is applied. Source code for this module is available in pyannote-video (http://pyannote.github.io).

Optical character recognition followed by name detection is contributed by IDIAP [8] and UPC. UPC detection was performed using LOOV [18]. Text results were then filtered using first and last names gathered from the internet and a hand-crafted list of negative words. Due to the large diversity of the test corpus, optical character recognition results are much noisier than the ones provided in 2015.

4.2 Audio processing
Speaker diarization and speech transcription for French, German and English are contributed by LIUM [23, 15]. Pronounced person names are automatically extracted from the audio stream using a large list of names gathered from the Wikipedia website.
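The face clustering step of the video pipeline can be sketched as follows. This is only an illustration of average-link agglomerative clustering over hypothetical FaceNet track embeddings using scipy, not the actual pyannote-video implementation, and the distance threshold is an arbitrary placeholder.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_face_tracks(track_embeddings, threshold=1.0):
    """Group face tracks by identity.
    `track_embeddings` is assumed to be an (n_tracks, 128) array whose rows are
    the average FaceNet embeddings of each face track; `threshold` is an
    arbitrary placeholder, not the value tuned for the official baseline."""
    distances = pdist(track_embeddings, metric="euclidean")  # condensed pairwise distances
    dendrogram = linkage(distances, method="average")        # average-link agglomerative clustering
    return fcluster(dendrogram, t=threshold, criterion="distance")  # one cluster label per track

# Toy usage with 4 random "embeddings": returns an array of 4 cluster labels.
print(cluster_face_tracks(np.random.rand(4, 128)))
```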
4.3 Multimodal fusion baseline
Three variants of the name propagation technique proposed in [21] are provided. Baseline 1 tags each speaker cluster with the most co-occurring written name. Baseline 2 tags each face cluster with the most co-occurring written name. Baseline 3 is the temporal intersection of both. These fusion techniques are available as open-source software (http://github.com/MediaEvalPersonDiscoveryTask).

5. EVALUATION METRIC
Because of the limited resources dedicated to collaborative annotation, the test set cannot be fully annotated. Therefore, the task is evaluated indirectly as an information retrieval task, using the following principle.

For each query q ∈ Q ⊂ 𝒩 (firstname_lastname), returned shots are first sorted by the edit distance between the hypothesized person name and the query q, and then by confidence scores. Average precision AP(q) is then computed classically, based on the list of relevant shots (according to the groundtruth) and the sorted list of shots. Finally, Mean Average Precision is computed as:

MAP = (1 / |Q|) · Σ_{q ∈ Q} AP(q)
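To make this protocol concrete, here is a minimal sketch of the per-query ranking and the MAP computation; the input structure (one list per query of (hypothesized_name, confidence, is_relevant) tuples), the use of plain Levenshtein distance, and the AP normalization are assumptions, and the official scoring tool may differ in its details.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def average_precision(ranked_relevance):
    """AP over a ranked list of booleans (True = relevant shot)."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, 1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(results):
    """`results` maps each query q to a list of returned shots, each shot being a
    (hypothesized_name, confidence, is_relevant) tuple -- an assumed structure.
    Shots are sorted by edit distance to the query, then by decreasing confidence."""
    ap = []
    for query, shots in results.items():
        ranked = sorted(shots, key=lambda s: (edit_distance(s[0], query), -s[1]))
        ap.append(average_precision([is_rel for _, _, is_rel in ranked]))
    return sum(ap) / len(ap) if ap else 0.0

# Toy example with a single query and three returned shots.
print(mean_average_precision({
    "herve_bredin": [("herve_bredin", 0.9, True),
                     ("herve_breden", 0.8, True),
                     ("claude_barras", 0.7, False)],
}))
```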
Acknowledgment
This work was supported by the French National Agency for Research under grants ANR-12-CHRI-0006-01 and ANR-14-CE24-0024. The open-source CAMOMILE collaborative annotation platform (http://github.com/camomile-project) was used extensively throughout the progress of the task: from the run submission script to the automated leaderboard, including the a posteriori collaborative annotation of the test corpus. The task builds on Johann Poignant's involvement in the 2015 task organization. Xavier Trimolet helped design and develop the 2016 annotation interface. We also thank INA, LIUM, UPC and IDIAP for providing datasets and baseline modules.

6. REFERENCES
[1] F. Bechet, M. Bendris, D. Charlet, G. Damnati, B. Favre, M. Rouvier, R. Auguste, B. Bigot, R. Dufour, C. Fredouille, G. Linarès, J. Martinet, G. Senay, and P. Tirilly. Multimodal Understanding for Person Recognition in Video Broadcasts. In INTERSPEECH, 2014.
[2] M. Bendris, B. Favre, D. Charlet, G. Damnati, R. Auguste, J. Martinet, and G. Senay. Unsupervised Face Identification in TV Content using Audio-Visual Sources. In CBMI, 2013.
[3] G. Bernard, O. Galibert, and J. Kahn. The First Official REPERE Evaluation. In SLAM-INTERSPEECH, 2013.
[4] H. Bredin, A. Laurent, A. Sarkar, V.-B. Le, S. Rosset, and C. Barras. Person Instance Graphs for Named Speaker Identification in TV Broadcast. In Odyssey, 2014.
[5] H. Bredin and J. Poignant. Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. In INTERSPEECH, 2013.
[6] H. Bredin, J. Poignant, G. Fortier, M. Tapaswi, V.-B. Le, A. Sarkar, C. Barras, S. Rosset, A. Roy, Q. Yang, H. Gao, A. Mignon, J. Verbeek, L. Besacier, G. Quénot, H. K. Ekenel, and R. Stiefelhagen. QCompere at REPERE 2013. In SLAM-INTERSPEECH, 2013.
[7] H. Bredin, A. Roy, V.-B. Le, and C. Barras. Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast. IJMIR, 2014.
[8] D. Chen and J.-M. Odobez. Video text recognition using sequential monte carlo and error voting methods. Pattern Recognition Letters, 26(9):1386–1403, 2005.
[9] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 886–893, June 2005.
[10] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg. Accurate Scale Estimation for Robust Visual Tracking. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[11] B. Favre, G. Damnati, F. Béchet, M. Bendris, D. Charlet, R. Auguste, S. Ayache, B. Bigot, A. Delteil, R. Dufour, C. Fredouille, G. Linares, J. Martinet, G. Senay, and P. Tirilly. PERCOLI: a person identification system for the 2013 REPERE challenge. In SLAM-INTERSPEECH, 2013.
[12] P. Gay, G. Dupuy, C. Lailler, J.-M. Odobez, S. Meignier, and P. Deléglise. Comparison of Two Methods for Unsupervised Person Identification in TV Shows. In CBMI, 2014.
[13] A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard. The REPERE Corpus: a Multimodal Corpus for Person Recognition. In LREC, 2012.
[14] J. Grivolla, M. Melero, T. Badia, C. Cabulea, Y. Esteve, E. Herder, J.-M. Odobez, S. Preuss, and R. Marin. EUMSSI: a Platform for Multimodal Analysis and Recommendation using UIMA. In International Conference on Computational Linguistics (Coling), 2014.
[15] V. Gupta, P. Deléglise, G. Boulianne, Y. Estève, S. Meignier, and A. Rousseau. CRIM and LIUM approaches for multi-genre broadcast media transcription. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 681–686. IEEE, 2015.
[16] J. Kahn, O. Galibert, L. Quintard, M. Carré, A. Giraudel, and P. Joly. A presentation of the REPERE challenge. In CBMI, 2012.
[17] J. Poignant, L. Besacier, and G. Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE/ACM ASLP, 23(1), 2015.
[18] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In ICME, 2012.
[19] J. Poignant, H. Bredin, and C. Barras. Multimodal Person Discovery in Broadcast TV at MediaEval 2015. In MediaEval 2015, 2015.
[20] J. Poignant, H. Bredin, L. Besacier, G. Quénot, and C. Barras. Towards a better integration of written names for unsupervised speakers identification in videos. In SLAM-INTERSPEECH, 2013.
[21] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In INTERSPEECH, 2012.
[22] J. Poignant, G. Fortier, L. Besacier, and G. Quénot. Naming multi-modal clusters to identify persons in TV broadcast. MTAP, 2015.
[23] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin, and S. Meignier. An open-source state-of-the-art toolbox for broadcast news diarization. In INTERSPEECH, Lyon (France), 25-29 Aug. 2013.
[24] M. Rouvier, B. Favre, M. Bendris, D. Charlet, and G. Damnati. Scene understanding for identifying persons in TV shows: beyond face authentication. In CBMI, 2014.
[25] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: a Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.