Multimodal Person Discovery in Broadcast TV at MediaEval 2015

Johann Poignant, Hervé Bredin, Claude Barras
LIMSI - CNRS - Rue John Von Neumann, Orsay, France.
firstname.lastname@limsi.fr

ABSTRACT

We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2015 benchmarking initiative. Participants were asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people was not known a priori and their names had to be discovered in an unsupervised way from the media content, using text overlay or speech transcripts. The task was evaluated using information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.

1. MOTIVATION

TV archives maintained by national institutions such as the French INA, the Netherlands Institute for Sound & Vision, or the British Broadcasting Corporation are rapidly growing in size. The need for applications that make these archives searchable has led researchers to devote concerted effort to developing technologies that create indexes.

Indexes that represent the location and identity of people in the archive are indispensable for searching it. Human nature leads people to be very interested in other people. However, when content is created or broadcast, it is not always possible to predict which people will be the most important to find in the future. For this reason, it cannot be assumed that biometric models will always be available at indexing time. For some people, such a model may not be available in advance, simply because they are not (yet) famous. In such cases, it is also possible that archivists annotating content by hand do not even know the name of the person. The goal of this task is to address the challenge of indexing people in the archive under real-world conditions, i.e. when there is no pre-set list of people to index.

Canseco et al. [8, 9] pioneered approaches relying on pronounced names instead of biometric models for speaker identification [13, 19, 22, 30]. However, due to relatively high speech transcription and named entity detection error rates, these audio-only approaches did not achieve good enough identification performance. Similarly, for face recognition, initial visual-only approaches based on overlaid title box transcriptions were very dependent on the quality of overlaid name transcription [18, 29, 32, 33].

Started in 2011, the REPERE challenge aimed at supporting research on multimodal person recognition [3, 20] to overcome the limitations of monomodal approaches. Its main goal was to answer the two questions "who speaks when?" and "who appears when?" using any available source of information (including pre-existing biometric models and person names extracted from text overlay and speech transcripts). To assess technology progress, annual evaluations were organized in 2012, 2013 and 2014. Thanks to this challenge and the associated multimodal corpus [16], significant progress was achieved in both supervised and unsupervised multimodal person recognition [1, 2, 4, 5, 6, 7, 14, 15, 23, 25, 26, 27, 28]. The REPERE challenge came to an end in 2014 and this task can be seen as a follow-up campaign, with a strong focus on unsupervised person recognition.

2. DEFINITION OF THE TASK

Participants were provided with a collection of TV broadcast recordings pre-segmented into shots. Each shot s ∈ S had to be automatically tagged with the names of the people both speaking and appearing at the same time during the shot: this tagging algorithm is denoted by L : S → P(N) in the rest of the paper. The main novelty of the task is that the list of persons was not provided a priori, and person biometric models (neither voice nor face) could not be trained on external data. The only way to identify a person was to find their name in the audio stream (e.g. using speech transcription – ASR) or in the visual stream (e.g. using optical character recognition – OCR) and to associate it with the correct person. This made the task completely unsupervised (i.e. using algorithms not relying on pre-existing labels or biometric models).

Because person names were detected and transcribed automatically, they could contain transcription errors to a certain extent (more on that in Section 5). In the following, we denote by 𝒩 the set of all possible person names in the universe, correctly formatted as firstname_lastname, while N is the set of hypothesized names.

Figure 1: For each shot, participants had to return the names of every speaking face. Each name had to be backed up by an evidence.
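To make the expected output concrete, a run can be viewed as an implementation of the tagging function L : S → P(N): a mapping from shots to hypothesized names, each with a confidence score used later for ranking. The snippet below is only a minimal sketch; the shot identifiers, names and confidence convention are hypothetical and do not correspond to the official submission format of the task.

```python
# Minimal sketch of a run, i.e. the tagging function L : S -> P(N).
# Shot identifiers, names and the per-tag confidence score are
# hypothetical; this is NOT the official submission format of the task.
from typing import Dict

# shot id -> {hypothesized firstname_lastname -> confidence in [0, 1]}
Run = Dict[str, Dict[str, float]]

run: Run = {
    "video_x.shot_0001": {"john_doe": 0.92},
    "video_x.shot_0002": {},  # nobody both speaks and appears in this shot
    "video_x.shot_0003": {"john_doe": 0.61, "jane_doe": 0.57},
}
```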
To ensure that participants followed this strict "no biometric supervision" constraint, each hypothesized name n ∈ N had to be backed up by a carefully selected and unique shot proving that the person actually holds this name n: we call this an evidence and denote it by E : N → S. In real-world conditions, this evidence would help a human annotator double-check the automatically generated index, even for people they did not know beforehand.

Two types of evidence were allowed: an image evidence is a shot during which a person is visible while their name is written on screen; an audio evidence is a shot during which a person is visible while their name is pronounced at least once in a [shot start time − 5s, shot end time + 5s] neighborhood. For instance, in Figure 1, shot #1 is an image evidence for Mr A (because his name and his face are visible simultaneously on screen) while shot #3 is an audio evidence for Mrs B (because her name is pronounced less than 5 seconds before or after her face is visible on screen).
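The two evidence rules can be made concrete with a minimal sketch, assuming hypothetical Shot and NameOccurrence structures built from the OCR and ASR outputs; only the person-visibility requirement and the ±5 second neighborhood come from the task definition above.

```python
# Minimal sketch of the two evidence rules of Section 2, assuming
# hypothetical data structures for shots and time-stamped name occurrences.
from dataclasses import dataclass
from typing import List

@dataclass
class Shot:
    start: float          # shot start time (seconds)
    end: float            # shot end time (seconds)
    visible_person: bool  # the hypothesized person is visible in this shot

@dataclass
class NameOccurrence:
    name: str    # normalized as firstname_lastname
    start: float # start time of the written / pronounced name (seconds)
    end: float   # end time (seconds)

def is_image_evidence(shot: Shot, name: str, written: List[NameOccurrence]) -> bool:
    """The person is visible while their name is written on screen."""
    return shot.visible_person and any(
        o.name == name and o.start < shot.end and o.end > shot.start
        for o in written
    )

def is_audio_evidence(shot: Shot, name: str, pronounced: List[NameOccurrence]) -> bool:
    """The person is visible and their name is pronounced at least once
    within a [shot start - 5s, shot end + 5s] neighborhood."""
    return shot.visible_person and any(
        o.name == name and o.start < shot.end + 5.0 and o.end > shot.start - 5.0
        for o in pronounced
    )
```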
3. DATASETS

The REPERE corpus – distributed by ELDA – served as development set. It is composed of various TV shows (news, politics and people) from two French TV channels, for a total of 137 hours. A subset of 50 hours is manually annotated. Audio annotations are dense and provide speech transcripts and identity-labeled speech turns. Video annotations are sparse (one image every 10 seconds) and provide overlaid text transcripts and identity-labeled face segmentation. Both speech and overlaid text transcripts are tagged with named entities.

The test set – distributed by INA – contains 106 hours of video, corresponding to 172 editions of the evening broadcast news "Le 20 heures" of the French public channel "France 2", from January 1st 2007 to June 30th 2007.

As the test set came completely free of any annotation, it was annotated a posteriori based on participants' submissions. In the following, task groundtruths are denoted by the function 𝓛 : S → P(𝒩) that maps each shot s to the set of names of every speaking face it contains, and the function 𝓔 : S → P(𝒩) that maps each shot s to the set of person names for which it actually is an evidence.

4. BASELINE AND METADATA

This task targeted researchers from several communities including multimedia, computer vision, speech and natural language processing. Though the task was multimodal by design and necessitated expertise in various domains, the technological barrier to entry was lowered by the provision of a baseline system, described in Figure 2 and available as open-source software (http://github.com/MediaEvalPersonDiscoveryTask). For instance, a researcher from the speech processing community could focus their research efforts on improving speaker diarization and automatic speech transcription, while still relying on the provided face detection and tracking results to participate in the task.

Figure 2: Multimodal baseline pipeline. Output of greyed-out modules is provided to the participants.

The audio stream was segmented into speech turns, while faces were detected and tracked in the visual stream. Speech turns (resp. face tracks) were then compared and clustered based on MFCC features and the Bayesian Information Criterion [10] (resp. HOG features [11] and Logistic Discriminant Metric Learning [17] on facial landmarks [31]). The approach proposed in [27] was also used to compute a probabilistic mapping between co-occurring faces and speech turns. Written (resp. pronounced) person names were automatically extracted from the visual stream (resp. the audio stream) using the open-source LOOV optical character recognition system [24] (resp. automatic speech recognition [21, 12]), followed by named entity detection. The fusion module was a two-step algorithm: propagation of written names onto speaker clusters [26], followed by propagation of speaker names onto co-occurring speaking faces.
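As a rough illustration of this two-step fusion, the sketch below propagates written names onto speaker clusters by simple co-occurrence voting, then tags each shot with the names of the speaker clusters associated with its speaking faces. The data structures, the voting scheme and the assumption that speaking faces have already been mapped to speaker clusters are simplifications introduced here; the actual open-source baseline and the method of [26] are more elaborate.

```python
# Schematic sketch of the two-step fusion:
#  1. propagate written names onto speaker clusters,
#  2. propagate speaker-cluster names onto shots via their speaking faces.
# Data structures and the majority-vote scheme are illustrative assumptions.
from collections import Counter, defaultdict
from typing import Dict, List, Set

def name_speaker_clusters(
    written_names_per_turn: Dict[str, List[str]],  # speech turn id -> names written during the turn
    cluster_of_turn: Dict[str, str],                # speech turn id -> speaker cluster id
) -> Dict[str, str]:
    """Step 1: assign to each speaker cluster the written name that
    co-occurs most often with its speech turns (majority vote)."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for turn, names in written_names_per_turn.items():
        votes[cluster_of_turn[turn]].update(names)
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items() if counter}

def tag_shots(
    speaking_face_clusters_per_shot: Dict[str, Set[str]],  # shot id -> speaker clusters of its speaking faces
    cluster_name: Dict[str, str],                           # output of step 1
) -> Dict[str, Set[str]]:
    """Step 2: tag each shot with the names of the speaker clusters
    associated with its speaking faces (unnamed clusters are skipped)."""
    return {
        shot: {cluster_name[c] for c in clusters if c in cluster_name}
        for shot, clusters in speaking_face_clusters_per_shot.items()
    }
```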
5. EVALUATION METRIC

This information retrieval task was evaluated using a variant of Mean Average Precision (MAP) that took the quality of the evidences into account. For each query q ∈ Q ⊂ 𝒩 (firstname_lastname), the hypothesized person name n_q with the highest Levenshtein ratio ρ to the query q is selected (ρ : 𝒩 × N → [0, 1]), allowing approximate name transcription:

    n_q = argmax_{n ∈ N} ρ(q, n)   and   ρ_q = ρ(q, n_q)

Average precision AP(q) is then computed classically based on relevant and returned shots:

    relevant(q) = {s ∈ S | q ∈ 𝓛(s)}
    returned(q) = {s ∈ S | n_q ∈ L(s)}, sorted by confidence

A proposed evidence is correct if the name n_q is close enough to the query q and if the shot E(n_q) actually is an evidence for q:

    C(q) = 1 if ρ_q > 0.95 and q ∈ 𝓔(E(n_q)),  0 otherwise

To ensure that participants provide correct evidences for every hypothesized name n ∈ N, standard MAP is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:

    EwMAP = (1 / |Q|) Σ_{q ∈ Q} C(q) · AP(q)
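The metric can be made concrete with a short sketch. The data structures, the tie-breaking and the exact normalization of the Levenshtein ratio ρ are assumptions introduced here for illustration (the official scorer may define them differently); only the 0.95 threshold, the classical AP computation over relevant(q) and the C(q)·AP(q) weighting follow the definitions above.

```python
# Minimal sketch of EwMAP, assuming hypothetical data structures.
# rho() uses one plausible normalization of the Levenshtein distance;
# the official scorer may normalize or break ties differently.
from typing import List, Set

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def rho(a: str, b: str) -> float:
    """Name similarity in [0, 1] derived from the edit distance."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def average_precision(returned: List[str], relevant: Set[str]) -> float:
    """Classical AP over a ranked list of shots."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, shot in enumerate(returned, 1):
        if shot in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant)

def ewmap(queries, hypotheses, evidence, gt_names, gt_evidence) -> float:
    """queries: list of firstname_lastname queries (Q)
    hypotheses: hypothesized name -> list of (shot, confidence) tags
    evidence:   hypothesized name -> evidence shot proposed by the participant
    gt_names:   shot -> set of true speaking-face names (groundtruth L)
    gt_evidence: shot -> set of names it truly is an evidence for (groundtruth E)"""
    total = 0.0
    for q in queries:
        if not hypotheses:
            continue  # no hypothesized name at all: AP(q) counts as 0
        n_q = max(hypotheses, key=lambda n: rho(q, n))
        rho_q = rho(q, n_q)
        relevant = {s for s, names in gt_names.items() if q in names}
        returned = [s for s, _ in sorted(hypotheses[n_q], key=lambda x: -x[1])]
        correct = 1.0 if rho_q > 0.95 and q in gt_evidence.get(evidence.get(n_q, ""), set()) else 0.0
        total += correct * average_precision(returned, relevant)
    return total / len(queries) if queries else 0.0
```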
Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-CHRI-0006-01. The open-source CAMOMILE collaborative annotation platform (http://github.com/camomile-project) was used extensively throughout the progress of the task: from the run submission script to the automated leaderboard, including the a posteriori collaborative annotation of the test corpus. We thank ELDA and INA for supporting the task by distributing the development and test datasets.

6. REFERENCES

[1] F. Bechet, M. Bendris, D. Charlet, G. Damnati, B. Favre, M. Rouvier, R. Auguste, B. Bigot, R. Dufour, C. Fredouille, G. Linarès, J. Martinet, G. Senay, and P. Tirilly. Multimodal Understanding for Person Recognition in Video Broadcasts. In INTERSPEECH, 2014.
[2] M. Bendris, B. Favre, D. Charlet, G. Damnati, R. Auguste, J. Martinet, and G. Senay. Unsupervised Face Identification in TV Content using Audio-Visual Sources. In CBMI, 2013.
[3] G. Bernard, O. Galibert, and J. Kahn. The First Official REPERE Evaluation. In SLAM-INTERSPEECH, 2013.
[4] H. Bredin, A. Laurent, A. Sarkar, V.-B. Le, S. Rosset, and C. Barras. Person Instance Graphs for Named Speaker Identification in TV Broadcast. In Odyssey, 2014.
[5] H. Bredin and J. Poignant. Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. In INTERSPEECH, 2013.
[6] H. Bredin, J. Poignant, G. Fortier, M. Tapaswi, V.-B. Le, A. Sarkar, C. Barras, S. Rosset, A. Roy, Q. Yang, H. Gao, A. Mignon, J. Verbeek, L. Besacier, G. Quénot, H. K. Ekenel, and R. Stiefelhagen. QCompere at REPERE 2013. In SLAM-INTERSPEECH, 2013.
[7] H. Bredin, A. Roy, V.-B. Le, and C. Barras. Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast. IJMIR, 2014.
[8] L. Canseco, L. Lamel, and J.-L. Gauvain. A Comparative Study Using Manual and Automatic Transcriptions for Diarization. In ASRU, 2005.
[9] L. Canseco-Rodriguez, L. Lamel, and J.-L. Gauvain. Speaker diarization from speech transcripts. In INTERSPEECH, 2004.
[10] S. Chen and P. Gopalakrishnan. Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion. In DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[11] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[12] M. Dinarelli and S. Rosset. Models Cascade for Tree-Structured Named Entity Detection. In IJCNLP, 2011.
[13] Y. Estève, S. Meignier, P. Deléglise, and J. Mauclair. Extracting true speaker identities from transcriptions. In INTERSPEECH, 2007.
[14] B. Favre, G. Damnati, F. Béchet, M. Bendris, D. Charlet, R. Auguste, S. Ayache, B. Bigot, A. Delteil, R. Dufour, C. Fredouille, G. Linares, J. Martinet, G. Senay, and P. Tirilly. PERCOLI: a person identification system for the 2013 REPERE challenge. In SLAM-INTERSPEECH, 2013.
[15] P. Gay, G. Dupuy, C. Lailler, J.-M. Odobez, S. Meignier, and P. Deléglise. Comparison of Two Methods for Unsupervised Person Identification in TV Shows. In CBMI, 2014.
[16] A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard. The REPERE Corpus: a Multimodal Corpus for Person Recognition. In LREC, 2012.
[17] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Face recognition from caption-based supervision. IJCV, 96(1), 2012.
[18] R. Houghton. Named Faces: Putting Names to Faces. IEEE Intelligent Systems, 14, 1999.
[19] V. Jousse, S. Petit-Renaud, S. Meignier, Y. Estève, and C. Jacquin. Automatic named identification of speakers using diarization and ASR systems. In ICASSP, 2009.
[20] J. Kahn, O. Galibert, L. Quintard, M. Carré, A. Giraudel, and P. Joly. A presentation of the REPERE challenge. In CBMI, 2012.
[21] L. Lamel, S. Courcinous, J. Despres, J. Gauvain, Y. Josse, K. Kilgour, F. Kraft, V.-B. Le, H. Ney, M. Nussbaum-Thom, I. Oparin, T. Schlippe, R. Schlüter, T. Schultz, T. F. da Silva, S. Stüker, M. Sundermeyer, B. Vieru, N. Vu, A. Waibel, and C. Woehrling. Speech Recognition for Machine Translation in Quaero. In IWSLT, 2011.
[22] J. Mauclair, S. Meignier, and Y. Estève. Speaker diarization: about whom the speaker is talking? In Odyssey, 2006.
[23] J. Poignant, L. Besacier, and G. Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE/ACM ASLP, 23(1), 2015.
[24] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In ICME, 2012.
[25] J. Poignant, H. Bredin, L. Besacier, G. Quénot, and C. Barras. Towards a better integration of written names for unsupervised speakers identification in videos. In SLAM-INTERSPEECH, 2013.
[26] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In INTERSPEECH, 2012.
[27] J. Poignant, G. Fortier, L. Besacier, and G. Quénot. Naming multi-modal clusters to identify persons in TV broadcast. MTAP, 2015.
[28] M. Rouvier, B. Favre, M. Bendris, D. Charlet, and G. Damnati. Scene understanding for identifying persons in TV shows: beyond face authentication. In CBMI, 2014.
[29] S. Satoh, Y. Nakamura, and T. Kanade. Name-It: Naming and Detecting Faces in News Videos. IEEE Multimedia, 6, 1999.
[30] S. E. Tranter. Who really spoke when? Finding speaker turns and identities in broadcast news audio. In ICASSP, 2006.
[31] M. Uřičář, V. Franc, and V. Hlaváč. Detector of facial landmarks learned by the structured output SVM. In VISAPP, volume 1, 2012.
[32] J. Yang and A. G. Hauptmann. Naming every individual in news video monologues. In ACM Multimedia, 2004.
[33] J. Yang, R. Yan, and A. G. Hauptmann. Multiple instance learning for labeling faces in broadcasting news video. In ACM Multimedia, 2005.