EUMSSI team at the MediaEval Person Discovery Challenge

Nam Le1,2, Di Wu1, Sylvain Meignier3, Jean-Marc Odobez1,2
1 Idiap Research Institute, Martigny, Switzerland
2 École Polytechnique Fédérale de Lausanne, Switzerland
3 LIUM, University of Maine, Le Mans, France
{nle, dwu, odobez}@idiap.ch, sylvain.meignier@univ-lemans.fr

ABSTRACT
We present the results of the EUMSSI team’s participation in the Multimodal Person Discovery task at the MediaEval challenge 2015. The goal is to identify all people who simultaneously appear and speak in a video corpus, which implicitly involves both the audio and the visual streams. We focus on improving each modality separately and benchmarking them to analyze their pros and cons.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

1. INTRODUCTION
Nowadays, viewers, journalists, and archivists have access to a vast amount of multimedia data. The need for browsing and retrieval tools for these archives has led researchers to devote effort to developing technologies that create searchable indices [14]. In this view, as humans are very interested in other people while consuming multimedia content, algorithms that index the identities of people and retrieve their respective quotations are indispensable for searching archives. This practical need leads to research problems on how to identify people's presence in videos and answer ’who appears when?’ or ’who speaks when?’.

In particular, the goal of the MediaEval Person Discovery task is the following. Given the raw TV broadcasts, each shot must be automatically tagged with the name(s) of people who can be both seen and heard in the shot. The list of people is not known a priori, and their names must be discovered in an unsupervised way from video text overlay or speech transcripts. This situation corresponds to cases where, at the moment content is created or broadcast, some of the appearing people are relatively unknown but may later become a trending topic on social networks or search engines. In addition, to ensure high quality indexes, algorithms should also help human annotators double-check these indexes by providing evidence of the claimed identity (especially for people who are not yet famous).

2. PROPOSED SYSTEM
The participation of the EUMSSI team aimed to enable the assessment of the different modules developed by the authors in the past [11, 7, 8, 17, 4]. In this view, starting from the baseline provided by the organizer, the goal was to replace baseline components with the team’s components, whenever they had been made compatible and their processing speed was sufficient to handle the data provided in the challenge, and to test their performance to understand their advantages. The system, as illustrated in Fig. 1, consists of 2 main stages. The first stage detects and clusters speakers, faces and overlaid person names, including extracting Named Entities (NE). The second stage associates speakers with faces using co-occurrence statistics, and the overlaid person names are propagated to the speakers, or faces, in order to give the identities of the persons in the show.

Figure 1: Architecture of proposed system

2.1 Speaker diarization
The speaker diarization system (“who speaks when?”) is based on the LIUM Speaker Diarization system [16], which is freely distributed1. This system has achieved the best or second best results in the speaker diarization task of the REPERE French broadcast evaluation campaigns 2012 and 2013 [6].

1 www-lium.univ-lemans.fr/en/content/liumspkdiarization

The diarization system first applies an acoustic Bayesian Information Criterion (BIC)-based segmentation followed by a BIC-based hierarchical clustering. Each cluster represents a speaker and is modeled with a full covariance Gaussian. A Viterbi decoding re-segments the signal using GMMs with 8 diagonal components learned by EM-ML, one for each cluster. Segmentation, clustering and decoding are performed with 12 MFCC+E, computed with a 10 ms frame rate. Music and jingle regions are removed using a Viterbi decoding with 8 GMMs (trained on French broadcast news data) for music, jingle, silence, and speech (with wide/narrow band variants for the last two, and clean, noised or musical background variants for wideband speech).

In the above steps, features were used unnormalized in order to preserve information on the background environment, which may help differentiate between speakers. At this point, however, each cluster contains the voice of only one speaker, but several clusters can be related to the same speaker. The background environment contribution must be removed from each GMM cluster through feature gaussianization. Finally, the system is completed with a clustering method based on the i-vectors paradigm and Integer Linear Programming (ILP). This new clustering method is fully described in [17] and [4]. The ILP clustering along with i-vector speaker models gives better results than the usual hierarchical agglomerative clustering based on GMMs and cross-likelihood distances [1].

2.2 Face diarization
Given the video shots, the face diarization process consists of (i) face detection, detecting faces appearing within each shot, (ii) face tracking, extending detections into continuous tracks within each shot, and (iii) face clustering, grouping all tracks with the same identity into clusters.

Face detection. Detecting faces in broadcast media can be challenging due to the wide range of media content. Faces can appear in widely different situations with varied illumination and noise, such as in a studio, during live coverage, or during a political debate. To overcome these challenges, we employ a deformable part-based model (DPM) [5, 12], which can detect faces at multiple poses and variations. Because the main disadvantage of DPM is its long running time, the face detector is applied only 2 times per second.

Face tracking. The goal of this step is to create continuous face tracks within one video shot, which raises the need for associating individual detections. Because of the long gaps between detected faces, we exploit long-term connectivity using CRF-based multi-target tracking [10]. This framework relies on the unsupervised learning of time-sensitive association costs for different features. First, similarities between detections are computed based on low-level features (color histogram, position, motion, SURF keypoint descriptors) which can be computed quickly. Then, for each feature type, the corresponding pairwise factor of the CRF is defined as the probability of the similarity measurements between pairs of detections under two distinct hypotheses: that they correspond to the same label or not. By optimizing a graph labeling posterior, we assign the same label to detections belonging to the same face, and different labels to different faces.

Face clustering. Given the face tracks across all video shots, we hierarchically merge face tracks using matching and biometric similarity measures [11]. The matching cluster similarity is calculated as the average of the distances between sparse keypoints of two clusters. Meanwhile, the biometric model-based similarity measures how likely densely extracted features from one cluster are to belong to the model of the other cluster, as compared to the likelihood given by the statistical model, and vice-versa. Face tracks are first clustered using only feature-based matching, yielding clusters with sufficient data to adapt the biometric models. Then, model-based similarity is combined with matching similarity to merge clusters until the stopping criteria are met. Similarly to speaker diarization, face diarization produces face segments during which distinct identities appear.

2.3 Person Naming
Identity candidate retrieval. Overlaid person names (OPNs) can be more reliably extracted using Optical Character Recognition (OCR) techniques [2, 13] than from automatic speech transcripts. Therefore, we only exploit named entities detected from OCR by [3] as potential identity candidates.

Direct one-to-one tagging. As mentioned earlier, our goal is to benchmark improvements of each modality in the system. Hence, we assume that the temporal clusters produced by the diarization processes are trustworthy. In this work, we use a simple one-to-one naming method provided by [15], which finds the mapping between clusters and named entities that maximizes the co-occurrences between them.

3. EXPERIMENTS
We evaluated 3 methods: SpkDia, FaceDia, and SpkFace. First, in SpkDia (our primary submission for the challenge), we apply naming based on audio information only (this is equivalent to the assumption that all speakers which are associated with a name are visible and speaking). Second, in FaceDia, we apply naming based on visual information only, and assume that all visible faces (which are associated with a name) are talking. Third, in SpkFace, we apply naming based on audio information only, but check whether visible faces exist during the speech segments (if not, the segment is discarded). Because our approaches are monomodal and fully unsupervised, we did not use the information provided by the leaderboard to improve performance.

The results using the challenge performance measures are reported in Tab. 1 for the REPERE test 2 data [9], used as the initial development data, and in Tab. 2 for the challenge testing part of the INA dataset. SpkDia is the most robust and performs best on the test set even without any face information, which might be explained by two points. First, there is usually only one speaker at a time, and not much noise in the challenge data. Meanwhile, face diarization can be difficult due to multiple faces, facial variation, missed detections, etc. Hence, speech clusters tend to be more reliable than face clusters. Second, when a speaker is not visible, it is often the anchor of the show, who is counted as one query, equally to those appearing for a short duration. Therefore, SpkDia is not penalized much by the visibility of speakers. We can observe this effect more clearly in the last column of Tab. 2, which shows the number of person presences with names predicted by each scheme. Using faces to filter out 1/3 of the speech segments does not help to increase precision because these segments correspond to a small number of repetitive speakers. Also, though face diarization gives only 1/3 of the possible names, these names are precise person-wise. This interesting fact may provide an outlook on combining the 2 modalities.

Table 1: Results on REPERE test 2 (dev set)
Method     EwMAP   MAP     C       #(2485)
Baseline   49.98   50.32   58.75   617
SpkDia     65.31   66.70   72.50   2817
FaceDia    66.38   67.98   71.67   1691

Table 2: Results on INA (test set)
Method     EwMAP   MAP     C       #(21963)
Baseline   78.35   78.64   92.71   12066
FaceDia    83.04   83.33   90.77   7237
SpkDia∗    89.75   90.14   97.05   30583
SpkFace    89.53   89.90   96.52   20601
∗ Primary submission

4. FUTURE WORKS
We have presented our system for the MediaEval challenge. The test results serve as our basis for improving each component. We are working on speeding up the tracking process as well as investigating alternative face representations such as total variability modeling. On the other hand, the current system has not taken full advantage of both audio and visual streams, which we plan to address in the future.

5. REFERENCES
[1] C. Barras, X. Zhu, S. Meignier, and J. Gauvain. Multi-stage speaker diarization of broadcast news. 14(5):1505–1512, Feb. 2006.
[2] D. Chen and J.-M. Odobez. Video text recognition using sequential Monte Carlo and error voting methods. Pattern Recognition Letters, 26(9):1386–1403, 2005.
[3] M. Dinarelli and S. Rosset. Models cascade for tree-structured named entity detection. In IJCNLP, pages 1269–1278, 2011.
[4] G. Dupuy, S. Meignier, P. Deléglise, and Y. Estève. Recent improvements towards ILP-based clustering for broadcast news speaker diarization. 2014.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[6] O. Galibert and J. Kahn. The first official REPERE evaluation. In Interspeech satellite workshop on Speech, Language and Audio in Multimedia (SLAM), Marseille, France, 2013.
[7] P. Gay, E. Khoury, S. Meignier, J.-M. Odobez, and P. Deleglise. A Conditional Random Field approach for Audio-Visual people diarization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), 2014.
[8] P. Gay, E. Khoury, S. Meignier, J.-M. Odobez, and P. Deleglise. Face identification from overlaid texts using Local Face Recurrent Patterns and CRF models. In IEEE International Conference on Image Processing (ICIP), 2014.
[9] A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard. The REPERE corpus: a multimodal corpus for person recognition. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).
[10] A. Heili, A. Lopez-Mendez, and J.-M. Odobez. Exploiting long-term connectivity and visual motion in CRF-based multi-person tracking. IEEE Transactions on Image Processing, 23(7):3040–3056, 2014.
[11] E. Khoury, P. Gay, and J.-M. Odobez. Fusing matching and biometric similarity measures for face diarization in video. In Proceedings of the 3rd ACM conference on International conference on multimedia retrieval, pages 97–104. ACM, 2013.
[12] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, pages 720–735. Springer, 2014.
[13] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In 2012 IEEE International Conference on Multimedia and Expo (ICME), pages 854–859. IEEE, 2012.
[14] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. 2015.
[15] J. Poignant, H. Bredin, V.-B. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Interspeech, page 4p, 2012.
[16] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin, and S. Meignier. An open-source state-of-the-art toolbox for broadcast news diarization. In Interspeech, Lyon (France), 25-29 Aug. 2013.
[17] M. Rouvier and S. Meignier. A global optimization framework for speaker diarization. In Odyssey Workshop, Singapore, 2012.
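
As a supplementary illustration of the direct one-to-one tagging described in Section 2.3, the following minimal sketch searches for the mapping between diarization clusters and OCR names that maximizes the total temporal co-occurrence. It is not the implementation of [15]: the function names (`overlap`, `cooccurrence`, `one_to_one_naming`), the data layout (lists of `(start, end)` intervals in seconds), the cluster labels and the placeholder names `name_A` and `name_B` are all assumptions made for illustration, and the exhaustive search is only practical for a handful of clusters.

```python
from itertools import permutations


def overlap(a, b):
    """Duration shared by two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def cooccurrence(cluster_segments, name_segments):
    """Total overlap between a cluster's segments and a name's on-screen intervals."""
    return sum(overlap(s, t) for s in cluster_segments for t in name_segments)


def one_to_one_naming(clusters, names):
    """Return the cluster -> name mapping with maximal total co-occurrence.

    Each name is given to at most one cluster and each cluster receives at
    most one name. Exhaustive search: an illustrative assumption, not the
    optimisation used in the cited work.
    """
    cluster_ids = sorted(clusters)
    # Pad the candidate names with None so that any cluster may stay unnamed.
    slots = sorted(names) + [None] * len(cluster_ids)
    best_score, best_mapping = -1.0, {}
    for assignment in permutations(slots, len(cluster_ids)):
        pairs = [(c, n) for c, n in zip(cluster_ids, assignment) if n is not None]
        scores = [cooccurrence(clusters[c], names[n]) for c, n in pairs]
        total = sum(scores)
        if total > best_score:
            best_score = total
            # Keep only the pairs that actually overlap in time.
            best_mapping = {c: n for (c, n), s in zip(pairs, scores) if s > 0}
    return best_mapping


if __name__ == "__main__":
    # Hypothetical speaker clusters and OCR name intervals.
    clusters = {
        "spk0": [(0, 10), (30, 40)],
        "spk1": [(10, 30)],
        "spk2": [(40, 50)],
    }
    names = {"name_A": [(2, 6)], "name_B": [(12, 18), (31, 33)]}
    print(one_to_one_naming(clusters, names))
```

In this toy input, `spk2` never overlaps an on-screen name and therefore stays unnamed. For realistic numbers of clusters, a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the exhaustive loop.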