PERCOLATTE: A Multimodal Person Discovery System in TV Broadcast for the MediaEval 2015 Evaluation Campaign

Meriem Bendris (1), Delphine Charlet (2), Gregory Senay (3), MinYoung Kim (3), Benoit Favre (1), Mickael Rouvier (1), Frederic Bechet (1), Géraldine Damnati (2)
(1) Aix Marseille Université, (2) OrangeLabs, (3) Panasonic Silicon Valley Lab

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
This paper describes the PERCOLATTE participation in the MediaEval 2015 task "Multimodal Person Discovery in Broadcast TV", which requires developing algorithms for unsupervised talking face identification in broadcast news. The proposed approach relies on two identity propagation strategies, both based on document chaptering and restricted propagation rules for overlaid names. The primary submission improves the Mean Average Precision of the baseline by about 10 points on the INA corpus.

Figure 1: The PERCOLATTE pipeline. Our modules are outlined in blue.

1. INTRODUCTION
Identifying people in TV broadcasts has received a lot of attention in the literature over the last decade. Current trends aim to combine traditional techniques with high-level information such as prior knowledge of the document structure. Indeed, TV programs often have a regular structure organized in homogeneous sequences. The REPERE Challenge, which ended in 2014, aimed at developing multimodal algorithms for people identification in TV broadcasts. Our PERCOLATOR system, based on scene understanding features, ranked first on the main task in 2014 [2].

The MediaEval "Multimodal Person Discovery in Broadcast TV" task focuses on unsupervised talking face identification [7] for search engine applications. One novelty of this task is the metadata made available by the organizers, which opens participation to a wider range of teams.

This paper describes the PERCOLATTE system submitted to MediaEval 2015. The system relies on the enrichment of broadcast news with video structure features such as shot classification (studio/report) and speaker role recognition. Two identification strategies were developed: the primary one is based on chapter-restricted identity propagation to shot clusters, and the secondary one is based on speaker identification and rule-based speaker-face mapping. Figure 1 shows the pipeline of the PERCOLATTE system. Note that no face-related processing (detection/identification) is used in our approach.

2. TOOLS
The MediaEval 2015 organizers made available different baseline mono-modal tools. In our system, we used the provided Overlaid Person Names (OPN) [6] system. In addition, we used the automatic named entities [4] and the speaking-face mapping to adjust the identification scores.

2.1 List of names
The Audiovisual National Institute (INA) collects broadcast news and enriches it with metadata such as summaries, identities of journalists, etc. We collected the metadata (available on http://www.ina.fr) from December 2004 to December 2009 and automatically extracted a list of journalists and anchors.

2.2 Overlaid anchor name detection
Anchor names were not detected by the provided OCR system. We developed an anchor name detector relying on a Levenshtein-based mapping between OCR results (https://github.com/meriembendris/ADNVideo, applied on ×2 rescaled frames) and the list of names described previously.

2.3 Speaker clustering
The speaker clustering follows the approach described in [1]. First, speech segments are grouped using BIC clustering. Then, the obtained clusters are modelled with GMMs in order to compare voices more accurately using the Cross-Likelihood Ratio (CLR) in a second agglomerative clustering. At each iteration, Viterbi decoding is performed to re-segment the speech data into speaker turns given the new clusters.

2.4 Speaker role classification
We used a simplified version of the speaker role classification approach described in [3]. First, the anchor is defined as the speaker cluster that speaks the most and most regularly. Then, a binary reporter/other classification is performed. As no speech transcript was available in this work, the classification relies only on an acoustic GMM classifier.

2.5 Speaker identification
Speaker turns are identified by propagating each OPN to the speaker turn that maximises temporal overlap with it, and to that turn's speaker cluster within the same chapter.

2.6 Shot boundary detection
Two systems were used, based on RGB histogram peaks [10] and on HSV histogram peaks over a sliding window (https://github.com/meriembendris/ADNVideo). As the evaluation script requires the provided shot segmentation, a mapping between detected and provided shot boundaries was necessary.

2.7 Shot similarity and clustering
In order to measure the similarity between shots, three features were extracted: RGB histograms, HOG features on resized frames (128×64) and a DNN-based frame representation (image embeddings). For the DNN-based features, we used the AlexNet DNN [5] to extract feature vectors at the 3rd fully-connected layer (1000-dimensional vectors). Then, shots were grouped using a cosine-based distance and Integer Linear Program clustering (described in [9]).

2.8 Shot classification and chaptering
The shot classifier is trained on external data (8 broadcast news shows, 4914 shots). Four labels were annotated: studio, report, mixed and other. First, HOG features on resized frames (128×64) were extracted for each shot. Then, a Liblinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) classifier was trained on three quarters of the data. The system reached 99.43% accuracy on the remaining quarter. Finally, successive shots sharing the same label were grouped into chapters.

3. TALKING FACE IDENTIFICATION
Participants were asked to provide identified talking faces within shots, together with confidence scores and evidence justifying their assertions. Two strategies were developed.

3.1 The primary strategy
The primary strategy relies on the fact that report chapters are independent in broadcast news. The strategy is based on a restricted propagation of OPNs to shot clusters within the same chapter. Precisely, we followed these rules:

• Propagate the OPN to overlapping shots and to their shot clusters sharing the speaker cluster within a chapter.
• Propagate the anchor name to overlapping "studio" shots and to their shot clusters, without chapter restrictions.
• Propagate the anchor name if the speaker role is anchor.

For each identified talking face, the score was initialized with the provided OPN score and incrementally increased when the following events occurred: the OPN overlaps the shot, the provided talking-face score exceeds 0.8, and the name is pronounced around the shot (±5 s).

3.2 The secondary strategy
The secondary strategy is based on speaker identification followed by rule-based speaker-face mapping. This mapping relies on simple rules based on prior knowledge about broadcast news. Precisely, we considered a speaker visible when the name appears on the screen (OPN), on studio shots, and on report shots when the role is not reporter. In this strategy, no scoring function was developed (score = 1).

3.3 The evidence
To ensure that identities were detected only in an unsupervised way, and to help the collaborative annotation of the test set, participants were asked to select one shot per name proving his/her identity. For each name, we selected the provided OPN shot that maximizes the OCR result score.

4. EVALUATION
Systems were evaluated using the Mean Average Precision (MAP) metric and the official C and EwMAP metrics described in [7]. Two submission deadlines were set: July 1st and July 8th. In our submissions, the only difference concerns the shot boundary mapping. For July 1st, the mapping was based on shots overlapping by more than 0.5 s (a rather crude strategy), while for July 8th it was based on an overlap coverage above 50%. Four runs were submitted:

• Primary: primary strategy with DNN- and HOG-based shot clustering.
• Primary DNNOnly: primary strategy with DNN-based shot clustering.
• Primary RGBOnly: primary strategy with RGB-based shot clustering.
• Secondary: secondary strategy based on speaker identification and rule-based speaker-face mapping.

Table 1 shows the results of the PERCOLATTE runs. The secondary strategy, whose principles are similar to those of the baseline [8], shows a MAP improvement of about 8 points. Indeed, chapter-restricted propagation, combined with simple rule-based speaker-face mapping based on shot classification and speaker roles, led to fewer talking faces being detected but with higher precision. The primary strategy using DNN- and HOG-based shot clustering obtains the best MAP, 88.45%. This shows the consistency of the chapter-constrained propagation strategy in broadcast news. Contrastive runs with different features for shot clustering did not show significant differences. Anchor names were detected in 93% of shows. However, the primary run without anchor-specific modules still reaches a MAP of 88.31%, only 0.14 points below the full primary run.

Run                            EwMAP   MAP     C
Baseline                       78.35   78.64   92.71
Secondary on July 1st          85.89   86.12   97.68
Secondary on July 8th          86.40   86.61   97.68
Primary DNNOnly on July 1st    81.41   81.67   97.63
Primary DNNOnly on July 8th    87.75   88.01   97.63
Primary RGBOnly on July 1st    81.02   81.28   97.63
Primary RGBOnly on July 8th    87.33   87.60   97.63
Primary on July 1st            81.70   81.96   97.63
Primary on July 8th            88.19   88.45   97.63

Table 1: Performance of the PERCOLATTE 2015 runs (all values in %).

5. CONCLUSIONS
In this paper, we described the PERCOLATTE strategies for talking face identification. The system, which uses no face-related processing, is based on chapter-restricted propagation of overlaid names. A significant improvement over the baseline is achieved on the INA corpus by the primary strategy (about +10 points of MAP). Results show that in structured programs, easy-to-establish features such as shot classification and prior knowledge about broadcast news significantly improve talking face identification.

Acknowledgment. This work has been carried out thanks to the support of the A*MIDEX project (no ANR-11-IDEX-0001-02) funded by the "Investissements d'Avenir" French Government program, managed by the French National Research Agency (ANR).

6. REFERENCES
[1] C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain. Multi-stage speaker diarization of broadcast news. IEEE Transactions on Audio, Speech and Language Processing, 2006.
[2] F. Bechet, M. Bendris, D. Charlet, G. Damnati, B. Favre, M. Rouvier, R. Auguste, B. Bigot, R. Dufour, C. Fredouille, G. Linares, G. Senay, P. Tirilly, and J. Martinet. Multimodal understanding for person recognition in video broadcasts. In Interspeech, Singapore, 2014.
[3] G. Damnati and D. Charlet. Multi-view approach for speaker turn role labeling in TV broadcast news shows. In Interspeech, pages 1285–1288. ISCA, 2011.
[4] M. Dinarelli and S. Rosset. Models cascade for tree-structured named entity detection. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1269–1278, Chiang Mai, Thailand, November 2011. Asian Federation of Natural Language Processing.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[6] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In IEEE International Conference on Multimedia and Expo (ICME), 2012.
[7] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. In MediaEval, 2015.
[8] J. Poignant, H. Bredin, V.-B. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Interspeech 2012, Conference of the International Speech Communication Association, Portland, OR, United States, 2012. Poster Session: Speaker Recognition III.
[9] M. Rouvier and S. Meignier. A global optimization framework for speaker diarization. In Speaker Odyssey, 2012.
[10] H. Zhang, R. Hu, and L. Song. A shot boundary detection method based on color feature. In International Conference on Computer Science and Network Technology (ICCSNT), 2011.
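
Illustrative sketch for Section 2.2 (overlaid anchor name detection). The paper only states that OCR results are mapped to the list of names with a Levenshtein-based criterion; the normalization by string length, the 0.25 threshold and the function names below are assumptions made for illustration, not the authors' implementation.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion, substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_anchor_name(ocr_text: str,
                      known_names: Iterable[str],
                      max_relative_distance: float = 0.25) -> Optional[str]:
    """Return the closest known name, or None if nothing is close enough.

    The relative threshold is a hypothetical value, not taken from the paper.
    """
    query = ocr_text.strip().lower()
    best_name, best_score = None, None
    for name in known_names:
        reference = name.strip().lower()
        if not reference:
            continue
        score = levenshtein(query, reference) / max(len(query), len(reference))
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    if best_score is not None and best_score <= max_relative_distance:
        return best_name
    return None
```

With a name list such as ["Marie Martin", "Jean Dupont"] (made-up names), an OCR string like "Jean Dup0nt" would be mapped to "Jean Dupont", while unrelated overlaid text would be rejected.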
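
Illustrative sketch for the shot boundary mapping of Sections 2.6 and 4. The July 8th submissions map detected shots to the provided shots when the overlap coverage exceeds 50%. The paper does not say which shot the coverage is measured against; the sketch below assumes it is the fraction of the provided (reference) shot that is covered, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlap(a: Interval, b: Interval) -> float:
    """Duration of the temporal intersection of two intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def map_shots(detected: List[Interval],
              reference: List[Interval],
              min_coverage: float = 0.5) -> Dict[int, int]:
    """Map each reference shot index to the best-covering detected shot index.

    A reference shot stays unmapped when no detected shot covers more than
    `min_coverage` of its duration (assumed definition of coverage).
    """
    mapping: Dict[int, int] = {}
    for ref_index, ref in enumerate(reference):
        ref_duration = ref[1] - ref[0]
        if ref_duration <= 0:
            continue
        best_index, best_coverage = None, 0.0
        for det_index, det in enumerate(detected):
            coverage = overlap(det, ref) / ref_duration
            if coverage > best_coverage:
                best_index, best_coverage = det_index, coverage
        if best_index is not None and best_coverage > min_coverage:
            mapping[ref_index] = best_index
    return mapping
```

Compared with a fixed 0.5 s overlap threshold, a relative criterion of this kind avoids attaching a long detected shot to a provided shot it barely touches, which is consistent with the gain observed between the two deadlines in Table 1.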
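
Illustrative sketch for Sections 2.8 and 3.1 (chaptering and chapter-restricted propagation). It groups successive shots sharing a label into chapters, then applies only the first propagation rule of the primary strategy: an overlaid name is propagated to shots of the same chapter that share both the shot cluster and the speaker cluster of the shot where the name appears. The anchor rules and the scoring are left out, and the data layout and names are assumptions.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Set


@dataclass
class Shot:
    label: str            # "studio", "report", "mixed" or "other"
    shot_cluster: int     # visual cluster id (Section 2.7)
    speaker_cluster: int  # dominant speaker cluster id (Section 2.3)


def chapter_ids(shots: List[Shot]) -> List[int]:
    """Assign one chapter id per run of successive shots with the same label."""
    ids: List[int] = []
    for chapter, (_, run) in enumerate(groupby(shots, key=lambda s: s.label)):
        ids.extend(chapter for _ in run)
    return ids


def propagate_opn(shots: List[Shot],
                  opn_by_shot: Dict[int, str]) -> Dict[int, Set[str]]:
    """Propagate overlaid names inside chapters (first rule of Section 3.1)."""
    chapters = chapter_ids(shots)
    names: Dict[int, Set[str]] = {}
    for source, name in opn_by_shot.items():
        for index, shot in enumerate(shots):
            same_chapter = chapters[index] == chapters[source]
            same_visual = shot.shot_cluster == shots[source].shot_cluster
            same_speaker = shot.speaker_cluster == shots[source].speaker_cluster
            if same_chapter and same_visual and same_speaker:
                names.setdefault(index, set()).add(name)
    return names
```

In this sketch a name never leaves the chapter in which it was overlaid, even if a visually similar shot with the same speaker cluster appears later in the show; this is the restriction that the paper credits for the higher precision.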