TokyoTech at MediaEval 2016 Multimodal Person Discovery in Broadcast TV Task

Fumito Nishi1, Nakamasa Inoue1, Koji Iwano2, Koichi Shinoda1
1 Tokyo Institute of Technology, Tokyo, Japan
2 Tokyo City University, Kanagawa, Japan
{nishi, inoue, shinoda}@ks.cs.titech.ac.jp, iwano@tcu.ac.jp

ABSTRACT
This paper describes our diarization system for the Multimodal Person Discovery in Broadcast TV task of the MediaEval 2016 benchmark evaluation campaign [1]. The goal of the task is to name speakers who appear and speak simultaneously in a video, without prior knowledge. Our system relies on a face diarization approach: we extract deep features from each detected face every 0.5 seconds, build visual i-vectors from them, cluster the i-vectors, and associate the clustering results with optical character recognition.

1. INTRODUCTION
The Multimodal Person Discovery in Broadcast TV task can be split into four subtasks: speaker diarization, face diarization, optical character recognition (OCR), and speech transcription. Among these, we focus on diarization using face identification. This year, we introduce i-vectors [2] computed over deep features extracted with FaceNet, one of the state-of-the-art neural networks for face recognition.

Figure 1 shows an overview of our method. First, we detect and track faces in a video. Second, deep features are extracted from the detected faces every 0.5 seconds. Third, an i-vector is computed from the deep features of each face track.

[Figure 1: Overview of the whole system]

2. APPROACH

2.1 Face diarization

2.1.1 Deep features
We employ FaceNet [3] to extract deep features. A deep feature is taken from the output layer of the network and can be used to measure similarity between faces. To obtain face regions in a video, we employ the face detection and tracking method of [4]; deep features are then extracted from the face regions every 0.5 seconds.

2.1.2 Visual i-vectors
After deep features are extracted, we build visual i-vectors. The i-vector is one of the state-of-the-art methods for speaker verification; here we apply it to the face-tracking segments. Let M be a Gaussian Mixture Model (GMM) super-vector, which concatenates the normalized mean vectors of a GMM estimated on a target video segment. An i-vector w is extracted from it by assuming that M is modeled as

    M = m + Tw,

where m is a face- and channel-independent super-vector and T is a low-rank matrix representing the total variability. The Expectation-Maximization (EM) algorithm is used to estimate the total variability, as proposed in [2]. Note that w is associated with a given video segment. The i-vector w_s for segment s is computed as

    w_s = (I + T^t Σ^{-1} N(s) T)^{-1} T^t Σ^{-1} F(s),

where N(s) and F(s) are the zero- and first-order Baum-Welch statistics on the Universal Background Model (UBM) for the current segment s, and Σ is the covariance matrix of the UBM. Each i-vector represents one face track.
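For concreteness, the closed-form extraction above can be sketched in a few lines of NumPy. This is a minimal sketch under simplifying assumptions we add for illustration (a diagonal UBM covariance, a pre-trained total-variability matrix, and zero-order statistics flattened to one value per feature dimension); it is not the ALIZE implementation used in the experiments, and all helper names are ours.

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N_s, F_s):
    """Closed-form i-vector for one segment s:
        w_s = (I + T^t Sigma^-1 N(s) T)^-1 T^t Sigma^-1 F(s)

    Shapes (C mixtures of D-dim features, rank R):
      T         : (C*D, R)  total-variability matrix (assumed pre-trained)
      Sigma_inv : (C*D,)    inverse diagonal of the UBM covariance
      N_s       : (C*D,)    zero-order Baum-Welch statistics, with each
                            mixture's occupancy count repeated D times
      F_s       : (C*D,)    centered first-order Baum-Welch statistics
    """
    R = T.shape[1]
    TtSinv = T.T * Sigma_inv                    # T^t Sigma^-1 (diagonal case)
    precision = np.eye(R) + (TtSinv * N_s) @ T  # I + T^t Sigma^-1 N(s) T
    return np.linalg.solve(precision, TtSinv @ F_s)

# Toy usage with the paper's sizes: C=32 mixtures, D=128-dim deep
# features, R=100-dimensional visual i-vectors.
rng = np.random.default_rng(0)
CD, R = 32 * 128, 100
T = 0.01 * rng.standard_normal((CD, R))
w = extract_ivector(T, np.ones(CD), np.full(CD, 5.0), rng.standard_normal(CD))
print(w.shape)  # (100,)
```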
2.1.3 Attaching person name tags
To attach a person's name tag to each face, we use the provided tags with time ranges obtained from optical character recognition (OCR). First, each OCR tag is attached to the face track with the maximum overlap in appearance time; this leaves us with tagged and untagged faces. Second, for each untagged face, we find the nearest tagged face, and if the distance between the two is below a predefined threshold, the untagged face receives the same tag. The distance between two faces is computed as

    D_ij = 1 − (w_i^t w_j) / (||w_i||_2 ||w_j||_2),

where w_i and w_j are the i-vectors of the tagged and untagged faces, respectively.
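The two-pass tagging scheme can be sketched as follows. This is our own illustrative rendering, not the submitted implementation: face tracks and OCR tags are plain tuples, i-vectors are NumPy arrays, and threshold corresponds to the tunable value discussed in Section 3.2.

```python
import numpy as np

def cosine_distance(wi, wj):
    """D_ij = 1 - (wi . wj) / (||wi||_2 ||wj||_2), as in Section 2.1.3."""
    return 1.0 - wi @ wj / (np.linalg.norm(wi) * np.linalg.norm(wj))

def overlap(a, b):
    """Temporal overlap in seconds of two (start, end) ranges."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def tag_faces(tracks, ocr_tags, threshold):
    """tracks   : list of ((start, end), ivector), one entry per face track
    ocr_tags : list of (name, (start, end)) from the provided OCR
    Returns one name (or None) per face track."""
    names = [None] * len(tracks)
    # Pass 1: attach each OCR tag to the track with maximum time overlap.
    for name, rng in ocr_tags:
        best = max(range(len(tracks)), key=lambda i: overlap(tracks[i][0], rng))
        if overlap(tracks[best][0], rng) > 0.0:
            names[best] = name
    # Pass 2: propagate tags to untagged tracks from the nearest tagged
    # track in i-vector space, if the distance is below the threshold.
    tagged = [i for i, n in enumerate(names) if n is not None]
    for i in range(len(tracks)):
        if names[i] is not None or not tagged:
            continue
        j = min(tagged, key=lambda k: cosine_distance(tracks[i][1], tracks[k][1]))
        if cosine_distance(tracks[i][1], tracks[j][1]) < threshold:
            names[i] = names[j]
    return names
```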
2.2 Speaker diarization
For speaker diarization, Bayesian Information Criterion (BIC) based segmentation with 12 MFCCs plus energy is applied to obtain audio segments. Music and jingle segments are removed by Viterbi decoding. Finally, i-vectors are computed for each segment and clustered with Integer Linear Programming [5, 6].

2.3 Multimodal fusion
We employ the name propagation technique proposed in [7]. Our multimodal fusion takes the intersection of the tags obtained from speaker diarization and face diarization.

3. EXPERIMENTS AND RESULTS

3.1 Experimental Settings
We use the dlib library [8] for face detection and tracking, and the OpenFace implementation [9] of FaceNet; the deep features are 128-dimensional. To extract visual i-vectors, we train a UBM with 32 Gaussian mixtures and the total-variability matrix on the development set using ALIZE [10]. The development set is the INA corpus used in MediaEval 2015, and we use the detected faces for training. The visual i-vectors are 100-dimensional. For speaker diarization, we use the LIUM Speaker Diarization system [11] with a 256-mixture UBM and 50-dimensional i-vectors. We used the provided OCR and fusion code to build our system and implemented all other components ourselves.

3.2 Experimental Results
Table 1 shows the Mean Average Precision (MAP) on the test set. Face Diarization was used for our submissions: the threshold of the primary submission was tuned on the development set, and the threshold of the contrastive submission was 0. The Speaker Diarization and Multimodal rows were evaluated with the evaluation tool and were not part of our submissions.

Table 1: Mean Average Precision (MAP) on the test set

Subset       Method                          MAP@1 [%]  MAP@10 [%]
Leaderboard  Face Diarization (Primary)        27.7       18.3
             Face Diarization (Contrastive)    23.5       11.9
             Speaker Diarization               13.4       11.3
             Multimodal                        22.7       15.0
Eval         Face Diarization (Primary)        31.5       20.0
             Face Diarization (Contrastive)    25.7       13.6
             Speaker Diarization               13.1       11.7
             Multimodal                        29.3       17.3

Face Diarization performs better than Speaker Diarization; it was effective for identifying speakers with short utterances. However, Multimodal is worse than Face Diarization. To improve the multimodal fusion system, we need to introduce multimodal features that capture the correlation between the audio and visual streams. Modeling the temporal relations between speakers is also needed to improve performance.

4. CONCLUSION
We presented a face diarization based system that uses visual i-vectors with FaceNet deep features. Developing better multimodal fusion methods and exploiting sequential information are left as future work.

5. REFERENCES
[1] Hervé Bredin, Camille Guinaudeau, and Claude Barras. Multimodal Person Discovery in Broadcast TV at MediaEval 2016. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[2] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 4, pp. 788–798, 2011.
[3] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886–893, 2005.
[5] Mickael Rouvier and Sylvain Meignier. A global optimization framework for speaker diarization. In Proc. of Odyssey, pp. 146–150, 2012.
[6] Grégor Dupuy, Sylvain Meignier, Paul Deléglise, and Yannick Estève. Recent improvements on ILP-based clustering for broadcast news speaker diarization. In Proc. of Odyssey, 2014.
[7] Johann Poignant, Hervé Bredin, Viet-Bac Le, Laurent Besacier, Claude Barras, and Georges Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Proc. of Interspeech, 2012.
[8] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, Vol. 10, pp. 1755–1758, 2009.
[9] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. Technical Report CMU-CS-16-118, CMU School of Computer Science, 2016.
[10] Jean-François Bonastre, Frédéric Wils, and Sylvain Meignier. ALIZE, a free toolkit for speaker recognition. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 737–740, 2005.
[11] Mickael Rouvier, Grégor Dupuy, Paul Gay, Elie Khoury, Teva Merlin, and Sylvain Meignier. An open-source state-of-the-art toolbox for broadcast news diarization. Technical report, Idiap, 2013.