GTM-UVigo Systems for Person Discovery Task at MediaEval 2015

Paula Lopez-Otero, Rosalía Barros, Laura Docio-Fernandez, Elisardo González-Agulla, José Luis Alba-Castro, Carmen Garcia-Mateo
AtlantTIC Research Center
{plopez,rbarros,ldocio,eli,jalba,carmen}@gts.uvigo.es

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
In this paper, we present the systems developed by the GTM-UVigo team for the Multimedia Person Discovery in Broadcast TV task at MediaEval 2015. The systems propose two different strategies for person discovery in audio through speaker diarization (one based on an online clustering strategy with error correction using OCR information, and the other based on agglomerative hierarchical clustering), as well as intrashot and intershot strategies for face clustering.

1. INTRODUCTION
The Person Discovery in Broadcast TV task at MediaEval 2015 aims at finding the names of people who are both seen and heard in every shot of a collection of videos [10]. This paper describes the audio, video and multimodal approaches developed by the GTM-UVigo team to address this task.¹

¹The code of the GTM-UVigo systems will be released at https://github.com/gtm-uvigo/Mediaeval_PersonDiscovery

2. AUDIO-BASED PERSON DISCOVERY
The audio approaches can be divided into three stages: speech activity detection, division of speech regions into speaker turns and, lastly, speaker clustering.

2.1 Speech Activity Detection
A Deep Neural Network (DNN) based speech activity detector (SAD) was used. The acoustic features were 26 log-mel-filterbank outputs, and a window of 31 frames was used to predict the label of the central frame. The DNN has the following architecture: an 806-unit input layer (26 features x 31 frames), 4 hidden layers, each containing 32 tanh activation units, and an output layer consisting of two softmax units. The output layer generates a posterior probability for the presence or non-presence of speech, and the ratio of both output posteriors is used as a confidence measure of speech activity over time. This confidence is median filtered to produce a smoothed estimate of speech presence and, finally, a frame is classified as speech if this smoothed value is greater than a threshold.
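As an illustration, the decision stage described above fits in a few lines. The following is a minimal sketch, assuming the two softmax outputs of the DNN are available as NumPy arrays; the median-filter length and the decision threshold are illustrative values, not the ones used in the evaluation.

```python
import numpy as np
from scipy.signal import medfilt

def sad_decisions(p_speech, p_nonspeech, kernel=11, threshold=1.0):
    """Turn frame-level DNN posteriors into speech/non-speech labels.

    The ratio of the two softmax outputs serves as a confidence
    measure, which is median filtered before thresholding.
    `kernel` (odd) and `threshold` are illustrative values.
    """
    eps = 1e-10
    confidence = p_speech / (p_nonspeech + eps)   # posterior ratio
    smoothed = medfilt(confidence, kernel_size=kernel)
    return smoothed > threshold                   # True = speech frame
```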
2.2 Speaker Segmentation
After performing speech activity detection, the speech segments are further divided into speaker turns following the approach described in [7]. First, Mel-frequency cepstral coefficients (MFCCs) plus energy are extracted from the waveform. After this, the Bayesian Information Criterion (BIC) based segmentation approach described in [2] is employed, performing a coarse segmentation to find candidate change-points followed by a refinement step. A false alarm rejection strategy is applied in the latter step so as to discard change-points that are suspected of being false alarms [6].

2.3 Speaker Clustering
Two different approaches for speaker diarization were assessed: one working in online mode, used in the primary system, and another working in offline mode. A feature they have in common is the use of the iVector paradigm [3] for speaker turn representation.

2.3.1 Online approach
This clustering strategy consists of comparing the iVectors of the speaker models with the iVector of a given speaker turn by computing their dot products; if the maximum dot product exceeds a predefined threshold, the speaker turn is assigned to the corresponding speaker model, and otherwise it is considered a new speaker. Every time a new segment is assigned to a speaker, its model is refined by computing the mean of all the iVectors assigned to that speaker model.

A novel feature introduced in this online clustering scheme is the use of written names obtained from OCR [9] for automatic error correction. To that end, the speaker assignment derived from these written names is considered more reliable than the clustering assignment, so whenever the clustering and the written-name approach make different decisions, the written name prevails over the clustering decision.

2.3.2 Offline approach
The proposed offline clustering strategy relies on an agglomerative hierarchical clustering scheme. First, a similarity matrix was obtained by computing the dot product between all pairwise combinations of the iVectors of the speaker turns, and this matrix was used to obtain a dendrogram. The C-score stopping criterion described in [8] was used to select the number of clusters.
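To make the online scheme of Section 2.3.1 concrete, the sketch below illustrates the clustering loop, assuming length-normalized iVectors (so that the dot product acts as a cosine similarity); the threshold is an illustrative value, and the `names` mapping, which stands in for the OCR-based written-name override, is hypothetical.

```python
import numpy as np

def online_cluster(turn_ivectors, threshold=0.5, names=None):
    """Sketch of the online clustering loop of Section 2.3.1.

    Each speaker model is the mean of the (length-normalized)
    iVectors assigned to it. `names` optionally maps turn indices
    to OCR written names, which override the clustering decision.
    """
    models, members, labels = [], [], []
    name_to_model = {}
    for i, w in enumerate(turn_ivectors):
        w = np.asarray(w, dtype=float)
        w = w / np.linalg.norm(w)                # length normalization
        name = names.get(i) if names else None
        if name is not None and name in name_to_model:
            k = name_to_model[name]              # written name prevails
        else:
            scores = [float(np.dot(m, w)) for m in models]
            if scores and max(scores) > threshold:
                k = int(np.argmax(scores))       # assign to best model
            else:
                k = len(models)                  # open a new speaker
                models.append(w)
                members.append([])
        members[k].append(w)
        models[k] = np.mean(members[k], axis=0)  # refine speaker model
        if name is not None:
            name_to_model.setdefault(name, k)
        labels.append(k)
    return labels
```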
3. VIDEO-BASED PERSON DISCOVERY
The video-based strategies encompass three different steps: face detection and tracking, visual speech activity detection, and face clustering.

3.1 Face Detection and Tracking
Face detection is based on histogram of oriented gradients (HOG) features and a linear SVM classifier implemented in the dlib library [5]. For each detected person, a face tracking and landmark detection method based on CLNF models is used [1]; every time a person stops being visible on screen, a model containing information about presence, speech intervals and the highest-quality face templates is stored in a database. To reduce the false alarm rate, face tracks that have a short duration and a low quality score are rejected; this score is calculated as a weighted sum of face symmetry and sharpness values.

3.2 Visual Speech Activity Detection
The proposed visual speech activity detection method is based on relative mouth movements, which are generally small in silence sections, whereas variations of lip shape are usually stronger during speech [12]. Using the face landmarks obtained in the previous step, mouth openness and lip height variance over time are computed. A variable threshold based on face size is applied in order to make the decision at each frame, and a low-pass filter is used to smooth the results.

3.3 Face Clustering
The face clustering strategies consist of a face recognition system: every time a face track is about to be inserted in the database, a score is computed in order to decide whether to add it as a new person or to merge it with an existing one. First, Gabor features are extracted from the highest-quality templates of a person, and matching scores are obtained using the hyper-cosine distance [4]. Second, the final score to compare with the merging threshold is computed as the maximum of all the matching scores obtained from the two sets of face images. In the intrashot strategy, only models that appear within the same shot are compared, aiming at correcting presence intervals when the tracking method fails. The intershot strategy allows merging all the appearances of a person in a video.

4. MULTIMODAL PERSON DISCOVERY
Multimodal person discovery was performed using four different sources of information: speaker diarization (SD) using the techniques described in Section 2; face detection (FD) and video-based speech activity detection (VVAD) as described in Section 3; and written names (WN) extracted using the strategy described in [9]. First, the set of evidences is defined as proposed in the baseline fusion strategy provided by the organizers. Given a shot, a person is considered to appear in it if the same name is present in SD, FD and VVAD within the time interval that defines the shot. A late naming strategy was used to assign names to the different sources of information [11]. For each hypothesized name, a confidence is computed as proposed in the baseline strategy, but hypotheses with confidence lower than 1 are discarded, as they correspond to situations of non-overlap between the evidence and the hypothesized name.
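The fusion rule can be illustrated with a small sketch. The data structures below are hypothetical (each source is assumed to map person names to lists of (start, end) intervals after late naming); the sketch only shows the intersection rule, not the confidence computation.

```python
def persons_in_shot(shot, sd, fd, vvad):
    """Illustrative version of the fusion rule in Section 4.

    `shot` is a (start, end) interval; `sd`, `fd` and `vvad` are
    assumed to map person names to lists of (start, end) intervals.
    A name is kept only if all three sources place it inside the
    shot interval.
    """
    start, end = shot

    def present(intervals):
        # True if any interval overlaps the shot interval
        return any(s < end and e > start for s, e in intervals)

    return {name for name in sd
            if present(sd[name])
            and present(fd.get(name, []))
            and present(vvad.get(name, []))}
```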
5. RESULTS AND DISCUSSION
Table 2 shows the results achieved by the submitted systems on both the REPERE (partition 'test2') and INA datasets; these systems are combinations of the two proposed speaker diarization and face clustering strategies, as summarized in Table 1. The results achieved using the baseline metadata (b) are also shown for comparison.

Table 1: Summary of the submitted systems

  System              Spk. clustering    Face clustering
  Primary (p)         online             intrashot
  Contrastive1 (c1)   online             intershot
  Contrastive2 (c2)   offline            intrashot
  Contrastive3 (c3)   offline            intershot

Table 2: Results on the development and test datasets corresponding to the July 1st deadline

                  REPERE                        INA
        EwMAP     MAP       C         EwMAP     MAP       C
  p     75.76%    77.10%    78.03%    80.34%    80.61%    92.42%
  c1    74.90%    75.80%    77.58%    75.42%    75.69%    85.99%
  c2    75.76%    77.10%    77.58%    80.21%    80.49%    92.32%
  c3    75.54%    76.43%    77.58%    75.26%    75.54%    85.89%
  b     63.58%    63.93%    71.75%    78.35%    78.64%    92.71%

The results in Table 2 indicate that the two speaker diarization strategies are almost equally suitable for this task, as they achieve very similar results; nevertheless, the online strategy performs slightly better, probably due to the use of OCR information for error correction. With respect to the face clustering strategies, the intrashot method obtained better results, probably because the intershot strategy merged faces too aggressively, making the system miss speakers by erroneously combining them with others.

The development of the audio-based person discovery approaches showed us that a lower speaker diarization error rate does not necessarily lead to a higher EwMAP, as overclustering results in incorrect person detections. Also, we have to increase our efforts on TV programmes featuring challenging acoustic conditions, which are those whose performance degraded the most. Lastly, we found that adding written names obtained from OCR to the speaker diarization algorithm improved performance, so this type of fusion will be studied in more depth.

The proposed video-based person discovery approaches showed that the intrashot strategy performed better than the intershot strategy, probably because of the overclustering issue mentioned above. The most challenging aspects, which will have to be addressed in the future, were the variations in pose, scale and illumination, as they made it difficult to develop a robust face matching strategy.

The GTM-UVigo team approached this task by developing audio and face modules and combining them through a simple decision-level fusion; in future work, audiovisual fusion at earlier stages of the system will be researched in order to exploit the full potential of multimodal person discovery.

6. ACKNOWLEDGEMENTS
This research was funded by the Spanish Government ('SpeechTech4All Project' TEC2012-38939-C03-01), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and the 'AtlantTIC Project' CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under project TACTICA.

7. REFERENCES
[1] T. Baltrusaitis, P. Robinson, and L. Morency. Constrained local neural fields for robust facial landmark detection in the wild. In IEEE International Conference on Computer Vision Workshops (ICCVW), pages 354-361, 2013.
[2] M. Cettolo and M. Vescovi. Efficient audio segmentation algorithms based on the BIC. In Proceedings of ICASSP, volume VI, pages 537-540, 2003.
[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front end factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 2010.
[4] E. González-Agulla, E. Argones-Rua, J. Alba-Castro, D. González-Jiménez, and L. Anido-Rifón. Multimodal biometrics-based student attendance measurement in learning management systems. In IEEE International Symposium on Multimedia (ISM), pages 699-704, 2009.
[5] D. King. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10:1755-1758, 2009.
[6] P. Lopez-Otero. Improved Strategies for Speaker Segmentation and Emotional State Detection. PhD thesis, Universidade de Vigo, 2015.
[7] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. GTM-UVigo system for Albayzin 2014 audio segmentation evaluation. In Iberspeech 2014: VIII Jornadas en Tecnología del Habla and IV SLTech Workshop, 2014.
[8] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. A novel method for selecting the number of clusters in a speaker diarization system. In Proceedings of EUSIPCO, pages 656-660, 2014.
[9] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In Proceedings of IEEE International Conference on Multimedia and Expo (ICME), 2012.
[10] J. Poignant, H. Bredin, and C. Barras. Multimodal Person Discovery in Broadcast TV at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, 2015.
[11] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Proceedings of Interspeech, 2012.
[12] B. Rivet, L. Girin, and C. Jutten. Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Communication, 49(7):667-677, 2007.