GTM-UVigo System for Multimodal Person Discovery in Broadcast TV Task at MediaEval 2016

Paula Lopez-Otero, Laura Docio-Fernandez, Carmen Garcia-Mateo
Multimedia Technologies Group (GTM), AtlantTIC Research Center, University of Vigo
E.E. Telecomunicación, Campus Universitario S/N, 36310 Vigo
{plopez,ldocio,carmen}@gts.uvigo.es

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20–21, 2016, Hilversum, Netherlands.

ABSTRACT
In this paper, we present the system developed by the GTM-UVigo team for the Multimodal Person Discovery in Broadcast TV task at MediaEval 2016. The proposed approach is a novel strategy for person discovery that is not based on speaker and face diarisation, as in previous works. Instead, the task is approached as a person recognition problem: there is an enrolment stage, where the voice and face of each discovered person are detected, and, for each shot, the most suitable voice and face are assigned using the i-vector paradigm. These two biometric modalities are combined by decision fusion.

1. INTRODUCTION
The Person Discovery in Broadcast TV task at MediaEval 2016 aims at finding the names of people who can be both seen and heard in every shot of a collection of videos [2]. This paper describes a novel approach that is not based on speaker and face diarisation, as is usually done in this task [6, 7, 8, 10]; instead, the task is approached as a person recognition problem.

2. SYSTEM DESCRIPTION
The proposed system can be divided into an enrolment stage and a search stage. For each person name detected by optical character recognition (OCR), the most likely intervals of speech and face presence are detected and used for enrolment. Once the detected people are enrolled, speaker and face recognition are performed for each shot in order to assign labels to that shot. A decision fusion strategy is implemented in order to combine the speech and video labels. The details of the system are described below.

2.1 Name detection
The person names were obtained from the video using the baseline system provided by the organisers; specifically, the UPC OCR approach based on LOOV was used [11]. Since the output of the OCR module had errors, such as including additional words in a person name, a naïve filtering of the OCR output was performed by removing those names that had more than four words.

2.2 Speech enrolment
First, features were extracted from the waveform; specifically, 19 Mel-frequency cepstral coefficients (MFCCs) including energy were extracted every 10 ms using a 25 ms sliding window. A dynamic normalisation of the cepstral mean was applied using a sliding window of 300 ms. These features were extracted using the Kaldi toolkit [12]. Then, for each person name detected by the OCR:

• The time interval (t_start, t_end) in which the name of speaker spk appears is taken as a starting point. A strategy to enlarge this time interval, in order to obtain more data to enrol the speaker, is applied (a sketch of this step is given after this list): given the time intervals S_left = (t_start − 10, t_end) and S_right = (t_start, t_end + 10), a change point is searched for within each of these intervals using the Bayesian information criterion (BIC) algorithm for speaker segmentation, with the restriction that the change point has to lie in the intervals (t_start − 10, t_start) and (t_end, t_end + 10), respectively. If no change point was found within the interval S_left, then t_left is set to t_start − 10 and, similarly, if no change point was found within the interval S_right, then t_right is set to t_end + 10; otherwise, t_left and t_right are set to the corresponding change points. Speaker spk is then assumed to be speaking in the interval S_spk = (t_left, t_right). In case speaker spk appears several times in the OCR output, a segment is computed for each occurrence.

• Speech activity detection (SAD) was performed in order to remove the non-speech parts. To do so, the energy-based SAD approach implemented in the Kaldi toolkit was applied.

• An i-vector [5] was extracted for speaker spk using the Kaldi toolkit. In case several segments were obtained in the first step, their features were concatenated and all the segments were treated as a single one. In this step, the 19 MFCCs were augmented with their delta and acceleration coefficients.
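To make the interval-expansion rule above concrete, the following is a minimal Python sketch of its decision logic. It is an illustration rather than the authors' code, and bic_change_point is a hypothetical stand-in for a BIC-based speaker-segmentation routine.

```python
# Minimal sketch of the enrolment-interval expansion from Section 2.2.
# Assumptions: times are in seconds, and bic_change_point(lo, hi) is a
# hypothetical helper returning the time of a speaker change point
# detected in (lo, hi), or None if no change point is found there.

MARGIN = 10.0  # seconds added on each side of the OCR name interval


def expand_interval(t_start, t_end, bic_change_point, margin=MARGIN):
    """Return (t_left, t_right), the enlarged enrolment interval for a speaker."""
    # Left side: search S_left = (t_start - margin, t_end), but only accept a
    # change point lying in the added margin (t_start - margin, t_start).
    cp_left = bic_change_point(t_start - margin, t_end)
    if cp_left is not None and t_start - margin < cp_left < t_start:
        t_left = cp_left
    else:
        t_left = t_start - margin

    # Right side: search S_right = (t_start, t_end + margin), but only accept a
    # change point lying in the added margin (t_end, t_end + margin).
    cp_right = bic_change_point(t_start, t_end + margin)
    if cp_right is not None and t_end < cp_right < t_end + margin:
        t_right = cp_right
    else:
        t_right = t_end + margin

    return t_left, t_right
```

In this reading, a change point found inside the added margin trims the enrolment segment to that point; otherwise the full 10-second extension is kept.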
2.3 Face enrolment
When dealing with faces, the first step consisted in performing face tracking using the baseline approach, which is based on histograms of oriented gradients [3] and the correlation tracker proposed in [4]. Then, for each person name detected by the OCR:

• The faces detected by the face tracker in the interval (t_start, t_end) in which the name of speaker spk appears are considered. If only one face was detected, the whole presence interval of that face is taken. If more than one face was detected, the one that appeared in the most frames was assigned to the speaker, under the assumption that it was the dominant face in the given time interval (a sketch of this selection is given after this list).

• Features were extracted in the time interval obtained in the previous step. To do so, face detection was first performed and a geometric normalisation was applied. After that, a photometric enhancement of the image using the Tan & Triggs algorithm [13] was applied. Finally, discrete cosine transform (DCT) features [9] were extracted using blocks of size 12 with 50% overlap and 45 DCT components. The feature extraction stage was performed using the Bob toolkit [1].

• Once the features were obtained, an i-vector representing that face was extracted using the Kaldi toolkit. As when dealing with speech, if there were several time intervals where the face of the speaker was present, the features obtained in all the segments were concatenated.
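As an illustration of the dominant-face selection in the first step, here is a minimal Python sketch; the FaceTrack structure and its fields are assumptions made for this example and do not reflect the actual output format of the baseline tracker.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FaceTrack:
    """Hypothetical representation of one face track produced by the tracker."""
    track_id: int
    frames: List[int] = field(default_factory=list)  # frame indices where the face is visible


def dominant_track(tracks: List[FaceTrack], first_frame: int, last_frame: int) -> Optional[FaceTrack]:
    """Return the track whose face appears in the most frames of the name interval.

    If only one track overlaps the interval it is returned directly; if none
    overlaps, None is returned, mirroring the behaviour described in Section 2.3.
    """
    overlapping = []
    for track in tracks:
        count = sum(1 for f in track.frames if first_frame <= f <= last_frame)
        if count > 0:
            overlapping.append((count, track))
    if not overlapping:
        return None
    # The face present in the most frames is assumed to be the dominant one.
    return max(overlapping, key=lambda item: item[0])[1]
```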
2.4 Search
The procedure to decide which speaker was present in each shot consisted of the following steps, applied to each shot:

• In order to detect whether the shot includes speech, speech detection was performed: perceptual linear prediction coefficients plus pitch features were extracted from the time interval defined by the shot, an i-vector was extracted, and a logistic regression approach was used to classify the segment as speech or non-speech. Non-speech segments were straightforwardly discarded. If speech was present in the shot, SAD was performed, an i-vector was extracted, and this shot i-vector was compared with the enrolment i-vectors by dot scoring. The speaker that achieved the highest score was assigned to the shot.

• The faces detected by the face tracker within the shot were identified, and the one that appeared in the most frames was chosen as the most representative face of the shot. An i-vector was extracted and the same decision procedure described for the speech data was applied.

• Once a decision was made for both the speech and the face data, the following fusion approach was implemented: a shot is assigned to a person if the face and speech detectors returned the same name and the sum of their scores was greater than a threshold (a sketch of this scoring and fusion rule is given after this list).
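The following minimal Python sketch illustrates the dot scoring against the enrolment i-vectors and the fusion rule above. It assumes i-vectors are NumPy arrays; the dictionary layout, helper names and threshold handling are illustrative assumptions rather than the actual implementation.

```python
import numpy as np


def dot_score(shot_ivector, enrol_ivectors):
    """Dot scoring of a shot i-vector against enrolment i-vectors.

    enrol_ivectors maps a person name to that person's enrolment i-vector;
    returns the best-matching name and its score.
    """
    scores = {name: float(np.dot(shot_ivector, ivec))
              for name, ivec in enrol_ivectors.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]


def fuse(speech_decision, face_decision, threshold):
    """Fusion rule of Section 2.4: label a shot only when both modalities agree
    on the name and the sum of their scores exceeds a threshold.

    Each decision is a (name, score) tuple, or None if the modality produced
    no decision (e.g. a shot classified as non-speech); returns the assigned
    name, or None when the shot is left unlabelled.
    """
    if speech_decision is None or face_decision is None:
        return None
    speech_name, speech_score = speech_decision
    face_name, face_score = face_decision
    if speech_name == face_name and speech_score + face_score > threshold:
        return speech_name
    return None
```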
3. RESULTS AND DISCUSSION

Table 1: Results achieved on the whole test data and on each partition.

             All                       3-24                      DW                        INA
      MAP@1  MAP@10  MAP@100    MAP@1  MAP@10  MAP@100    MAP@1  MAP@10  MAP@100    MAP@1  MAP@10  MAP@100
p     0.315  0.236   0.211      0.538  0.394   0.366      0.242  0.185   0.185      0.358  0.265   0.208
c1    0.293  0.182   0.168      0.487  0.338   0.314      0.242  0.157   0.157      0.314  0.178   0.146
c2    0.245  0.199   0.177      0.333  0.303   0.286      0.116  0.088   0.088      0.302  0.170   0.132
b     0.363  0.273   0.247      0.667  0.477   0.462      0.251  0.186   0.186      0.440  0.341   0.276

Table 1 shows the results achieved with the audio+video fusion system (p), the audio-only system (c1), the video-only system (c2) and the baseline provided by the organisers (b). The main conclusions that can be extracted from the table are: (1) the audio and video systems are complementary, since their combination improves on the individual results; (2) the audio results are better than the video results, especially on the DW database; and (3) the worst results were obtained on the DW database, while the best ones were achieved on the 3-24 database. The reason why the 3-24 results are, in general, better might be the small number of queries in the evaluation data corresponding to this database (only 15 queries out of 693), which makes the results not statistically significant. In the case of the DW database, 606 queries were evaluated; this, combined with the fact that the OCR approach used in this system did not find person names in 612 out of 757 files of the database, led to poor results on the DW data.

The aim of this system was to assess a novel approach for person discovery that is not based on speaker and face diarisation, unlike most state-of-the-art strategies. The achieved results are promising, and the experiments performed in this evaluation allowed the detection of the main weak points of the system, which will be improved in the future:

• The quality of the OCR output had a huge impact on the results, since it is the starting point of the whole enrolment stage, so OCR errors degrade the performance of the whole system. A simple approach based on natural language processing for filtering the OCR output, in order to remove everything that was not a person name, was assessed in this framework without success, but further experiments on this topic will be carried out in the future.

• All face-based steps relied on the baseline approach for face tracking, and its output was fed to the feature extraction module; however, only the information about face presence was used, not the bounding boxes where the faces appeared. This probably led to inconsistencies in the feature extraction stage and, therefore, in the face enrolment procedure. This issue will be addressed in order to improve the quality of the face-based approach.

Acknowledgements.
This research was funded by the Spanish Government under the project TEC2015-65345-P, by the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and the 'AtlantTIC Project' CN2012/160, and by the European Regional Development Fund (ERDF).

4. REFERENCES
[1] A. Anjos, L. E. Shafey, R. Wallace, M. Günther, C. McCool, and S. Marcel. Bob: a free signal processing and machine learning toolbox for researchers. In 20th ACM Conference on Multimedia Systems (ACMMM), 2012.
[2] H. Bredin, C. Barras, and C. Guinaudeau. Multimodal Person Discovery in Broadcast TV at MediaEval 2016. In Proceedings of the MediaEval 2016 Workshop, 2016.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893, 2005.
[4] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference (BMVC), 2014.
[5] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 2010.
[6] M. India, D. Varas, V. Vilaplana, J. Morros, and J. Hernando. UPC system for the 2015 MediaEval multimodal person discovery in broadcast TV task. In Proceedings of the MediaEval 2015 Workshop, 2015.
[7] N. Le, D. Wu, S. Meignier, and J.-M. Odobez. EUMSSI team at the MediaEval person discovery challenge. In Proceedings of the MediaEval 2015 Workshop, 2015.
[8] P. Lopez-Otero, R. Barros, L. Docio-Fernandez, E. Gonzalez-Agulla, J. Alba-Castro, and C. Garcia-Mateo. GTM-UVigo systems for person discovery task at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, 2015.
[9] C. McCool and S. Marcel. Parts-based face verification using local frequency bands. In Proceedings of the IEEE/IAPR International Conference on Biometrics, 2009.
[10] F. Nishi, N. Inoue, and K. Shinoda. Combining audio features and visual i-vector @ MediaEval 2015 multimodal person discovery in broadcast TV. In Proceedings of the MediaEval 2015 Workshop, 2015.
[11] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2012.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE Signal Processing Society, 2011.
[13] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing, 19(6):1635–1650, 2010.