UPC System for the 2016 MediaEval Multimodal Person Discovery in Broadcast TV Task∗

Miquel India, Gerard Martí, Carla Cortillas, Giorgos Bouritsas, Elisa Sayrol, Josep Ramon Morros, Javier Hernando
Universitat Politècnica de Catalunya

∗ This work has been developed in the framework of the projects TEC2013-43935-R, TEC2012-38939-C03-02 and PCIN-2013-067. It has been financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, October 19-21, 2016, Hilversum, Netherlands.

ABSTRACT
The UPC system works by extracting monomodal signal segments (face tracks, speech segments) that overlap with the person names overlaid in the video signal. These segments are directly assigned the name of the person and are used as a reference to compare against the non-overlapping (unassigned) signal segments. This process is performed independently on the speech and video signals. A simple fusion scheme is used to combine both monomodal annotations into a single one.

1. INTRODUCTION
This paper describes the UPC system for the Multimodal Person Discovery in Broadcast TV task [2] of the 2016 MediaEval evaluations. The system detects face tracks (FT), speech segments (SS) and the person names overlaid in the video signal. The video and speech signals are processed independently. For each modality, we aim to construct a classifier that can determine whether a given FT or SS belongs to one of the persons appearing on the scene with an assigned overlaid name. As the system is unsupervised, we use the detected person names to identify the persons appearing in the program. Thus, we assume that the FTs or SSs that overlap with a detected person name are true representations of that person.

The signal intervals that overlap with an overlaid person name are extracted and used for unsupervised enrollment, defining a model for each detected name. This way, a set of classes corresponding to the different persons in the detected names is defined. These intervals are directly labeled by assigning the identity corresponding to the overlaid name. For each modality, a joint identification-verification algorithm is then used to assign each unlabeled signal interval (FT or SS not overlapping with an overlaid name) to one of the previous classes. For each unlabeled interval, the signal is compared against all the models and the one with the best likelihood is selected. An additional 'Unknown' class is implicitly considered, corresponding to the cases where the face track or speech segment belongs to a person that is never named (i.e., none of the appearances of this person in the video overlap with a detected name).

At the end of this process we have two different sets of annotations, one for speech and one for video. The two results are fused, as described in Section 5, to obtain the final annotation.

2. TEXT DETECTION
We have used the two baseline detections with some additional post-processing. The first one (TB1) was generated by our team and distributed to all participants. The LOOV [6] text detection tool was used to detect and track text, i.e., to define the temporal intervals where a given text appears. Detections were filtered by comparing them against lists of first names and last names downloaded from the internet. We also used lists of neutral particles ('van', 'von', 'del', etc.) and negative names ('boulevard', etc.). All names were normalized to contain only lower-case alphabetic ASCII characters, without accents or special characters. For a detected text to be considered a name, it had to contain at least one first name and one last name. The percentage of positive matches for these two classes was used as a score; matches from the neutral class did not penalize the percentage. Additionally, if the first word in the detected text was included in the negative list, the text was discarded.
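The following is a minimal sketch of this filtering heuristic, not our actual implementation: the list file names, the tokenization, and the example neutral and negative words are assumptions made for illustration.

```python
import re

def load_list(path):
    """Load one normalized entry per line (lowercase ASCII)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Hypothetical list files; in practice the lists were downloaded from the internet.
first_names = load_list("first_names.txt")
last_names = load_list("last_names.txt")
neutral_particles = {"van", "von", "del"}          # example neutral particles
negative_words = {"boulevard", "avenue", "rue"}    # example negative words

def normalize(text):
    """Lowercase and keep only ASCII letters and spaces.
    (Accent transliteration is omitted for brevity; accented characters are dropped here.)"""
    text = re.sub(r"[^a-z ]+", " ", text.lower())
    return text.split()

def name_score(detected_text):
    """Return a match score in [0, 1], or None if the detected text is rejected."""
    words = normalize(detected_text)
    if not words or words[0] in negative_words:
        return None                         # discard texts starting with a negative word
    has_first = any(w in first_names for w in words)
    has_last = any(w in last_names for w in words)
    if not (has_first and has_last):
        return None                         # must contain a first name and a last name
    considered = [w for w in words if w not in neutral_particles]   # neutral words do not penalize
    positives = sum(w in first_names or w in last_names for w in considered)
    return positives / max(len(considered), 1)
```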
To construct TB1 we had access to the test videos before the rest of the participants. However, we only used this data for this purpose and we did not run any test of the rest of our system before the official release.

The second set of annotations, TB2, was provided by the organizers [2]. These annotations contained a large number of false positives. We applied the filtering described above to TB2 and combined the result with TB1, as the two detectors turned out to be partly complementary.

3. VIDEO SYSTEM
For face tracking, the 2015 baseline code [7] was used. Filtering was applied to remove tracks shorter than a fixed duration or with a face size that was too small.

The VGG-face [8] Convolutional Neural Network (CNN) was used for feature extraction. We extracted the features from the activations of the last fully connected layer. The network was trained using a triplet network architecture [5]. The features of the detected faces in each track are extracted with this network and then averaged to obtain a 1024-dimensional feature vector for each track.

A face verification algorithm was used to compare and classify the tracks. First, the tracks that overlapped with a detected name were named by assigning that identity. To reduce wrong assignments, the name was only assigned if it overlapped with a single track. Then, using the set of named tracks from the full video corpus, a Gaussian Naive Bayes (GNB) binary classifier was trained on the Euclidean distances between pairs of samples from the named tracks. For each specific video, each unnamed track was compared with all the named tracks of the video by computing the Euclidean distance between the respective feature vectors of the tracks (see Figure 1). This value was classified with the GNB as either an intra-class distance (both tracks belong to the same identity) or an inter-class distance (the tracks are not from the same person). The probability of the distance being intra-class was used as the confidence score. The unnamed track was assigned the identity of the most similar named track, and a threshold on the confidence score (0.75) was used to discard tracks not corresponding to any named track.

[Figure 1: Diagram of the verification system]
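As an illustration of this track-naming step (a sketch under our assumptions, not the exact implementation), the procedure could be expressed as follows; the use of scikit-learn's GaussianNB and the helper names are assumptions, while the distance feature, the intra/inter-class labels and the 0.75 threshold follow the description above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def train_distance_classifier(named_feats, named_ids):
    """Train a GNB on Euclidean distances between pairs of named tracks.
    Label 1 = intra-class (same identity), 0 = inter-class."""
    dists, labels = [], []
    for i in range(len(named_feats)):
        for j in range(i + 1, len(named_feats)):
            dists.append([np.linalg.norm(named_feats[i] - named_feats[j])])
            labels.append(1 if named_ids[i] == named_ids[j] else 0)
    gnb = GaussianNB()
    gnb.fit(np.array(dists), np.array(labels))
    return gnb

def assign_identity(track_feat, named_feats, named_ids, gnb, threshold=0.75):
    """Assign an unnamed track the identity of the most similar named track,
    unless the intra-class confidence falls below the threshold."""
    best_id, best_conf = None, -1.0
    for feat, ident in zip(named_feats, named_ids):
        d = np.linalg.norm(track_feat - feat)
        conf = gnb.predict_proba([[d]])[0, 1]   # P(intra-class | distance)
        if conf > best_conf:
            best_id, best_conf = ident, conf
    return (best_id, best_conf) if best_conf >= threshold else (None, best_conf)
```

Since the classifier sees only a one-dimensional distance feature, the GNB effectively acts as a probabilistic threshold on the Euclidean distance, which keeps the verification step very light.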
4. SPEAKER TRACKING
Speaker information was extracted using an i-vector based speaker tracking system. Assuming that overlaid text names are temporally overlapped with their speaker and face identities, speaker models were created using the data of those text tracks. Speaker tracking was performed by evaluating the cosine distance between the model i-vectors and the i-vectors extracted for each frame of the signal.

Speaker modelling was implemented using i-vectors [3]. An i-vector is a low-rank vector, typically of dimension between 400 and 600, that represents a speech utterance. The feature vectors of the speech signal are modeled by a Gaussian Mixture Model (GMM) adapted from a Universal Background Model (UBM). The mean vectors of the adapted GMM are stacked to build the supervector M, which can be written as:

M = m_u + Tω    (1)

where m_u is the speaker- and session-independent mean supervector from the UBM, T is the total variability matrix, and ω is a hidden variable. The mean of the posterior distribution of ω is referred to as the i-vector. This posterior distribution is conditioned on the Baum-Welch statistics of the given speech utterance. The T matrix is trained using the Expectation-Maximization (EM) algorithm given the centralized Baum-Welch statistics from background speech utterances. More details can be found in [3].

The speaker tracking system has been implemented as a speaker identification system with a segmentation-by-classification method. For feature extraction, 20 Mel Frequency Cepstral Coefficients (MFCC) plus ∆ and ∆∆ coefficients were extracted. Using the Alize toolkit [4, 1], a total variability matrix was trained per show. I-vectors were extracted from 3-second segments with a 0.5-second shift, and the baseline speaker diarization was used to select the speaker-turn segments from which the query i-vectors were extracted. Identification was performed by evaluating the cosine distance between the segment i-vectors and each query i-vector; the query with the lowest distance was assigned to the segment. A global distance threshold, previously trained on the development database, was used to discard assignments with large distances.
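A schematic view of this segmentation-by-classification step is sketched below. It assumes that the segment and query i-vectors have already been extracted (in our system, with the Alize toolkit); the data structures, function names and the way the 'Unknown' case is returned are illustrative, while the cosine distance, the closest-query assignment and the global distance threshold follow the text.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two i-vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def track_speakers(segment_ivectors, query_ivectors, dist_threshold):
    """Assign each segment i-vector (3 s windows, 0.5 s shift) to the closest query identity,
    or to None when the best distance exceeds the globally trained threshold."""
    assignments = []
    for seg_iv in segment_ivectors:
        best_name, best_dist = None, float("inf")
        for name, query_iv in query_ivectors.items():
            d = cosine_distance(seg_iv, query_iv)
            if d < best_dist:
                best_name, best_dist = name, d
        if best_dist > dist_threshold:
            best_name = None    # too far from every named speaker: implicit 'Unknown'
        assignments.append((best_name, best_dist))
    return assignments
```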
5. FUSION SYSTEM AND RESULTS
Starting from the speaker and face tracking shot labelings, two fusion methods were implemented. The first method was the intersection of the shots of both tracking systems, averaging the confidence of the intersected shots. The second method was a union strategy that relies on the intersected shots of both modalities and reduces the confidence of the shots that are not intersected: the shots of the video and speaker systems are merged, averaging the confidence score if both systems detect the same identity in a shot, or reducing the confidence by a factor of 0.5 if only one of the systems detected a query.

Four different experiments were performed, shown in Table 1. Baseline 1 refers to the fusion between the baseline speaker diarization and the OCR, Baseline 2 refers to the fusion between the face detection and the OCR, and Baseline 3 is the intersection of the two previous baselines. Initially, speaker and face tracking were evaluated separately. The intersection and the union of both tracking systems were then implemented as fusion strategies.

Table 1: MAP evaluation
System          MAP1 (%)   MAP5 (%)   MAP10 (%)
Baseline 1      13.1       12.0       11.7
Baseline 2      37.0       30.3       29.2
Baseline 3      36.3       29.3       27.3
Spk Tracking    43.3       30.6       28.8
Face Tracking   61.3       47.9       45.5
Intersection    47.9       34.0       32.0
Union           63.0       50.5       48.4

As shown in Table 1, both monomodal systems improve the baseline performances by a large margin. The union strategy performs better than the intersection strategy, but this fusion does not show a significant performance increase over the individual modalities. Analysing the results, we believe that failures in text detection were the main factor impacting the final scores.

6. CONCLUSIONS
Speaker and face tracking have been combined using different fusion strategies. This year, our idea was to focus only on the overlaid names to develop tracking systems, instead of merging diarization systems with text detection. The tracking systems have shown better performance than the diarization-based baselines. For fusion, the union strategy obtained higher scores than the intersection method.

7. REFERENCES
[1] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher, A. Preti, G. Pouchoulin, N. Evans, B. Fauve, and J. Mason. ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition. In Proc. Odyssey: The Speaker and Language Recognition Workshop, 2008.
[2] H. Bredin, C. Barras, and C. Guinaudeau. Multimodal person discovery in broadcast TV at MediaEval 2016. In Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.
[3] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, May 2011.
[4] A. Larcher, J.-F. Bonastre, B. Fauve, K. A. Lee, C. Lévy, H. Li, J. S. D. Mason, and J.-Y. Parfait. ALIZE 3.0 - open source toolkit for state-of-the-art speaker recognition. In Annual Conference of the International Speech Communication Association (Interspeech), 2013.
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015.
[6] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In ICME 2012, 2012.
[7] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.