=Paper=
{{Paper
|id=Vol-1436/Paper54
|storemode=property
|title=UPC System for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV task
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper54.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/IndiaVVMH15
}}
==UPC System for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV task==
M. India, D. Varas, V. Vilaplana, J.R. Morros, J. Hernando
Universitat Politècnica de Catalunya, Spain

ABSTRACT

This paper describes a system that identifies people in broadcast TV shows in a purely unsupervised manner. The system outputs the identity of people who appear, talk and can be identified using information present in the show itself (in our case, overlaid text with person names). Three types of monomodal technologies are used: speech diarization, video diarization and text detection / named entity recognition. These technologies are combined using a linear programming approach in which several restrictions are imposed.

1. INTRODUCTION

The goal of the 2015 Multimodal Person Discovery in Broadcast TV task [13] is to identify people appearing and speaking in TV shows in a purely unsupervised manner. This paper describes the UPC contribution, which is based on combining speech diarization, video-based face diarization and text detection plus Named Entity Recognition (NER). We did not make use of the names present in speech transcriptions.

2. AUDIO SYSTEM

Speaker information was extracted using an Agglomerative Hierarchical Clustering diarization system based on Hidden Markov Models [21, 20, 2, 11]. It uses energy-based speech activity detection, Mel Frequency Cepstral Coefficient voice features and an initial uniform segmentation. Speaker clusters are modeled with Gaussian Mixture Models (GMM). The complexity of each model is selected according to the amount of data per cluster and the cluster complexity ratio, which fixes the amount of speech per Gaussian. Hidden Markov Model (HMM) training and cluster realignment by Viterbi decoding are based on maximum likelihood. In the decoding stage, a minimum speaker segment duration of 3 seconds is imposed to avoid overly short segments. For cluster merging, the most likely pair of clusters is selected in each iteration; this likelihood is computed with a modified Bayesian Information Criterion (BIC) [4, 1] metric between clusters.

This system has been used with two different kinds of input for each show. On the one hand, diarization is run on each audio file without any constraint. On the other hand, using a face-tracking system, segments without tracked faces are discarded; the purpose of this second method is to run the diarization only on those parts where we assume that someone visible in the video must be speaking.
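To make the merging criterion concrete, the sketch below scores a pair of speaker clusters with a penalized ΔBIC and merges the most likely pair in each iteration. It is only an illustration, not the system described above: it models each cluster with a single full-covariance Gaussian over MFCC frames instead of the GMM/HMM models used in the actual system, and the penalty weight lam is an assumed parameter.

<syntaxhighlight lang="python">
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Penalized Delta-BIC between two clusters of MFCC frames (rows = frames).

    Each cluster is modelled here with a single full-covariance Gaussian, a
    simplification of the paper's GMM-based cluster models used only to
    illustrate the merging criterion. Negative values favour merging.
    """
    x = np.vstack([x1, x2])
    n1, n2, n = len(x1), len(x2), len(x1) + len(x2)
    d = x.shape[1]
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # Log-likelihood gain obtained by keeping the two clusters separate
    gain = 0.5 * (n * logdet(x) - n1 * logdet(x1) - n2 * logdet(x2))
    # Complexity penalty for modelling the data with two Gaussians instead of one
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty

def merge_most_likely_pair(clusters, lam=1.0):
    """One agglomerative iteration: merge the pair with the lowest Delta-BIC."""
    best_score, best_pair = None, None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            score = delta_bic(clusters[i], clusters[j], lam)
            if best_score is None or score < best_score:
                best_score, best_pair = score, (i, j)
    if best_pair is None or best_score >= 0:
        return clusters, False                      # nothing worth merging: stop
    i, j = best_pair
    merged = np.vstack([clusters[i], clusters[j]])
    remaining = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return remaining + [merged], True
</syntaxhighlight>

Iterating merge_most_likely_pair until it reports no merge mimics the stopping behaviour of a BIC-based agglomerative pass.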
3. VIDEO SYSTEM

For face tracking, the baseline code was used (tracking by detection with the Kanade-Lucas-Tomasi algorithm [18, 10, 16]). For feature extraction we used the technique of the baseline (HOG [5] features at facial landmark locations [19], concatenated and projected using LDML [8]). While in the baseline a single descriptor was selected for each track, we use several vectors per track, obtained by uniform temporal sampling of the track faces. We expect this approach to better capture variations in pose and expression.

We used agglomerative hierarchical clustering: a binary hierarchical tree is created by fusing tracks according to the minimum distance between track vectors. The number of clusters may vary between videos and has to be determined. It is estimated by evaluating the Calinski-Harabasz [3] and Silhouette [14] criteria in the range [50, 80]; the number of resulting clusters is the average of the two cluster counts at which each criterion reaches its maximum.

To improve the diarization, spatio-temporal restrictions were introduced. We assume that a person cannot appear twice in the same frame, so tracks with temporal overlap should represent different persons and are prevented from merging into the same cluster. Also, since we use a multi-vector representation for each track, vectors of the same track must end up in the same cluster. The restrictions are modeled with a matrix expressing the relationship between feature vectors: entries for vectors in different, non-overlapping tracks are assigned a value of 1, entries for vectors in the same track a value 0 < v ≪ 1, and entries for vectors in temporally co-occurring tracks a very large value v ≫ 1. This matrix is used to point-wise multiply the vector-to-vector distance matrix used for clustering.
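As an illustration of this restriction scheme, the sketch below builds the weighted distance matrix and cuts a single-linkage tree at a given number of clusters. The constants SAME_TRACK and CO_OCCURRING, the frame-span overlap test and the use of plain Euclidean distances are assumptions made for the example, not values taken from the paper.

<syntaxhighlight lang="python">
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, squareform

SAME_TRACK = 1e-3      # stands in for the paper's 0 < v << 1
CO_OCCURRING = 1e6     # stands in for the paper's v >> 1

def spans_overlap(a, b):
    """True if two (start_frame, end_frame) spans overlap in time."""
    return a[0] <= b[1] and b[0] <= a[1]

def restricted_distance_matrix(vectors, track_ids, track_spans):
    """Point-wise multiply vector-to-vector distances by the restriction matrix.

    vectors     : (N, d) array with several descriptors per face track.
    track_ids   : length-N sequence mapping each vector to its track.
    track_spans : dict track_id -> (start_frame, end_frame).
    """
    dist = cdist(vectors, vectors)                  # Euclidean distances (assumed)
    weights = np.ones_like(dist)
    for a, ta in enumerate(track_ids):
        for b, tb in enumerate(track_ids):
            if ta == tb:
                weights[a, b] = SAME_TRACK          # pull same-track vectors together
            elif spans_overlap(track_spans[ta], track_spans[tb]):
                weights[a, b] = CO_OCCURRING        # keep co-occurring tracks apart
    weighted = dist * weights
    np.fill_diagonal(weighted, 0.0)
    return weighted

def cluster_face_tracks(vectors, track_ids, track_spans, n_clusters):
    """Cut a single-linkage agglomerative tree built on the restricted distances."""
    d = restricted_distance_matrix(vectors, track_ids, track_spans)
    tree = linkage(squareform(d, checks=False), method="single")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
</syntaxhighlight>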
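The cluster-count selection described earlier in this section can also be sketched with scikit-learn's implementations of the two criteria. Here labels_for_k is a placeholder for whatever produces a k-cluster cut of the agglomerative tree, and the scores are computed on the raw feature vectors rather than on the restricted distances, which is a simplification.

<syntaxhighlight lang="python">
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def estimate_num_clusters(vectors, labels_for_k, k_range=range(50, 81)):
    """Average of the cluster counts that maximise each criterion.

    vectors      : (N, d) array of face descriptors.
    labels_for_k : callable returning a k-cluster labelling of `vectors`,
                   e.g. a cut of the agglomerative tree (placeholder name).
    """
    ch, sil = {}, {}
    for k in k_range:
        labels = labels_for_k(k)
        ch[k] = calinski_harabasz_score(vectors, labels)
        sil[k] = silhouette_score(vectors, labels)
    best_ch = max(ch, key=ch.get)       # k maximising Calinski-Harabasz
    best_sil = max(sil, key=sil.get)    # k maximising the Silhouette score
    return round((best_ch + best_sil) / 2)
</syntaxhighlight>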
4. TEXT SYSTEM

We used both the person names provided with the baseline [6, 12] and our own technology for obtaining person names (in different runs). From the input image, a segmentation is created with a Binary Partition Tree [15] using color and stroke width [7]. A partition is built where each character is a connected component while background regions are merged. Next, regions are filtered by a sequence of binary classifiers that reject non-character components. Components accepted by the classifiers as character candidates are combined into pairs, and pairs are combined into chains. A post-processing stage is applied to recover components wrongly rejected as false positives in the filtering stage. The Tesseract OCR engine [17] provides one transcription for each text chain, and the Stanford Named Entity Recognizer [9] is used to automatically detect person names in the transcribed text.

[Figure 1: System block diagram]

5. FUSION

Our system combines the previous information sources to obtain the final person recognition labelling. Speaker diarization and video diarization are performed first, in an independent manner. In order to fuse this information into a final labelling, the development database was analyzed and some assumptions were made:

• The speaker is not always the person shown on screen, so it is important to weigh accurately the temporal overlaps between each speaker and its different possible face identity assignments.
• Some speakers never come into view during the show, and other people are shown on screen but do not speak. Both should be discarded.
• Text identities are more related to who is shown than to who is speaking, so text is better combined with video than with speech.

According to these assumptions, an algorithm was designed based on weighting the temporal overlaps between tracks (Figure 1). This algorithm considers two different fusion modalities (Video/Text and Video/Audio) and combines both to obtain a final track file. First, text and video are fused: their overlapping tracks are selected, and the temporal overlaps of their identities are weighted to set the constraints of an ILP problem (solved with IBM CPLEX):

<math>\max_{\alpha_{ij}} \sum_{i} \sum_{j} \alpha_{ij}\,\beta_{ij} \qquad (1)</math>

<math>\sum_{j} \alpha_{ij} \leq 1 \quad \forall i \qquad (2)</math>

where α_ij is the assignment between text identity i and video identity j, and β_ij is the weight of that assignment. Equation 2 establishes that each text identity may have at most one face identity assigned. The next step combines the speech diarization tracks with the face tracks that have a text identity assigned, using the same ILP-based method. Finally, using the relation between text, face and speaker identities and the overlapping tracks of the second fusion, the final labelling output is obtained. A second algorithm was implemented by changing the order of the fusions: audio is first fused with video, and the result is then combined with the text identities; thus, only the face identities with a speaker assigned are considered.
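The first fusion step reduces to a small integer linear program. The sketch below expresses Equations (1) and (2) with the open-source PuLP modeller, whose bundled CBC solver stands in for the CPLEX solver used in the paper; the structure of the weights dictionary holding the overlap weights β_ij and the example identity names are assumptions made for the illustration.

<syntaxhighlight lang="python">
import pulp

def fuse_identities(weights):
    """Assign each text identity to at most one video identity (Eqs. 1-2).

    weights[i][j] plays the role of beta_ij: the temporal-overlap weight
    between text identity i and face (video) identity j. The paper solved
    the ILP with IBM CPLEX; here PuLP's bundled CBC solver is used instead.
    """
    text_ids = list(weights)
    video_ids = sorted({j for i in weights for j in weights[i]})
    prob = pulp.LpProblem("text_video_fusion", pulp.LpMaximize)
    alpha = pulp.LpVariable.dicts(
        "alpha", [(i, j) for i in text_ids for j in video_ids], cat=pulp.LpBinary)
    # Eq. (1): maximise the total weighted overlap of the chosen assignments
    prob += pulp.lpSum(weights[i].get(j, 0.0) * alpha[(i, j)]
                       for i in text_ids for j in video_ids)
    # Eq. (2): each text identity gets at most one face identity
    for i in text_ids:
        prob += pulp.lpSum(alpha[(i, j)] for j in video_ids) <= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {i: j for i in text_ids for j in video_ids
            if pulp.value(alpha[(i, j)]) > 0.5}

# Invented example weights, for illustration only
overlaps = {"john doe": {"face_03": 4.2, "face_07": 0.5},
            "jane roe": {"face_07": 6.1}}
print(fuse_identities(overlaps))   # -> {'john doe': 'face_03', 'jane roe': 'face_07'}
</syntaxhighlight>

The same routine can be reused for the second fusion step, with speaker clusters taking the place of text identities and the face tracks that already carry a text identity as the second index.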
6. RESULTS

Five different experiments were performed; they are summarized in Table 1. These experiments were run on the training database and evaluated with the Mean Average Precision (MAP) metric. In the experiments we tested several variations: the order of the fusions, the input of the audio diarization and the text system used. In Table 1, System 1 refers to the architecture shown in Figure 1, where the first fusion combines text and video, and System 2 refers to first combining video and audio and later fusing the text. "facetrack" indicates that the audio diarization is performed using only the audio segments where faces are detected; "-" means the diarization is performed on the whole audio input. While the first four experiments use the baseline names, the fifth one uses the system described in Section 4.

Table 1: MAP Evaluation
Exp. | System | Audio input | NER        | MAP
  1  |   2    | facetrack   | Baseline   | 22.6
  2  |   1    | facetrack   | Baseline   | 27.1
  3  |   2    | -           | Baseline   | 33.5
  4  |   1    | -           | Baseline   | 41.6
  5  |   1    | -           | UPC system | 32.6

The best performance was achieved in experiment 4 with System 1, without filtering the audio input for the diarization and using the baseline person names. There is clear evidence that the system works better when the diarization is run on the whole audio input. Regarding the fusion order of the algorithm, the results indicate that fusing the video and text tracks first provides better performance.

The five experiments were also run on the test data. Experiments 1-4 were submitted on July 1st and experiment 5 on July 8th. The best set-up on the training data (Exp. 4 in Table 1) was uploaded as our primary submission. After evaluating this primary submission with the final set of annotations, the following results were obtained: EwMAP = 54.1%, MAP = 54.36% and C = 69.71%. Experiment 5 is similar to experiment 4 but uses our own technology to obtain person names; the OCR and NER performance was low, and the results were therefore worse than expected.

7. CONCLUSIONS

Speaker diarization, face recognition, and text detection with named entity recognition have been combined using an integer linear programming approach. Our idea was to first perform monomodal speech and video diarizations, using as many restrictions as possible to improve their results, and then use ILP to combine these diarizations together with the person name information. Several architectures for this combination, and several constraints of the integer linear programming algorithm, were considered. The architecture that first fuses the text stream with the video modality and combines the audio modality afterwards has provided the best results.

8. ACKNOWLEDGMENTS

This work has been developed in the framework of the projects TEC2013-43935-R, TEC2012-38939-C03-02 and PCIN-2013-067. It has been financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).

9. REFERENCES

[1] J. Ajmera and C. Wooters. A robust speaker clustering algorithm. In Proc. ASRU, 2003.
[2] X. Anguera, C. Wooters, and J. Hernando. Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2011-2022, 2007.
[3] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Simulation and Computation, 3(1):1-27, 1974.
[4] S. S. Chen and P. Gopalakrishnan. Clustering via the Bayesian information criterion with applications in speech recognition. In Proc. ICASSP, 20:645-648, 1998.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. CVPR, 2005.
[6] M. Dinarelli and S. Rosset. Models cascade for tree-structured named entity detection. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1269-1278, Chiang Mai, Thailand, November 2011. Asian Federation of Natural Language Processing.
[7] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In Proc. CVPR, pages 2963-2970, 2010.
[8] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Face recognition from caption-based supervision. IJCV, 96(1), 2012.
[9] J. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363-370, 2005.
[10] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. Pages 674-679, 1981.
[11] J. Luque, X. Anguera, A. Temko, and J. Hernando. Speaker diarization for conference room: The UPC RT07s evaluation system. Multimodal Technologies for Perception of Humans, pages 543-553, 2008.
[12] J. Poignant, L. Besacier, G. Quenot, and F. Thollard. From text detection in videos to person identification. In 2012 IEEE International Conference on Multimedia and Expo (ICME), pages 854-859, July 2012.
[13] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. In Proceedings of MediaEval 2015, September 2015.
[14] P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20(1):53-65, Nov. 1987.
[15] P. Salembier and L. Garrido. Binary partition tree as an efficient representation for image processing, segmentation and information retrieval. IEEE TIP, 9(4):561-575, April 2000.
[16] J. Shi and C. Tomasi. Good features to track. Pages 593-600, 1994.
[17] R. Smith (Google Inc.). An overview of the Tesseract OCR engine. In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR), pages 629-633, 2007.
[18] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical report, International Journal of Computer Vision, 1991.
[19] M. Uricar, V. Franc, and V. Hlavac. Facial landmarks detector learned by the structured output SVM. In G. Csurka, M. Kraus, R. Laramee, P. Richard, and J. Braz, editors, Computer Vision, Imaging and Computer Graphics: Theory and Application, volume 359 of Communications in Computer and Information Science, pages 383-398. Springer Berlin Heidelberg, 2013.
[20] M. Zelenak and J. Hernando. The detection of overlapping speech with prosodic features for speaker diarization. In Proc. Interspeech, 2011.
[21] M. Zelenak, C. Segura, J. Luque, and J. Hernando. Simultaneous speech detection with spatial features for speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):436-446, 2012.