=Paper=
{{Paper
|id=Vol-1739/MediaEval_2016_paper_46
|storemode=property
|title=EUMSSI Team at the MediaEval Person Discovery Challenge 2016
|pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_46.pdf
|volume=Vol-1739
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LeMO16
}}
==EUMSSI Team at the MediaEval Person Discovery Challenge 2016==
Nam Le (1,2), Sylvain Meignier (3), Jean-Marc Odobez (1,2)
(1) Idiap Research Institute, Martigny, Switzerland
(2) École Polytechnique Fédérale de Lausanne, Switzerland
(3) LIUM, University of Maine, Le Mans, France
{nle, odobez}@idiap.ch, sylvain.meignier@univ-lemans.fr

ABSTRACT

We present the results of the EUMSSI team's participation in the Multimodal Person Discovery task. The goal is to identify all people who simultaneously appear and speak in a video corpus. In the proposed system, besides improving each modality, we emphasize the ranking of multiple results from both the audio and the visual stream.

1. INTRODUCTION

As the retrieval of information on people in videos is of high interest to users, algorithms that index the identities of people and retrieve their respective quotations are indispensable for searching archives. This practical need leads to research problems on how to identify people's presence in videos. Given raw TV broadcasts, each shot must be automatically tagged with the name(s) of the people who can be both seen and heard in the shot, along with a confidence score. The list of people is not known a priori, and their names must be discovered from the video text overlay or from speech transcripts [6]. To this end, a video must be segmented in an unsupervised way into segments that are homogeneous with respect to person identity, as in speaker diarization and face diarization, and these segments must then be combined with the extracted names. Our goal is to benchmark our recent improvements in all components and to address the fusion of multimodal results.

2. PROPOSED SYSTEM

The system we propose is illustrated in Fig. 1. It consists of four main parts: video optical character recognition (OCR) and named entity recognition (NER), face diarization, speaker diarization, and fusion naming.

[Figure 1: Architecture of our system]

2.1 Video OCR and NER

To detect OCR segments in videos and exploit them for retrieval, we rely on the approaches described in [2, 1] for text recognition in videos, and on [3, 15] for text recognition and indexing. In brief, given an input video, two main steps are applied: first, the video is preprocessed with a motion filter to reduce noise; then, individual frames are processed to localize and binarize the text regions for text recognition. Compared to printed documents, OCR in TV news videos faces several challenges: low resolution of the text regions, sequences of different texts displayed continuously, small amounts of text to be recognized, etc. To tackle these, multiple image segmentations of the same text region are decoded, and all results are compared and aggregated over time to produce several hypotheses. The best hypothesis is used to extract people's names for identification. To recognize names in the text, we use the open-source MITIE library (https://github.com/mit-nlp/MITIE), which provides a state-of-the-art NER tool. To improve the raw MITIE results, a heuristic preprocessing step identifies the names of editorial staff based on their roles (cameraman, editor, or writer); such people do not appear within the video and are thus not useful for identification.
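As an illustration, the following minimal Python sketch drops person names that share an overlay line with a staff-role keyword. The keyword list, the ocr_lines input layout, and the extract_person_names wrapper around the NER tool are simplifying assumptions for illustration, not the exact implementation.

```python
import re

# Roles whose names label off-screen editorial staff rather than
# on-screen people (illustrative keyword list, an assumption).
ROLE_KEYWORDS = {"cameraman", "camera", "editor", "writer"}

def filter_editorial_names(ocr_lines, extract_person_names):
    """Keep only person names not tagged with a staff role.

    ocr_lines: list of recognized overlay text lines (best OCR hypotheses).
    extract_person_names: callable wrapping the NER tool (e.g. MITIE),
        returning the PERSON entities found in one line of text.
    """
    kept = []
    for line in ocr_lines:
        tokens = {tok.lower() for tok in re.findall(r"\w+", line)}
        # If a role keyword shares the line with a name, the name most
        # likely belongs to editorial staff: skip the whole line.
        if tokens & ROLE_KEYWORDS:
            continue
        kept.extend(extract_person_names(line))
    return kept
```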
2.2 Face diarization

Given the video shots, the face diarization process consists of (i) face detection, (ii) face tracking, and (iii) face clustering.

Detection & tracking. Detecting and associating faces can be challenging due to the wide range of media content, where faces appear with varied illumination and noise. To overcome these challenges, we use a fast version of the deformable part-based model (DPM) [5, 11, 4] to detect faces at multiple poses and with large appearance variation. Tracking is performed using the CRF-based multi-target tracking framework [7], which relies on the unsupervised learning of time-sensitive association costs for different features. Because detection is the bottleneck of the system, the detector is applied only 4 times per second. We also trained an explicit false alarm classifier at the track level to efficiently filter out false tracks. Further details can be found in [9].

Face clustering. We hierarchically merge face tracks across all shots using matching and biometric similarity measures, similarly to [8], with two improvements: shot-constrained face clustering (SCFC) and the use of total variability modeling (TVM). SCFC is a divide-and-conquer strategy: face clustering is first applied only within each group of similar shots; the resulting face clusters, which are now much fewer in number, are then hierarchically merged. TVM is a state-of-the-art biometrics method that can represent faces appearing in widely different contexts and sessions [17, 16]. To compute the similarity between two face clusters, we simply use the average distance over all pairs of faces, with the cosine distance between i-vectors.

2.3 Speaker diarization

The speaker diarization system is based on the LIUM Speaker Diarization system [14], which is publicly distributed (www-lium.univ-lemans.fr/en/content/liumspkdiarization). It is provided to all participants as the baseline method.

2.4 Identification and result ranking

After obtaining homogeneous clusters during which distinct identities speak or appear, one needs to assign each name output by the NER module to the correct clusters. However, associating voices with visual person clusters or names has two major difficulties: the visible person may not be the current speaker, and the speaking person may be dubbed by a narrator in a different language. Although we have introduced a temporal learning method to solve the dubbing problem [10], incorporating it into an audio-visual diarization system is still an open question. Because of these problems of audio-visual association, we use a direct naming method [13], which finds the mapping between clusters and names that maximizes their co-occurrence.

Identification. Names are propagated based on the outputs of face diarization and speaker diarization independently. The direct naming method is applied to speaker clusters to produce a mapping between names and clusters; all shots that overlap with a cluster are tagged with the corresponding names, with equal confidence scores. The same direct method is applied to face clusters to produce a set of named clusters. Unlike in speaker naming, for a given shot, a name coming from face naming is ranked based on the talking score of the cluster's segment within that shot. The talking score is predicted using lip motion and temporal modeling with an LSTM [10]. Based on these two results, we propose a strategy to appropriately combine them.

Ranking. Let S = {s_k} be the list of test shots. Within each shot, {(N_i^F, t(N_i^F))} is the set of names returned by face naming with their corresponding talking scores, and {(N_j^A, 1.0)} is the set of names returned by speaker naming, each ranked equally with score 1.0. Names on which the two methods agree are ranked highest. Then, names from face naming are ranked higher than names from speaker naming, because we found face naming to be more reliable in empirical experiments; alternative strategies that rank speaker naming equal to or higher than face naming gave inferior results. Our ranking strategy is described in Algorithm 1, and transcribed into code below.

Algorithm 1: Ranking names within shots
 1: for s_k ∈ S do
 2:   Q_{s_k} = ∅
 3:   Face naming(s_k) ⇒ (N_i^F, t(N_i^F))
 4:   Speaker naming(s_k) ⇒ (N_j^A, 1.0)
 5:   for each N_i^F do
 6:     if ∃ N_j^A : N_j^A = N_i^F then
 7:       Q_{s_k} = Q_{s_k} ∪ {(N_i^F, t(N_i^F) + 2.0)}
 8:     else
 9:       Q_{s_k} = Q_{s_k} ∪ {(N_i^F, t(N_i^F) + 1.0)}
10:   for each N_j^A do
11:     if ¬∃ N_i^F : N_i^F = N_j^A then
12:       Q_{s_k} = Q_{s_k} ∪ {(N_j^A, 1.0)}
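For concreteness, Algorithm 1 translates directly into the following Python sketch for a single shot; the per-shot data layout (a dict mapping face names to talking scores and a set of speaker names) is an assumption for illustration.

```python
def rank_names(face_names, speaker_names):
    """Merge per-shot face and speaker naming results (Algorithm 1).

    face_names: dict mapping a name N_i^F to its talking score t(N_i^F).
    speaker_names: set of names N_j^A, each implicitly scored 1.0.
    Returns a dict mapping each name to its final ranking score Q_{s_k}.
    """
    ranked = {}
    for name, talking_score in face_names.items():
        if name in speaker_names:
            # Both modalities agree: ranked highest.
            ranked[name] = talking_score + 2.0
        else:
            # Face naming alone still outranks speaker naming.
            ranked[name] = talking_score + 1.0
    for name in speaker_names:
        if name not in face_names:
            ranked[name] = 1.0
    return ranked

# Example: face naming is confident about "Alice"; both streams agree on "Bob".
scores = rank_names({"Alice": 0.9, "Bob": 0.4}, {"Bob", "Carol"})
# -> {"Alice": 1.9, "Bob": 2.4, "Carol": 1.0}
```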
Further fusion. Finally, replacing individual components of our system with the baseline NER [12] or the baseline face diarization (http://pyannote.github.io/) produces complementary results. Therefore, these results are added to our final submission with lower confidence scores.

3. EVALUATION

Participants are scored on a set of queries. Each query is a person name occurring in the corpus, and each participant has to return all shots in which that person appears and talks. The metric is Mean Average Precision (MAP) over all queries. In Table 1, we report our results on the test set as of 24/09/2016 (the ground truth is still being updated by a collaborative annotation process).

Table 1: Benchmarking results of our submissions. Details of each submission are given in the text.

            MAP@1   MAP@10   MAP@100
Sub. (1)    30.3    22.0     21.0
Sub. (2)    58.6    42.9     42.0
Sub. (3)    64.2    53.1     52.1
Sub. (4)    68.3    56.2     54.7
Sub. (5)    79.2    65.2     63.4

Each of our five submissions (Sub.) is as follows:
• Sub. (1) and Sub. (2) used our face naming without the talking score, with the baseline OCR-NER (1) or with our OCR-NER (2).
• Sub. (3) used our face naming with the talking score.
• Sub. (4) combined the talking face naming of Sub. (3) with speaker naming.
• Sub. (5) combined Sub. (4) with other systems using the baseline OCR-NER or the baseline face diarization. This is also our primary submission.

When comparing Sub. (1) and Sub. (2), one can observe that our OCR-NER outperforms the baseline OCR-NER by a large margin. This may be attributed to the high recall of our system: because the metric is averaged over all queries, any missing name can significantly decrease the overall MAP. False names, on the other hand, are less problematic for two reasons: they may not be associated with any cluster, and they are not queried at all. In Sub. (3), using talking-face detection with an LSTM, we further improve MAP@1 by 5.6%. By combining face naming and speaker naming, we manage to increase the precision, which shows the potential of further research into better audio-visual naming. In our primary submission, Sub. (5), the results are greatly boosted when the other methods are added. From this we note that these methods are complementary, and how best to exploit their advantages remains an open question for future work.
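For reference, the sketch below shows one common way of computing the MAP@k figures reported in Table 1: average precision per queried name, averaged over all queries. The truncation and normalization details are assumptions; the official scoring tool may differ.

```python
def average_precision(returned_shots, relevant_shots, k):
    """AP@k for one query: returned_shots is a ranked list of shot ids,
    relevant_shots the ground-truth set of shots for that person."""
    if not relevant_shots:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(returned_shots[:k], start=1):
        if shot in relevant_shots:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the best achievable number of hits within the cutoff.
    return precision_sum / min(len(relevant_shots), k)

def mean_average_precision(results, ground_truth, k=10):
    """MAP@k over all queries (person names).

    results: dict mapping a queried name to its ranked list of shot ids.
    ground_truth: dict mapping a queried name to its set of relevant shots.
    """
    queries = list(ground_truth)
    return sum(
        average_precision(results.get(q, []), ground_truth[q], k)
        for q in queries
    ) / len(queries)
```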
4. CONCLUSION

We have presented our system for the MediaEval 2016 challenge. The system builds on our recent advances in video processing and temporal modeling. Although each modality shows positive performance, the current system does not yet take full advantage of both the audio and visual streams; the test results therefore serve as a basis for further work in this direction.

Acknowledgement. This research was supported by the European Union project EUMSSI (FP7-611057).

5. REFERENCES

[1] D. Chen and J.-M. Odobez. Video text recognition using sequential Monte Carlo and error voting methods. Pattern Recognition Letters, 26(9):1386–1403, 2005.
[2] D. Chen, J.-M. Odobez, and H. Bourlard. Text detection and recognition in images and video frames. Pattern Recognition, 37(3):595–608, 2004.
[3] N. Daddaoua, J.-M. Odobez, and A. Vinciarelli. OCR based slide retrieval. In Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pages 945–949. IEEE, 2005.
[4] C. Dubout and F. Fleuret. Deformable part models with individual part scaling. In BMVC, 2013.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[6] H. Bredin, C. Guinaudeau, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2016. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 2016.
[7] A. Heili, A. Lopez-Mendez, and J.-M. Odobez. Exploiting long-term connectivity and visual motion in CRF-based multi-person tracking. IEEE Transactions on Image Processing, 23(7):3040–3056, 2014.
[8] E. Khoury, P. Gay, and J.-M. Odobez. Fusing matching and biometric similarity measures for face diarization in video. In ACM ICMR, 2013.
[9] N. Le, A. Heili, D. Wu, and J.-M. Odobez. Temporally subsampled detection for accurate and efficient face tracking and diarization. In International Conference on Pattern Recognition. IEEE, Dec. 2016.
[10] N. Le and J.-M. Odobez. Learning multimodal temporal representation for dubbing detection in broadcast media. In ACM Multimedia. ACM, Oct. 2016.
[11] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, pages 720–735. Springer, 2014.
[12] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In 2012 IEEE International Conference on Multimedia and Expo (ICME), pages 854–859. IEEE, 2012.
[13] J. Poignant, H. Bredin, V.-B. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Interspeech, 2012.
[14] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin, and S. Meignier. An open-source state-of-the-art toolbox for broadcast news diarization. In Interspeech, Lyon, France, Aug. 2013.
[15] A. Vinciarelli and J.-M. Odobez. Application of information retrieval technologies to presentation slides. IEEE Transactions on Multimedia, 8(5):981–995, 2006.
[16] R. Wallace and M. McLaren. Total variability modelling for face verification. IET Biometrics, 1(4):188–199, 2012.
[17] R. Wallace, M. McLaren, C. McCool, and S. Marcel. Inter-session variability modelling and joint factor analysis for face authentication. In 2011 International Joint Conference on Biometrics (IJCB), pages 1–8. IEEE, 2011.