=Paper=
{{Paper
|id=Vol-1739/MediaEval_2016_paper_49
|storemode=property
|title=HCMUS team at the Multimodal Person Discovery in Broadcast TV Task of MediaEval 2016
|pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_49.pdf
|volume=Vol-1739
|dblpUrl=https://dblp.org/rec/conf/mediaeval/NguyenNCNLNT16
}}
==HCMUS team at the Multimodal Person Discovery in Broadcast TV Task of MediaEval 2016==
HCMUS team at the Multimodal Person Discovery in Broadcast TV Task of MediaEval 2016

Vinh-Tiep Nguyen, Manh-Tien H. Nguyen, Quoc-Huu Che, Van-Tu Ninh, Tu-Khiem Le, Thanh-An Nguyen, Minh-Triet Tran
Faculty of Information Technology, University of Science, Vietnam National University-Ho Chi Minh City
nvtiep@fit.hcmus.edu.vn, {nhmtien, cqhuu, nvtu, ltkhiem}@apcs.vn, 1312016@student.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

We present the method of the HCMUS team participating in the Multimodal Person Discovery in Broadcast TV Task at the MediaEval Challenge 2016. Our method consists of two main processes. First, we identify a list of potential characters of interest from all video clips; each potential character is defined as a pair of a face track (a sequence of face patches) and a name. We use OCR results and face detection to find potential characters, and we apply several simple techniques to check the consistency of linking a name with a face track, reducing potentially wrong matching pairs. Second, we detect face patches from test video shots with cascade DPM, extract deep features from the face patches using a very deep Convolutional Neural Network, and classify the faces with SVM classifiers.

1. INTRODUCTION

The objective of the Multimodal Person Discovery in Broadcast TV Task is to automatically find the appearances of main characters in a large dataset of broadcast TV clips [1]. The name of a person can be introduced by caption text or by speech in conversation.

In our approach to this task, we use two types of data: caption text and visual data. Our method has two main processes: (i) identify characters of interest and their names; and (ii) discover persons using face recognition with deep features.

In the first process, we propose a Main Text Verification step based on font size to select text phrases that can be used with high confidence as character names. If multiple main phrases are detected in a single frame, we consider the frame ambiguous and eliminate it. We then extract faces at the timestamps of the names found by OCR. Only faces of significant size are chosen to be linked with names. Furthermore, a frame containing multiple large faces is discarded, as we may not associate a face correctly with a name. Finally, if the group of faces associated with a single name contains several large sub-groups of faces of different persons, the group is considered unreliable and discarded.

After the first process, we obtain a list of evidence entries containing names and associated faces. In the second process, we use VGG-Face [2] to extract deep features from face patches and train SVM models on these 4096-dimensional features to classify all main characters found in the first process. We use cascade DPM [3,4] to detect and crop facial areas from frames, which boosts the accuracy of the SVM classifiers. Each face patch extracted from a test video shot is then classified with the trained SVM models.

2. IDENTIFY CHARACTERS OF INTEREST

Many persons appear in the video clips, but we only focus on main characters: persons appearing clearly in a frame whose names are explicitly introduced either by caption text or by speech. The process to identify characters of interest includes three main phases: detect names, identify name-face pairs, and verify the consistency of the pairs.

In the name detection phase (cf. Figure 1), caption text is first extracted from video frames with an OCR module. A Main Text Verification step then filters out text phrases that cannot be used with high confidence as character names. We apply simple techniques for name detection and finally obtain a list of names and their corresponding timestamps (in video shots).

Figure 1. Name Detection Process

In the Main Text Verification module, since the name of a character is usually displayed in a large font, we eliminate text phrases with small sizes: only the text phrases with the largest font in a frame are selected (Figure 2a). We also discard frames containing multiple text phrases of the same largest size, because we cannot link a name with a detected main face in such frames with high confidence (Figures 2b and 2c). This situation often occurs in the introduction of a show, in a scrolling list of persons at the end of a film, or in the multiple-choice questions of a game show.

Figure 2. Main text verification: (a) main text phrase; (b) multiple main text phrases; (c) names in a game-show question
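The font-size rule above can be stated compactly. Below is a minimal sketch of the Main Text Verification step, assuming the OCR module returns each frame's phrases as (text, font height) tuples; the function name, the tuple format, and the pixel tolerance are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of Main Text Verification: keep a frame's single
# largest-font OCR phrase as a name candidate, or reject the frame.

def verify_main_text(ocr_phrases, tol=2):
    """Return the single largest-font phrase in a frame, or None.

    ocr_phrases: list of (text, font_height_in_pixels) tuples.
    tol: pixel tolerance when comparing font heights (assumed value).
    """
    if not ocr_phrases:
        return None
    max_h = max(h for _, h in ocr_phrases)
    # Keep only phrases whose font height ties with the largest one.
    largest = [t for t, h in ocr_phrases if max_h - h <= tol]
    # Multiple equally large phrases make the frame ambiguous
    # (show intros, end credits, game-show questions): discard it.
    if len(largest) != 1:
        return None
    return largest[0]

# Example: one dominant caption yields a candidate name, a tie yields None.
print(verify_main_text([("JOHN DOE", 48), ("breaking news", 20)]))  # JOHN DOE
print(verify_main_text([("A. SMITH", 40), ("B. JONES", 40)]))       # None
```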
Face patches are then extracted from each frame at the timestamp of a potential name. Since the name of a character usually appears when the face of that person is clearly visible in a frontal pose, we use the Viola-Jones face detector [5] in this step. In the Main Face Verification step, a face is considered a main face if it is large enough; in our experiments, the threshold for a main face is 7% of the total frame area. A frame with more than one main face is discarded in this step, as we may not link a name correctly with a main face (Figure 5a). The output of this step is a list of name-face pairs.

Figure 3. Pair Identification Process

In the Face Consistency Verification process (Figure 4), if many different faces are associated with a given name, we discard the corresponding pairs. A typical example is a name that belongs to a program rather than a person and therefore appears throughout the program (Figure 5b).

Figure 4. Face Consistency Verification

Figure 5. Difficult cases for name-face pairs: (a) multiple main faces; (b) program name and multiple persons

3. FACE RECOGNITION USING DEEP FEATURES

We use face recognition to find the shots containing the persons whose names were recognized from caption text by the OCR algorithm. Starting from an evidence entry of a video that contains only one person (to make sure that the face and the name are associated), we track the face bounding boxes over the following video frames. For this face track, we extract deep features from the face patches using a very deep Convolutional Neural Network (VGG-Face [2]). After this module, each face patch is represented by a 4096-dimensional feature vector.

Although this feature is designed to fit the L2 distance metric best, there is still a large gap in performance. This can be explained by the fact that the components of a face feature vector should not all be weighted equally, and the component weights differ from face to face. Therefore, we propose to learn from these features with a large-margin classifier: all features of an evidence face are collected to train a Support Vector Machine (SVM) with a linear kernel. For negative examples, we collect face features from the other persons in the evidence file. To further improve recognition performance, we use cross-validation with k=5 folds. After this step, each person of an evidence entry is represented by a face classifier, which is used to recognize that person in all the test video shots, starting from the current face track. We apply the cascade DPM detector [3,4] to the image region near the face track given in the metadata to extract the face patches, which are then fed to the VGG-Face network. Using this detector instead of other face detectors, such as the Viola-Jones one, improves performance because the network is trained on face patches extracted by this algorithm.

In the testing phase, each face candidate detected in a frame of a shot is passed to the face classifier and scored with a confidence value: a positive value means the candidate face is likely to be the person the classifier was trained on, and vice versa.

Figure 6. Person Discovery with Face Recognition
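As a concrete illustration of the per-person classifier described above, the sketch below trains a linear-kernel SVM on 4096-dimensional descriptors and checks it with 5-fold cross-validation, mirroring the k=5 setup in the text. It uses scikit-learn's LinearSVC; the random vectors are stand-ins for real VGG-Face features, and all sizes and values are illustrative assumptions, not the paper's actual data.

```python
# Sketch: one linear SVM per evidence person, positives from that person's
# face track, negatives from the other persons in the evidence file.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(40, 4096))    # stand-in: this person's faces
neg = rng.normal(-0.5, 1.0, size=(200, 4096))  # stand-in: other persons' faces

X = np.vstack([pos, neg])
y = np.array([1] * len(pos) + [-1] * len(neg))

clf = LinearSVC(C=1.0)
# 5-fold cross-validation, as in the k=5 setup described in the text.
print(cross_val_score(clf, X, y, cv=5).mean())
clf.fit(X, y)  # final classifier for this evidence person
```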
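Continuing the sketch above, scoring a face candidate from a test shot then reduces to the sign of the SVM decision value, which plays the role of the confidence score described in the testing phase; `candidate` stands in for the 4096-dimensional descriptor of one cascade-DPM face patch.

```python
# Hypothetical test-time scoring: positive decision value = likely match.
candidate = rng.normal(0.5, 1.0, size=(1, 4096))
score = clf.decision_function(candidate)[0]
if score > 0:
    print(f"candidate matches this person (confidence {score:.2f})")
```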
4. CONCLUSION AND FUTURE WORK

In our current approach, we focus our effort on refining the names of the main characters appearing in broadcast TV clips and on applying VGG-Face and SVM to recognize face patches extracted with cascade DPM. As audio data is not utilized in our method, we miss information about persons introduced by speech in conversation; we will therefore continue to exploit this data type to obtain additional evidence of person introductions. Moreover, because we want to boost the accuracy of the trained SVM classifiers, we use cascade DPM to extract face patches. This process, together with the extraction of deep features with VGG-Face, is time consuming, so by the end of the challenge we still had shots left to process. We are currently revising our method and will use it to process the whole challenge dataset again.

5. ACKNOWLEDGEMENT

We would like to express our appreciation to the Multimedia Processing Lab, University of Information Technology, VNU-HCM, and the Computational Imaging Group at the University of Illinois at Urbana-Champaign for their support with computing infrastructure for our team in this challenge.

6. REFERENCES

[1] Bredin, H., Barras, C., Guinaudeau, C. 2016. Multimodal Person Discovery in Broadcast TV at MediaEval 2016. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.

[2] Parkhi, O. M., Vedaldi, A., Zisserman, A. 2015. Deep Face Recognition. In Proc. of the British Machine Vision Conference (BMVC) 2015.

[3] Wolf, L., Hassner, T., Maoz, I. 2011. Face Recognition in Unconstrained Videos with Matched Background Similarity. In Proc. of Computer Vision and Pattern Recognition (CVPR) 2011.

[4] Mathias, M., Benenson, R., Pedersoli, M., Van Gool, L. 2014. Face Detection without Bells and Whistles. In Proc. of the European Conference on Computer Vision (ECCV) 2014.

[5] Viola, P., Jones, M. 2001. Rapid Object Detection Using a Boosted Cascade of Simple Features. In Proc. of Computer Vision and Pattern Recognition (CVPR) 2001.