        HCMUS team at the Multimodal Person Discovery in
             Broadcast TV Task of MediaEval 2016

                 Vinh-Tiep Nguyen, Manh-Tien H. Nguyen, Quoc-Huu Che, Van-Tu Ninh,
                            Tu-Khiem Le, Thanh-An Nguyen, Minh-Triet Tran
                                               Faculty of Information Technology
                              University of Science, Vietnam National University-Ho Chi Minh city
                     nvtiep@fit.hcmus.edu.vn, {nhmtien, cqhuu, nvtu, ltkhiem}@apcs.vn,
                         1312016@student.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

ABSTRACT
We present the method of the HCMUS team participating in the
Multimodal Person Discovery in Broadcast TV Task at the
MediaEval Challenge 2016. There are two main processes in our
method. First, we identify a list of potential characters of interest
from all video clips. Each potential character is defined as a pair
of a face track (a sequence of face patches) and a name. We use
OCR results and face detection to find potential characters. We
also apply several simple techniques to check the consistency of
linking a name with a face track, in order to reduce potentially
wrong matching pairs. Then we detect face patches from test
video shots with cascade DPM, extract deep features from the
face patches using a very deep Convolutional Neural Network,
and classify the faces using SVM.
1. INTRODUCTION
The objective of the Multimodal Person Discovery in Broadcast
TV Task is to automatically find the appearances of main
characters in a large dataset of broadcast TV clips [1]. The name
of a person can be introduced by caption text or via speech in
conversation.

In our approach to this task, we use two types of data: text from
captions and visual data. There are two main processes in our
method: (i) identify characters of interest and their names; and
(ii) discover persons using face recognition with deep features.

In the first process, we propose the Main Text Verification step,
based on font size, to select text phrases that can be used with
high confidence as character names. If multiple main phrases are
detected in a single frame, we consider it an ambiguous frame and
eliminate it. Then we extract faces at the timestamps at which
OCR found a name. Only faces of significant size are chosen to
link with names. Besides, a frame containing multiple large faces
is discarded, as we may not associate a face correctly with a
name. Finally, if the group of faces associated with a single name
contains large sub-groups of faces of multiple persons, this group
is not reliable and should be discarded.

After the first process, we obtain a list of evidence entries
containing names and associated faces. In the second process, we
use VGG-Face [2] to extract deep features from face patches and
train SVM models on these 4096-dimensional features to classify
all main characters found in the first process. We use Cascade
DPM [3,4] to detect and crop facial areas from frames to boost
the accuracy of the SVM classifiers. Each face patch extracted
from a test video shot is then classified with the trained SVM
models.
2. IDENTIFY CHARACTERS OF INTEREST
There are many persons in video clips. However, we only focus
on main characters: persons appearing clearly in a frame whose
names are explicitly introduced either by caption text or by
speech.

The process to identify characters of interest includes three main
phases: detect names, identify <name, face> pairs, and verify the
consistency of the <name, face> pairs.

In the name detection phase (cf. Figure 1), caption text is first
extracted from video frames with an OCR module. The Main
Text Verification step then filters out text phrases that cannot be
used with high confidence as character names. We apply simple
techniques for name detection and finally obtain a list of names
and their corresponding timestamps (in video shots).

Figure 1. Name Detection Process
In the Main Text Verification module, as the name of a character
is usually displayed in a large font, we eliminate text phrases of
small size. Only the text phrases with the largest font in a frame
are selected (Figure 2a). Besides, we also discard frames having
multiple text phrases with the same largest size, because we may
not link a name with a detected main face in such frames with
high confidence (Figures 2b and 2c). This situation often occurs
in the introduction of a show, in a scrolling list of persons at the
end of a film, or in a multiple-choice question of a game show.

(a) Main text phrase    (b) Multiple main text phrases
(c) Names in a question of a game show
Figure 2. Main text verification

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
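As a concrete illustration of the Main Text Verification rule, the
following minimal sketch keeps only the phrase with the largest
font in a frame and rejects ambiguous frames. It is our
reconstruction, not the authors' released code: the use of
pytesseract and the 10% height tolerance for "same largest size"
are our own assumptions.

    # Sketch of Main Text Verification (our reconstruction). A phrase is a
    # line of OCR words; its font size is the height of its tallest word.
    import pytesseract
    from pytesseract import Output
    from PIL import Image

    def main_text_phrase(frame_path, tol=0.9):
        """Return the main text phrase of a frame, or None if ambiguous."""
        d = pytesseract.image_to_data(Image.open(frame_path),
                                      output_type=Output.DICT)
        phrases = {}
        for txt, h, b, l in zip(d["text"], d["height"],
                                d["block_num"], d["line_num"]):
            if txt.strip():
                words, size = phrases.get((b, l), ([], 0))
                phrases[(b, l)] = (words + [txt], max(size, h))
        if not phrases:
            return None
        max_h = max(size for _, size in phrases.values())
        # Phrases whose font height is (nearly) the largest in the frame.
        largest = [" ".join(w) for w, size in phrases.values()
                   if size >= tol * max_h]
        # Several equally large phrases => ambiguous frame, discard it.
        return largest[0] if len(largest) == 1 else None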
Face patches are extracted from each frame corresponding to the
timestamp of a potential name. As the name of a character usually
appears when the face of that person can be seen clearly in a
frontal pose, we use the Viola-Jones face detector [5] in this step.
In the Main Face Verification step, a face is considered a main
face if its size is large enough; in our experiments, we set the
threshold for a main face to 7% of the total frame area. In this
step, a frame with more than one main face is discarded, as we
may not correctly link a name with a main face (Figure 5a). The
output of this step is a list of <name, face> pairs.

Figure 3. <name, face> Pair Identification Process
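The Main Face Verification rule can be sketched with OpenCV's
Viola-Jones (Haar cascade) detector, as below; the 7% area
threshold is from the paper, while the detector parameters are
common defaults rather than values the authors report.

    # Sketch of Main Face Verification (our illustration, not the authors'
    # code): keep a face only if it covers >= 7% of the frame area, and
    # reject frames that contain more than one such "main face".
    import cv2

    FACE_AREA_RATIO = 0.07

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def main_face(frame):
        """Return the single main face (x, y, w, h), else None."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frame_area = frame.shape[0] * frame.shape[1]
        faces = detector.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
        # Keep only faces large enough to count as a "main face".
        main = [f for f in faces
                if f[2] * f[3] >= FACE_AREA_RATIO * frame_area]
        # Frames with several main faces are ambiguous and discarded.
        return tuple(main[0]) if len(main) == 1 else None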
In the Face Consistency Verification process (Figure 4), if many
different faces are associated with a given name, we discard the
corresponding <name, face> pairs. An example of such a situation
is a name that belongs to a program rather than to a person, and
that therefore appears throughout the program (Figure 5b).

Figure 4. Face Consistency Verification

(a) Multiple main faces    (b) Program's name and multiple persons
Figure 5. Difficult Cases for <name, face> Pairs
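The paper only states that "simple techniques" check this
consistency, so the following is just one possible reading: cluster
the face descriptors attached to a name and reject the name if
they split into several large groups. The clustering method
(DBSCAN) and both thresholds are our assumptions.

    # One possible reading of Face Consistency Verification, not the
    # authors' specification.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def name_is_consistent(face_features, eps=0.6, min_frac=0.3):
        """Reject a name whose faces split into several large groups."""
        X = np.asarray(face_features, dtype=np.float32)
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize
        labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
        sizes = (np.bincount(labels[labels >= 0])
                 if (labels >= 0).any() else [])
        # "Large" sub-groups hold at least min_frac of the name's faces.
        large = [s for s in sizes if s >= min_frac * len(X)]
        return len(large) <= 1  # several large groups => unreliable name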
3. FACE RECOGNITION USING DEEP FEATURES
We propose to use face recognition to find the shots containing
the persons whose names were recognized from caption text by
the OCR algorithm. Starting from an evidence entry of a video
that contains only one person (to make sure that the face and the
name are associated), we track the face bounding boxes over the
subsequent video frames. For this face track, we extract deep
features from the face patches using a very deep Convolutional
Neural Network (VGG-Face [2]).
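A sketch of this 4096-dimensional feature extraction is given
below. Since VGG-Face shares the VGG-16 architecture, we
illustrate with torchvision's VGG-16 and take the penultimate
fully connected (fc7) activation; loading the actual VGG-Face
weights (assumed available as a converted state dict, hypothetical
path shown) and their exact preprocessing are left as assumptions.

    # Sketch of deep feature extraction from a face patch. torchvision's
    # VGG-16 stands in for VGG-Face (same architecture); in the paper the
    # network carries VGG-Face weights, assumed available separately.
    import torch
    from torchvision import models, transforms

    vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
    # vgg.load_state_dict(torch.load("vgg_face.pth"))  # hypothetical weights
    vgg.classifier = vgg.classifier[:5]  # stop after fc7: 4096-d output
    vgg.eval()

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        # ImageNet statistics; VGG-Face uses its own mean subtraction.
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def face_descriptor(face_patch):  # face_patch: PIL image of a face
        with torch.no_grad():
            return vgg(preprocess(face_patch).unsqueeze(0)).squeeze(0)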
After this module, each face patch is represented by a 4096-
dimensional feature vector. Although this feature is designed to
work best with the L2 distance metric, there is still a big gap in
performance. This could be explained by the fact that the
components of the face feature vector should not all carry the
same weight: for each face, the weights of the components are
different. Therefore, we propose to learn these weights with a
large-margin classifier. All features of an evidence face are
collected to train a Support Vector Machine (SVM) with a linear
kernel.

For negative examples, we collect face features from the other
persons in the evidence file. To further improve the recognition
performance, we use cross-validation with k=5 folds. After this
step, each person of an evidence entry is represented by a face
classifier. This classifier is used to recognize the person appearing
in all test video shots, starting from the current face track. We
apply the Cascade DPM detector [3,4] on the image region near
the face track of the metadata to extract face patches, which are
then passed to the VGG-Face network. Using this detector instead
of other face detectors, such as the Viola-Jones one, improves the
performance because the network is trained on face patches
extracted by this algorithm.

In the testing phase, each face candidate detected in a frame of a
shot is passed to the face classifier and scored with a confidence
value. A positive value means the candidate face is likely to
belong to the person the classifier was trained on, and vice versa.

Figure 6. Person Discovery with Face Recognition
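The per-person classifier can be sketched with scikit-learn: a
linear SVM over the 4096-d descriptors of one person's evidence
faces against descriptors of the other persons, with 5-fold
cross-validation as described above. The C grid is our assumption;
only the recipe (linear kernel, k=5 folds, signed confidence)
follows the paper.

    # Sketch of the per-person face classifier: linear SVM over deep
    # features, negatives drawn from other persons' evidence faces.
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    def train_face_classifier(pos_feats, neg_feats):
        X = np.vstack([pos_feats, neg_feats])
        y = np.r_[np.ones(len(pos_feats)), np.zeros(len(neg_feats))]
        # 5-fold cross-validation to pick the regularization strength.
        search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=5)
        return search.fit(X, y).best_estimator_

    def confidence(clf, face_feat):
        # Positive => the candidate likely matches the trained person.
        return float(clf.decision_function(face_feat.reshape(1, -1))[0])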
4. CONCLUSION AND FUTURE WORK
In our current approach, we focus our effort on refining the
names of the main characters appearing in broadcast TV clips,
and on applying VGG-Face and SVM to recognize face patches
extracted with cascade DPM. As audio data is not utilized in our
method, we miss information about a person coming from speech
in conversation. Thus, we will continue to exploit this data type
to obtain potential information about person introductions.

Besides, as we want to boost the accuracy of the trained SVM
classifiers, we use cascade DPM to extract face patches. This
process, together with the extraction of deep features with
VGG-Face, is time-consuming. Therefore, by the end of the
challenge, we still had shots left to be processed. We are currently
revising our method and will use it to process the whole challenge
data again.

5. ACKNOWLEDGEMENT
We would like to express our appreciation to the Multimedia
Processing Lab, University of Information Technology, VNU-
HCM, and the Computational Imaging Group at the University of
Illinois at Urbana-Champaign for their support with computing
infrastructure for our team in this challenge.

6. REFERENCES
[1] Bredin, H., Barras, C., Guinaudeau, C. 2016. Multimodal
    Person Discovery in Broadcast TV at MediaEval 2016. In
    Proc. of the MediaEval 2016 Workshop, Hilversum,
    Netherlands, Oct. 20-21, 2016.
[2] Parkhi, O. M., Vedaldi, A., Zisserman, A. 2015. Deep Face
    Recognition. In Proc. of the British Machine Vision
    Conference (BMVC) 2015.
[3] Wolf, L., Hassner, T., Maoz, I. 2011. Face Recognition in
    Unconstrained Videos with Matched Background Similarity.
    In Proc. of Computer Vision and Pattern Recognition
    (CVPR) 2011.
[4] Mathias, M., Benenson, R., Pedersoli, M., Van Gool, L.
    2014. Face Detection without Bells and Whistles. In Proc.
    of the European Conference on Computer Vision (ECCV)
    2014.
[5] Viola, P., Jones, M. 2001. Rapid Object Detection Using a
    Boosted Cascade of Simple Features. In Proc. of Computer
    Vision and Pattern Recognition (CVPR) 2001.