TokyoTech at MediaEval 2016 Multimodal Person Discovery in Broadcast TV Task

Fumito Nishi1, Nakamasa Inoue1, Koji Iwano2, Koichi Shinoda1
1 Tokyo Institute of Technology, Tokyo, Japan
2 Tokyo City University, Kanagawa, Japan
{nishi, inoue, shinoda}@ks.cs.titech.ac.jp, iwano@tcu.ac.jp

ABSTRACT
This paper describes our diarization system for the Multimodal Person Discovery in Broadcast TV task of the MediaEval 2016 benchmark evaluation campaign [1]. The goal of the task is to name speakers who appear and speak simultaneously in a video, without prior knowledge. Our system relies on a face diarization approach: we extract deep features from each detected face every 0.5 seconds, build visual i-vectors from them, cluster the i-vectors, and associate the clustering results with optical character recognition.

1. INTRODUCTION
The Multimodal Person Discovery in Broadcast TV task can be split into four subtasks: speaker diarization, face diarization, optical character recognition (OCR), and speech transcription. Among these, we focus on diarization using face identification. This year, we introduce i-vectors [2] computed over deep features extracted with FaceNet, one of the state-of-the-art neural networks for face recognition.

Figure 1 shows an overview of our method. First, we detect and track faces in a video. Second, deep features are extracted from the detected faces every 0.5 seconds. Third, an i-vector is computed from the deep features of each face track.

[Figure 1: Overview of the whole system]

2. APPROACH

2.1 Face diarization

2.1.1 Deep features
We employ FaceNet [3] to extract deep features. A deep feature is taken from the output layer of the network and can be used to measure similarity between faces. To obtain face regions in a video, we employ the face detection and tracking method of [4]; deep features are then extracted from the face regions every 0.5 seconds.

2.1.2 Visual i-vectors
After deep features are extracted, we build visual i-vectors. The i-vector is one of the state-of-the-art methods for speaker verification; here we apply it to the face-tracking segments. Let M be a Gaussian Mixture Model (GMM) super-vector, which concatenates the normalized mean vectors of a GMM estimated on a target video segment. An i-vector w is extracted from it by assuming that M is modeled as

    M = m + Tw,

where m is a face- and channel-independent super-vector and T is a low-rank matrix representing the total variability. The Expectation-Maximization (EM) algorithm is used to estimate the total variability, as proposed in [2]. Note that w is associated with a given video segment. The i-vector w_s for segment s is computed as

    w_s = (I + T^t Σ^{-1} N(s) T)^{-1} T^t Σ^{-1} F(s),

where N(s) and F(s) are the zero- and first-order Baum-Welch statistics on the Universal Background Model (UBM) for the current segment s, and Σ is the covariance matrix of the UBM. Each i-vector represents one face track.
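For concreteness, the closed-form extraction above can be sketched in a few lines of NumPy. This is a minimal sketch under simplifying assumptions we add for illustration (a diagonal UBM covariance, a pre-trained total-variability matrix, and zero-order statistics flattened to one value per feature dimension); it is not the ALIZE implementation used in the experiments, and all helper names are ours.

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N_s, F_s):
    """Closed-form i-vector for one segment s:
        w_s = (I + T^t Sigma^-1 N(s) T)^-1 T^t Sigma^-1 F(s)

    Shapes (C mixtures of D-dim features, rank R):
      T         : (C*D, R)  total-variability matrix (assumed pre-trained)
      Sigma_inv : (C*D,)    inverse diagonal of the UBM covariance
      N_s       : (C*D,)    zero-order Baum-Welch statistics, with each
                            mixture's occupancy count repeated D times
      F_s       : (C*D,)    centered first-order Baum-Welch statistics
    """
    R = T.shape[1]
    TtSinv = T.T * Sigma_inv                    # T^t Sigma^-1 (diagonal case)
    precision = np.eye(R) + (TtSinv * N_s) @ T  # I + T^t Sigma^-1 N(s) T
    return np.linalg.solve(precision, TtSinv @ F_s)

# Toy usage with the paper's sizes: C=32 mixtures, D=128-dim deep
# features, R=100-dimensional visual i-vectors.
rng = np.random.default_rng(0)
CD, R = 32 * 128, 100
T = 0.01 * rng.standard_normal((CD, R))
w = extract_ivector(T, np.ones(CD), np.full(CD, 5.0), rng.standard_normal(CD))
print(w.shape)  # (100,)
```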
2.1.3 Attaching person name tags
To attach a person's name tag to each face, we use the provided tags with time ranges obtained from optical character recognition (OCR). First, each OCR tag is attached to the face track with the maximum overlap in appearance time; this leaves us with tagged and untagged faces. Second, for each untagged face, we find the nearest tagged face, and if the distance between the two is below a predefined threshold, the untagged face receives the same tag. The distance between two faces is computed as

    D_ij = 1 − (w_i^t w_j) / (||w_i||_2 ||w_j||_2),

where w_i and w_j are the i-vectors of the tagged and untagged faces, respectively.
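The two-pass tagging scheme can be sketched as follows. This is our own illustrative rendering, not the submitted implementation: face tracks and OCR tags are plain tuples, i-vectors are NumPy arrays, and threshold corresponds to the tunable value discussed in Section 3.2.

```python
import numpy as np

def cosine_distance(wi, wj):
    """D_ij = 1 - (wi . wj) / (||wi||_2 ||wj||_2), as in Section 2.1.3."""
    return 1.0 - wi @ wj / (np.linalg.norm(wi) * np.linalg.norm(wj))

def overlap(a, b):
    """Temporal overlap in seconds of two (start, end) ranges."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def tag_faces(tracks, ocr_tags, threshold):
    """tracks   : list of ((start, end), ivector), one entry per face track
    ocr_tags : list of (name, (start, end)) from the provided OCR
    Returns one name (or None) per face track."""
    names = [None] * len(tracks)
    # Pass 1: attach each OCR tag to the track with maximum time overlap.
    for name, rng in ocr_tags:
        best = max(range(len(tracks)), key=lambda i: overlap(tracks[i][0], rng))
        if overlap(tracks[best][0], rng) > 0.0:
            names[best] = name
    # Pass 2: propagate tags to untagged tracks from the nearest tagged
    # track in i-vector space, if the distance is below the threshold.
    tagged = [i for i, n in enumerate(names) if n is not None]
    for i in range(len(tracks)):
        if names[i] is not None or not tagged:
            continue
        j = min(tagged, key=lambda k: cosine_distance(tracks[i][1], tracks[k][1]))
        if cosine_distance(tracks[i][1], tracks[j][1]) < threshold:
            names[i] = names[j]
    return names
```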
2.2 Speaker diarization
For speaker diarization, Bayesian Information Criterion (BIC) based segmentation with 12 MFCCs plus energy is applied to obtain audio segments. Music and jingle segments are removed by Viterbi decoding. Finally, i-vectors are computed for each segment and clustered with Integer Linear Programming [5, 6].

2.3 Multimodal fusion
We employ the name propagation technique proposed in [7]. Our multimodal fusion takes the intersection of the tags obtained from speaker diarization and face diarization.

3. EXPERIMENTS AND RESULTS

3.1 Experimental Settings
We use the dlib library [8] for face detection and tracking, and the OpenFace implementation [9] of FaceNet; the deep features are 128-dimensional. To extract visual i-vectors, we train a UBM with 32 Gaussian mixtures and the total-variability matrix on the development set using ALIZE [10]. The development set is the INA corpus used in MediaEval 2015, and we use the detected faces for training. The visual i-vectors are 100-dimensional. For speaker diarization, we use the LIUM Speaker Diarization system [11] with a 256-mixture UBM and 50-dimensional i-vectors. We used the provided OCR and fusion code to build our system and implemented all other components ourselves.

3.2 Experimental Results
Table 1 shows the Mean Average Precision (MAP) on the test set. Face Diarization was used for our submissions: the threshold of the primary submission was tuned on the development set, and the threshold of the contrastive submission was 0. The Speaker Diarization and Multimodal rows were evaluated with the evaluation tool and were not part of our submissions.

Table 1: Mean Average Precision (MAP) on the test set

Subset       Method                          MAP@1 [%]  MAP@10 [%]
Leaderboard  Face Diarization (Primary)        27.7       18.3
             Face Diarization (Contrastive)    23.5       11.9
             Speaker Diarization               13.4       11.3
             Multimodal                        22.7       15.0
Eval         Face Diarization (Primary)        31.5       20.0
             Face Diarization (Contrastive)    25.7       13.6
             Speaker Diarization               13.1       11.7
             Multimodal                        29.3       17.3

Face Diarization performs better than Speaker Diarization; it was effective for identifying speakers with short utterances. However, Multimodal is worse than Face Diarization. To improve the multimodal fusion system, we need to introduce multimodal features that capture the correlation between the audio and visual streams. Modeling the temporal relations between speakers is also needed to improve performance.

4. CONCLUSION
We presented a face diarization based system that uses visual i-vectors with FaceNet deep features. Developing better multimodal fusion methods and exploiting sequential information are left as future work.

5. REFERENCES
[1] Hervé Bredin, Camille Guinaudeau, and Claude Barras. Multimodal Person Discovery in Broadcast TV at MediaEval 2016. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[2] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 4, pp. 788–798, 2011.
[3] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 886–893, 2005.
[5] Mickael Rouvier and Sylvain Meignier. A global optimization framework for speaker diarization. In Proc. of Odyssey, pp. 146–150, 2012.
[6] Grégor Dupuy, Sylvain Meignier, Paul Deléglise, and Yannick Estève. Recent improvements on ILP-based clustering for broadcast news speaker diarization. In Proc. of Odyssey, 2014.
[7] Johann Poignant, Hervé Bredin, Viet-Bac Le, Laurent Besacier, Claude Barras, and Georges Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Proc. of Interspeech, 2012.
[8] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, Vol. 10, pp. 1755–1758, 2009.
[9] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. Technical Report CMU-CS-16-118, CMU School of Computer Science, 2016.
[10] Jean-François Bonastre, Frédéric Wils, and Sylvain Meignier. ALIZE, a free toolkit for speaker recognition. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 737–740, 2005.
[11] Mickael Rouvier, Grégor Dupuy, Paul Gay, Elie Khoury, Teva Merlin, and Sylvain Meignier. An open-source state-of-the-art toolbox for broadcast news diarization. Technical report, Idiap, 2013.