UPC System for the 2016 MediaEval Multimodal Person Discovery in Broadcast TV Task∗

Miquel India, Gerard Martí, Carla Cortillas, Giorgos Bouritsas, Elisa Sayrol, Josep Ramon Morros, Javier Hernando
Universitat Politècnica de Catalunya

∗ This work has been developed in the framework of the projects TEC2013-43935-R, TEC2012-38939-C03-02 and PCIN-2013-067. It has been financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, October 19-21, 2016, Hilversum, Netherlands.

ABSTRACT
The UPC system works by extracting monomodal signal segments (face tracks, speech segments) that overlap with the person names overlaid in the video signal. These segments are directly assigned the name of the person and are used as a reference to compare against the non-overlapping (unassigned) signal segments. This process is performed independently on the speech and video signals. A simple fusion scheme is used to combine both monomodal annotations into a single one.

1. INTRODUCTION
This paper describes the UPC system for the Multimodal Person Discovery in Broadcast TV task [2] of the 2016 MediaEval evaluations. The system detects face tracks (FT), speech segments (SS) and the person names overlaid in the video signal. The video and speech signals are processed independently. For each modality, we aim to construct a classifier that can determine whether a given FT or SS belongs to one of the persons appearing on the scene with an assigned overlaid name. As the system is unsupervised, we use the detected person names to identify the persons appearing in the program. Thus, we assume that the FTs or SSs that overlap with a detected person name are true representations of that person.

The signal intervals that overlap with an overlaid person name are extracted and used for unsupervised enrollment, defining a model for each detected name. This way, a set of classes corresponding to the different persons in the detected names is defined. These intervals are directly labeled by assigning the identity corresponding to the overlaid name. For each modality, a joint identification-verification algorithm is then used to assign each unlabeled signal interval (FT or SS not overlapping with an overlaid name) to one of the previous classes. For each unlabeled interval, the signal is compared against all the models and the one with the best likelihood is selected. An additional 'Unknown' class is implicitly considered, corresponding to the cases where the face track or speech segment belongs to a person that is never named (i.e., none of the appearances of this person in the video overlap with a detected name).

At the end of this process we have two different sets of annotations, one for speech and one for video. The two results are fused, as described in Section 5, to obtain the final annotation.

2. TEXT DETECTION
We have used the two baseline detections with some additional post-processing. The first one (TB1) was generated by our team and distributed to all participants. The LOOV [6] text detection tool was used to detect and track text, i.e., to define the temporal intervals where a given text appears. Detections were filtered by comparing them against lists of first names and last names downloaded from the internet. We also used lists of neutral particles ('van', 'von', 'del', etc.) and negative names ('boulevard', etc.). All names were normalized to contain only lower-case alphabetic ASCII characters, without accents or special characters. For a detected text to be considered a name, it had to contain at least one first name and one last name. The percentage of positive matches for these two classes was used as a score; matches from the neutral class did not penalize the percentage. Additionally, if the first word in the detected text was included in the negative list, the text was discarded.
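The following is a minimal sketch of this filtering heuristic, not our actual implementation: the list file names, the tokenization, and the example neutral and negative words are assumptions made for illustration.

```python
import re

def load_list(path):
    """Load one normalized entry per line (lowercase ASCII)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Hypothetical list files; in practice the lists were downloaded from the internet.
first_names = load_list("first_names.txt")
last_names = load_list("last_names.txt")
neutral_particles = {"van", "von", "del"}          # example neutral particles
negative_words = {"boulevard", "avenue", "rue"}    # example negative words

def normalize(text):
    """Lowercase and keep only ASCII letters and spaces.
    (Accent transliteration is omitted for brevity; accented characters are dropped here.)"""
    text = re.sub(r"[^a-z ]+", " ", text.lower())
    return text.split()

def name_score(detected_text):
    """Return a match score in [0, 1], or None if the detected text is rejected."""
    words = normalize(detected_text)
    if not words or words[0] in negative_words:
        return None                         # discard texts starting with a negative word
    has_first = any(w in first_names for w in words)
    has_last = any(w in last_names for w in words)
    if not (has_first and has_last):
        return None                         # must contain a first name and a last name
    considered = [w for w in words if w not in neutral_particles]   # neutral words do not penalize
    positives = sum(w in first_names or w in last_names for w in considered)
    return positives / max(len(considered), 1)
```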
To construct TB1 we had access to the test videos before the rest of the participants. However, we only used this data for this purpose and we did not run any test of the rest of our system before the official release.

The second set of annotations, TB2, was provided by the organizers [2]. These annotations contained a large number of false positives. We applied the filtering described above to TB2 and combined the result with TB1, as the two detectors turned out to be partly complementary.

3. VIDEO SYSTEM
For face tracking, the 2015 baseline code [7] was used. Filtering was applied to remove tracks shorter than a fixed duration or with a face size that was too small.

The VGG-face [8] Convolutional Neural Network (CNN) was used for feature extraction. We extracted the features from the activations of the last fully connected layer. The network was trained using a triplet network architecture [5]. The features of the detected faces in each track are extracted with this network and then averaged to obtain a 1024-dimensional feature vector for each track.

A face verification algorithm was used to compare and classify the tracks. First, the tracks that overlapped with a detected name were named by assigning that identity. To reduce wrong assignments, the name was only assigned if it overlapped with a single track. Then, using the set of named tracks from the full video corpus, a Gaussian Naive Bayes (GNB) binary classifier was trained on the Euclidean distances between pairs of samples from the named tracks. For each specific video, each unnamed track was compared with all the named tracks of the video by computing the Euclidean distance between the respective feature vectors of the tracks (see Figure 1). This value was classified with the GNB as either an intra-class distance (both tracks belong to the same identity) or an inter-class distance (the tracks are not from the same person). The probability of the distance being intra-class was used as the confidence score. The unnamed track was assigned the identity of the most similar named track, and a threshold on the confidence score (0.75) was used to discard tracks not corresponding to any named track.

[Figure 1: Diagram of the verification system]
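As an illustration of this track-naming step (a sketch under our assumptions, not the exact implementation), the procedure could be expressed as follows; the use of scikit-learn's GaussianNB and the helper names are assumptions, while the distance feature, the intra/inter-class labels and the 0.75 threshold follow the description above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def train_distance_classifier(named_feats, named_ids):
    """Train a GNB on Euclidean distances between pairs of named tracks.
    Label 1 = intra-class (same identity), 0 = inter-class."""
    dists, labels = [], []
    for i in range(len(named_feats)):
        for j in range(i + 1, len(named_feats)):
            dists.append([np.linalg.norm(named_feats[i] - named_feats[j])])
            labels.append(1 if named_ids[i] == named_ids[j] else 0)
    gnb = GaussianNB()
    gnb.fit(np.array(dists), np.array(labels))
    return gnb

def assign_identity(track_feat, named_feats, named_ids, gnb, threshold=0.75):
    """Assign an unnamed track the identity of the most similar named track,
    unless the intra-class confidence falls below the threshold."""
    best_id, best_conf = None, -1.0
    for feat, ident in zip(named_feats, named_ids):
        d = np.linalg.norm(track_feat - feat)
        conf = gnb.predict_proba([[d]])[0, 1]   # P(intra-class | distance)
        if conf > best_conf:
            best_id, best_conf = ident, conf
    return (best_id, best_conf) if best_conf >= threshold else (None, best_conf)
```

Since the classifier sees only a one-dimensional distance feature, the GNB effectively acts as a probabilistic threshold on the Euclidean distance, which keeps the verification step very light.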
4. SPEAKER TRACKING
Speaker information was extracted using an i-vector based speaker tracking system. Assuming that overlaid text names are temporally overlapped with their speaker and face identities, speaker models were created using the data of those text tracks. Speaker tracking was performed by evaluating the cosine distance between the model i-vectors and the i-vectors extracted for each frame of the signal.

Speaker modelling was implemented using i-vectors [3]. An i-vector is a low-rank vector, typically of dimension between 400 and 600, that represents a speech utterance. The feature vectors of the speech signal are modeled by a Gaussian Mixture Model (GMM) adapted from a Universal Background Model (UBM). The mean vectors of the adapted GMM are stacked to build the supervector M, which can be written as:

M = m_u + Tω    (1)

where m_u is the speaker- and session-independent mean supervector from the UBM, T is the total variability matrix, and ω is a hidden variable. The mean of the posterior distribution of ω is referred to as the i-vector. This posterior distribution is conditioned on the Baum-Welch statistics of the given speech utterance. The T matrix is trained using the Expectation-Maximization (EM) algorithm given the centralized Baum-Welch statistics from background speech utterances. More details can be found in [3].

The speaker tracking system has been implemented as a speaker identification system with a segmentation-by-classification method. For feature extraction, 20 Mel Frequency Cepstral Coefficients (MFCC) plus ∆ and ∆∆ coefficients were extracted. Using the Alize toolkit [4, 1], a total variability matrix was trained per show. I-vectors were extracted from 3-second segments with a 0.5-second shift, and the baseline speaker diarization was used to select the speaker-turn segments from which the query i-vectors were extracted. Identification was performed by evaluating the cosine distance between the segment i-vectors and each query i-vector; the query with the lowest distance was assigned to the segment. A global distance threshold, previously trained on the development database, was used to discard assignments with large distances.
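A schematic view of this segmentation-by-classification step is sketched below. It assumes that the segment and query i-vectors have already been extracted (in our system, with the Alize toolkit); the data structures, function names and the way the 'Unknown' case is returned are illustrative, while the cosine distance, the closest-query assignment and the global distance threshold follow the text.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two i-vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def track_speakers(segment_ivectors, query_ivectors, dist_threshold):
    """Assign each segment i-vector (3 s windows, 0.5 s shift) to the closest query identity,
    or to None when the best distance exceeds the globally trained threshold."""
    assignments = []
    for seg_iv in segment_ivectors:
        best_name, best_dist = None, float("inf")
        for name, query_iv in query_ivectors.items():
            d = cosine_distance(seg_iv, query_iv)
            if d < best_dist:
                best_name, best_dist = name, d
        if best_dist > dist_threshold:
            best_name = None    # too far from every named speaker: implicit 'Unknown'
        assignments.append((best_name, best_dist))
    return assignments
```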
5. FUSION SYSTEM AND RESULTS
Starting from the speaker and face tracking shot labelings, two fusion methods were implemented. The first method was the intersection of the shots of both tracking systems, averaging the confidence of the intersected shots. The second method was a union strategy that relies on the intersected shots of both modalities and reduces the confidence of the shots that are not intersected: the shots of the video and speaker systems are merged, averaging the confidence score if both systems detect the same identity in a shot, or reducing the confidence by a factor of 0.5 if only one of the systems detected a query.

Four different experiments were performed, shown in Table 1. Baseline 1 refers to the fusion between the baseline speaker diarization and the OCR, Baseline 2 refers to the fusion between the face detection and the OCR, and Baseline 3 is the intersection of the two previous baselines. Initially, speaker and face tracking were evaluated separately. The intersection and the union of both tracking systems were then implemented as fusion strategies.

Table 1: MAP evaluation
System          MAP1 (%)   MAP5 (%)   MAP10 (%)
Baseline 1      13.1       12.0       11.7
Baseline 2      37.0       30.3       29.2
Baseline 3      36.3       29.3       27.3
Spk Tracking    43.3       30.6       28.8
Face Tracking   61.3       47.9       45.5
Intersection    47.9       34.0       32.0
Union           63.0       50.5       48.4

As shown in Table 1, both monomodal systems improve the baseline performances by a large margin. The union strategy performs better than the intersection strategy, but this fusion does not show a significant performance increase over the individual modalities. Analysing the results, we believe that failures in text detection were the main factor impacting the final scores.

6. CONCLUSIONS
Speaker and face tracking have been combined using different fusion strategies. This year, our idea was to focus only on the overlaid names to develop tracking systems, instead of merging diarization systems with text detection. The tracking systems have shown better performance than the diarization-based baselines. For fusion, the union strategy obtained higher scores than the intersection method.

7. REFERENCES
[1] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher, A. Preti, G. Pouchoulin, N. Evans, B. Fauve, and J. Mason. ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition. In Proc. Odyssey: The Speaker and Language Recognition Workshop, 2008.
[2] H. Bredin, C. Barras, and C. Guinaudeau. Multimodal person discovery in broadcast TV at MediaEval 2016. In Working Notes Proceedings of the MediaEval 2016 Workshop, 2016.
[3] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, May 2011.
[4] A. Larcher, J.-F. Bonastre, B. Fauve, K. A. Lee, C. Lévy, H. Li, J. S. D. Mason, and J.-Y. Parfait. ALIZE 3.0 - open source toolkit for state-of-the-art speaker recognition. In Annual Conference of the International Speech Communication Association (Interspeech), 2013.
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015.
[6] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In ICME 2012, 2012.
[7] J. Poignant, H. Bredin, and C. Barras. Multimodal person discovery in broadcast TV at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.