GTM-UVigo Systems for Person Discovery Task at MediaEval 2015

Paula Lopez-Otero, Rosalía Barros, Laura Docio-Fernandez, Elisardo González-Agulla, José Luis Alba-Castro, Carmen Garcia-Mateo
AtlantTIC Research Center
{plopez,rbarros,ldocio,eli,jalba,carmen}@gts.uvigo.es

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT
In this paper, we present the systems developed by the GTM-UVigo team for the Multimedia Person Discovery in Broadcast TV task at MediaEval 2015. The systems propose two different strategies for person discovery in audio through speaker diarization (one based on an online clustering strategy with error correction using OCR information, and the other based on agglomerative hierarchical clustering), as well as intrashot and intershot strategies for face clustering.

1. INTRODUCTION
The Person Discovery in Broadcast TV task at MediaEval 2015 aims at finding the names of people who are both seen and heard in every shot of a collection of videos [10]. This paper describes the audio, video and multimodal approaches developed by the GTM-UVigo team to address this task.¹

¹The code of the GTM-UVigo systems will be released at https://github.com/gtm-uvigo/Mediaeval_PersonDiscovery

2. AUDIO-BASED PERSON DISCOVERY
The audio approaches can be divided into three stages: speech activity detection, division of speech regions into speaker turns and, lastly, speaker clustering.

2.1 Speech Activity Detection
A Deep Neural Network (DNN) based speech activity detector (SAD) was used. The acoustic features were 26 log-mel-filterbank outputs, and a window of 31 frames was used to predict the label of the central frame. The DNN has the following architecture: an 806-unit input layer (26 features x 31 frames), 4 hidden layers, each containing 32 tanh activation units, and an output layer consisting of two softmax units. The output layer generates a posterior probability for the presence or non-presence of speech, and the ratio of both output posteriors is used as a confidence measure of speech activity over time. This confidence is median filtered to produce a smoothed estimate of speech presence and, finally, a frame is classified as speech if this smoothed value is greater than a threshold.
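As an illustration, the decision stage described above fits in a few lines. The following is a minimal sketch, assuming the two softmax outputs of the DNN are available as NumPy arrays; the median-filter length and the decision threshold are illustrative values, not the ones used in the evaluation.

```python
import numpy as np
from scipy.signal import medfilt

def sad_decisions(p_speech, p_nonspeech, kernel=11, threshold=1.0):
    """Turn frame-level DNN posteriors into speech/non-speech labels.

    The ratio of the two softmax outputs serves as a confidence
    measure, which is median filtered before thresholding.
    `kernel` (odd) and `threshold` are illustrative values.
    """
    eps = 1e-10
    confidence = p_speech / (p_nonspeech + eps)   # posterior ratio
    smoothed = medfilt(confidence, kernel_size=kernel)
    return smoothed > threshold                   # True = speech frame
```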
2.2 Speaker Segmentation
After performing speech activity detection, the speech segments are further divided into speaker turns following the approach described in [7]. First, Mel-frequency cepstral coefficients (MFCCs) plus energy are extracted from the waveform. After this, the Bayesian Information Criterion (BIC) based segmentation approach described in [2] is employed, performing a coarse segmentation to find candidate change-points followed by a refinement step. A false alarm rejection strategy is applied in the latter step so as to discard change-points that are suspected of being false alarms [6].

2.3 Speaker Clustering
Two different approaches for speaker diarization were assessed: one working in online mode, used in the primary system, and another working in offline mode. A feature they have in common is the use of the iVector paradigm [3] for speaker turn representation.

2.3.1 Online approach
This clustering strategy consists of comparing the iVectors of the speaker models with the iVector of a given speaker turn by computing their dot products; if the maximum dot product exceeds a predefined threshold, the speaker turn is assigned to the corresponding speaker model, and otherwise it is considered a new speaker. Every time a new segment is assigned to a speaker, its model is refined by computing the mean of all the iVectors assigned to that speaker model.

A novel feature introduced in this online clustering scheme is the use of written names obtained from OCR [9] for automatic error correction. To that end, the speaker assignment derived from these written names is considered more reliable than the clustering assignment, so whenever the clustering and the written-name approach make different decisions, the written name prevails over the clustering decision.

2.3.2 Offline approach
The proposed offline clustering strategy relies on an agglomerative hierarchical clustering scheme. First, a similarity matrix was obtained by computing the dot product between all pairwise combinations of the iVectors of the speaker turns, and this matrix was used to obtain a dendrogram. The C-score stopping criterion described in [8] was used to select the number of clusters.
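To make the online scheme of Section 2.3.1 concrete, the sketch below illustrates the clustering loop, assuming length-normalized iVectors (so that the dot product acts as a cosine similarity); the threshold is an illustrative value, and the `names` mapping, which stands in for the OCR-based written-name override, is hypothetical.

```python
import numpy as np

def online_cluster(turn_ivectors, threshold=0.5, names=None):
    """Sketch of the online clustering loop of Section 2.3.1.

    Each speaker model is the mean of the (length-normalized)
    iVectors assigned to it. `names` optionally maps turn indices
    to OCR written names, which override the clustering decision.
    """
    models, members, labels = [], [], []
    name_to_model = {}
    for i, w in enumerate(turn_ivectors):
        w = np.asarray(w, dtype=float)
        w = w / np.linalg.norm(w)                # length normalization
        name = names.get(i) if names else None
        if name is not None and name in name_to_model:
            k = name_to_model[name]              # written name prevails
        else:
            scores = [float(np.dot(m, w)) for m in models]
            if scores and max(scores) > threshold:
                k = int(np.argmax(scores))       # assign to best model
            else:
                k = len(models)                  # open a new speaker
                models.append(w)
                members.append([])
        members[k].append(w)
        models[k] = np.mean(members[k], axis=0)  # refine speaker model
        if name is not None:
            name_to_model.setdefault(name, k)
        labels.append(k)
    return labels
```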
3. VIDEO-BASED PERSON DISCOVERY
The video-based strategies encompass three different steps: face detection and tracking, visual speech activity detection, and face clustering.

3.1 Face Detection and Tracking
Face detection is based on histogram of oriented gradients (HOG) features and a linear SVM classifier implemented in the dlib library [5]. For each detected person, a face tracking and landmark detection method based on CLNF models is used [1]; every time a person stops being visible on screen, a model containing information about presence, speech intervals and the highest-quality face templates is stored in a database. To reduce the false alarm rate, face tracks that have a short duration and a low quality score are rejected; this score is calculated as a weighted sum of face symmetry and sharpness values.

3.2 Visual Speech Activity Detection
The proposed visual speech activity detection method is based on relative mouth movements, which are generally small in silence sections, whereas variations of lip shape are usually stronger during speech [12]. Using the face landmarks obtained in the previous step, mouth openness and lip height variance over time are computed. A variable threshold based on face size is applied in order to make the decision at each frame, and a low-pass filter is used to smooth the results.

3.3 Face Clustering
The face clustering strategies consist of a face recognition system: every time a face track is about to be inserted in the database, a score is computed in order to decide whether to add it as a new person or to merge it with an existing one. First, Gabor features are extracted from the highest-quality templates of a person, and matching scores are obtained using the hyper-cosine distance [4]. Second, the final score to compare with the merging threshold is computed as the maximum of all the matching scores obtained from the two sets of face images. In the intrashot strategy, only models that appear within the same shot are compared, aiming at correcting presence intervals when the tracking method fails. The intershot strategy allows merging all the appearances of a person in a video.

4. MULTIMODAL PERSON DISCOVERY
Multimodal person discovery was performed using four different sources of information: speaker diarization (SD) using the techniques described in Section 2; face detection (FD) and video-based speech activity detection (VVAD) as described in Section 3; and written names (WN) extracted using the strategy described in [9]. First, the set of evidences is defined as proposed in the baseline fusion strategy provided by the organizers. Given a shot, a person is considered to appear in it if the same name is present in SD, FD and VVAD within the time interval that defines the shot. A late naming strategy was used to assign names to the different sources of information [11]. For each hypothesized name, a confidence is computed as proposed in the baseline strategy, but hypotheses with confidence lower than 1 are discarded, as they correspond to situations of non-overlap between the evidence and the hypothesized name.
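The fusion rule can be illustrated with a small sketch. The data structures below are hypothetical (each source is assumed to map person names to lists of (start, end) intervals after late naming); the sketch only shows the intersection rule, not the confidence computation.

```python
def persons_in_shot(shot, sd, fd, vvad):
    """Illustrative version of the fusion rule in Section 4.

    `shot` is a (start, end) interval; `sd`, `fd` and `vvad` are
    assumed to map person names to lists of (start, end) intervals.
    A name is kept only if all three sources place it inside the
    shot interval.
    """
    start, end = shot

    def present(intervals):
        # True if any interval overlaps the shot interval
        return any(s < end and e > start for s, e in intervals)

    return {name for name in sd
            if present(sd[name])
            and present(fd.get(name, []))
            and present(vvad.get(name, []))}
```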
5. RESULTS AND DISCUSSION
Table 2 shows the results achieved by the submitted systems on both the REPERE (partition 'test2') and INA datasets; these systems are combinations of the two proposed speaker diarization and face clustering strategies, as summarized in Table 1. The results achieved using the baseline metadata (b) are also shown for comparison.

Table 1: Summary of the submitted systems

  System              Spk. clustering    Face clustering
  Primary (p)         online             intrashot
  Contrastive1 (c1)   online             intershot
  Contrastive2 (c2)   offline            intrashot
  Contrastive3 (c3)   offline            intershot

Table 2: Results on the development and test datasets corresponding to the July 1st deadline

                  REPERE                        INA
        EwMAP     MAP       C         EwMAP     MAP       C
  p     75.76%    77.10%    78.03%    80.34%    80.61%    92.42%
  c1    74.90%    75.80%    77.58%    75.42%    75.69%    85.99%
  c2    75.76%    77.10%    77.58%    80.21%    80.49%    92.32%
  c3    75.54%    76.43%    77.58%    75.26%    75.54%    85.89%
  b     63.58%    63.93%    71.75%    78.35%    78.64%    92.71%

The results in Table 2 indicate that the two speaker diarization strategies are almost equally suitable for this task, as they achieve very similar results; nevertheless, the online strategy performs slightly better, probably due to the use of OCR information for error correction. With respect to the face clustering strategies, the intrashot method obtained better results, probably because the intershot strategy merged faces too aggressively, making the system miss speakers by erroneously combining them with others.

The development of the audio-based person discovery approaches showed us that a lower speaker diarization error rate does not necessarily lead to a higher EwMAP, as overclustering results in incorrect person detections. Also, we have to increase our efforts on TV programmes featuring challenging acoustic conditions, which are those whose performance degraded the most. Lastly, we found that adding written names obtained from OCR to the speaker diarization algorithm improved performance, so this type of fusion will be studied in more depth.

The proposed video-based person discovery approaches showed that the intrashot strategy performed better than the intershot strategy, probably because of the overclustering issue mentioned above. The most challenging aspects, which will have to be addressed in the future, were the variations in pose, scale and illumination, as they made it difficult to develop a robust face matching strategy.

The GTM-UVigo team approached this task by developing audio and face modules and combining them through a simple decision-level fusion; in future work, audiovisual fusion at earlier stages of the system will be researched in order to exploit the full potential of multimodal person discovery.

6. ACKNOWLEDGEMENTS
This research was funded by the Spanish Government ('SpeechTech4All Project' TEC2012-38939-C03-01), the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and the 'AtlantTIC Project' CN2012/160, and also by the Spanish Government and the European Regional Development Fund (ERDF) under project TACTICA.

7. REFERENCES
[1] T. Baltrusaitis, P. Robinson, and L. Morency. Constrained local neural fields for robust facial landmark detection in the wild. In IEEE International Conference on Computer Vision Workshops (ICCVW), pages 354-361, 2013.
[2] M. Cettolo and M. Vescovi. Efficient audio segmentation algorithms based on the BIC. In Proceedings of ICASSP, volume VI, pages 537-540, 2003.
[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front end factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 2010.
[4] E. González-Agulla, E. Argones-Rua, J. Alba-Castro, D. González-Jiménez, and L. Anido-Rifón. Multimodal biometrics-based student attendance measurement in learning management systems. In IEEE International Symposium on Multimedia (ISM), pages 699-704, 2009.
[5] D. King. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10:1755-1758, 2009.
[6] P. Lopez-Otero. Improved Strategies for Speaker Segmentation and Emotional State Detection. PhD thesis, Universidade de Vigo, 2015.
[7] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. GTM-UVigo system for Albayzin 2014 audio segmentation evaluation. In Iberspeech 2014: VIII Jornadas en Tecnología del Habla and IV SLTech Workshop, 2014.
[8] P. Lopez-Otero, L. Docio-Fernandez, and C. Garcia-Mateo. A novel method for selecting the number of clusters in a speaker diarization system. In Proceedings of EUSIPCO, pages 656-660, 2014.
[9] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In Proceedings of IEEE International Conference on Multimedia and Expo (ICME), 2012.
[10] J. Poignant, H. Bredin, and C. Barras. Multimodal Person Discovery in Broadcast TV at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, 2015.
[11] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In Proceedings of Interspeech, 2012.
[12] B. Rivet, L. Girin, and C. Jutten. Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Communication, 49(7):667-677, 2007.