GTM-UVigo System for Multimodal Person Discovery in Broadcast TV Task at MediaEval 2016

Paula Lopez-Otero, Laura Docio-Fernandez, Carmen Garcia-Mateo
Multimedia Technologies Group (GTM), AtlantTIC Research Center, University of Vigo
E.E. Telecomunicación, Campus Universitario S/N, 36310 Vigo
{plopez,ldocio,carmen}@gts.uvigo.es

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20–21, 2016, Hilversum, Netherlands.

ABSTRACT
In this paper, we present the system developed by the GTM-UVigo team for the Multimodal Person Discovery in Broadcast TV task at MediaEval 2016. The proposed approach is a novel strategy for person discovery that is not based on speaker and face diarisation, as in previous works. Instead, the task is approached as a person recognition problem: there is an enrolment stage, where the voice and face of each discovered person are detected, and, for each shot, the most suitable voice and face are assigned using the i-vector paradigm. These two biometric modalities are combined by decision fusion.

1. INTRODUCTION
The Person Discovery in Broadcast TV task at MediaEval 2016 aims at finding the names of people who can be both seen and heard in every shot of a collection of videos [2]. This paper describes a novel approach that is not based on speaker and face diarisation, as is usually done in this task [6, 7, 8, 10]; instead, the task is approached as a person recognition problem.

2. SYSTEM DESCRIPTION
The proposed system can be divided into an enrolment stage and a search stage. For each person name detected by optical character recognition (OCR), the most likely intervals of speech and face presence are detected and used for enrolment. Once the detected people are enrolled, speaker and face recognition are performed for each shot in order to assign labels to that shot. A decision fusion strategy is implemented in order to combine the speech and video labels. The details of the system are described below.

2.1 Name detection
The person names were obtained from the video using the baseline system provided by the organisers; specifically, the UPC OCR approach based on LOOV was used [11]. Since the output of the OCR module had errors, such as including additional words in a person name, a naïve filtering of the OCR output was performed by removing those names that had more than four words.

2.2 Speech enrolment
First, features were extracted from the waveform; specifically, 19 Mel-frequency cepstral coefficients (MFCCs) including energy were extracted every 10 ms using a 25 ms sliding window. A dynamic normalisation of the cepstral mean was applied using a sliding window of 300 ms. These features were extracted using the Kaldi toolkit [12]. Then, for each person name detected by the OCR:

• The time interval (t_start, t_end) in which the name of speaker spk appears is taken as a starting point. A strategy to enlarge this time interval, in order to obtain more data to enrol the speaker, is applied (a sketch of this step is given after this list): given the time intervals S_left = (t_start − 10, t_end) and S_right = (t_start, t_end + 10), a change point is searched for within each of these intervals using the Bayesian information criterion (BIC) algorithm for speaker segmentation, with the restriction that the change point has to lie in the intervals (t_start − 10, t_start) and (t_end, t_end + 10), respectively. If no change point was found within the interval S_left, then t_left is set to t_start − 10 and, similarly, if no change point was found within the interval S_right, then t_right is set to t_end + 10; otherwise, t_left and t_right are set to the corresponding change points. Speaker spk is then assumed to be speaking in the interval S_spk = (t_left, t_right). In case speaker spk appears several times in the OCR output, a segment is computed for each occurrence.

• Speech activity detection (SAD) was performed in order to remove the non-speech parts. To do so, the energy-based SAD approach implemented in the Kaldi toolkit was applied.

• An i-vector [5] was extracted for speaker spk using the Kaldi toolkit. In case several segments were obtained in the first step, their features were concatenated and all the segments were treated as a single one. In this step, the 19 MFCCs were augmented with their delta and acceleration coefficients.
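To make the interval-expansion rule above concrete, the following is a minimal Python sketch of its decision logic. It is an illustration rather than the authors' code, and bic_change_point is a hypothetical stand-in for a BIC-based speaker-segmentation routine.

```python
# Minimal sketch of the enrolment-interval expansion from Section 2.2.
# Assumptions: times are in seconds, and bic_change_point(lo, hi) is a
# hypothetical helper returning the time of a speaker change point
# detected in (lo, hi), or None if no change point is found there.

MARGIN = 10.0  # seconds added on each side of the OCR name interval


def expand_interval(t_start, t_end, bic_change_point, margin=MARGIN):
    """Return (t_left, t_right), the enlarged enrolment interval for a speaker."""
    # Left side: search S_left = (t_start - margin, t_end), but only accept a
    # change point lying in the added margin (t_start - margin, t_start).
    cp_left = bic_change_point(t_start - margin, t_end)
    if cp_left is not None and t_start - margin < cp_left < t_start:
        t_left = cp_left
    else:
        t_left = t_start - margin

    # Right side: search S_right = (t_start, t_end + margin), but only accept a
    # change point lying in the added margin (t_end, t_end + margin).
    cp_right = bic_change_point(t_start, t_end + margin)
    if cp_right is not None and t_end < cp_right < t_end + margin:
        t_right = cp_right
    else:
        t_right = t_end + margin

    return t_left, t_right
```

In this reading, a change point found inside the added margin trims the enrolment segment to that point; otherwise the full 10-second extension is kept.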
2.3 Face enrolment
When dealing with faces, the first step consisted in performing face tracking using the baseline approach, which is based on histograms of oriented gradients [3] and the correlation tracker proposed in [4]. Then, for each person name detected by the OCR:

• The faces detected by the face tracker in the interval (t_start, t_end) in which the name of speaker spk appears are considered. If only one face was detected, the whole presence interval of that face is taken. If more than one face was detected, the one that appeared in the most frames was assigned to the speaker, under the assumption that it was the dominant face in the given time interval (a sketch of this selection is given after this list).

• Features were extracted in the time interval obtained in the previous step. To do so, face detection was first performed and a geometric normalisation was applied. After that, a photometric enhancement of the image using the Tan & Triggs algorithm [13] was applied. Finally, discrete cosine transform (DCT) features [9] were extracted using blocks of size 12 with 50% overlap and 45 DCT components. The feature extraction stage was performed using the Bob toolkit [1].

• Once the features were obtained, an i-vector representing that face was extracted using the Kaldi toolkit. As when dealing with speech, if there were several time intervals where the face of the speaker was present, the features obtained in all the segments were concatenated.
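As an illustration of the dominant-face selection in the first step, here is a minimal Python sketch; the FaceTrack structure and its fields are assumptions made for this example and do not reflect the actual output format of the baseline tracker.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FaceTrack:
    """Hypothetical representation of one face track produced by the tracker."""
    track_id: int
    frames: List[int] = field(default_factory=list)  # frame indices where the face is visible


def dominant_track(tracks: List[FaceTrack], first_frame: int, last_frame: int) -> Optional[FaceTrack]:
    """Return the track whose face appears in the most frames of the name interval.

    If only one track overlaps the interval it is returned directly; if none
    overlaps, None is returned, mirroring the behaviour described in Section 2.3.
    """
    overlapping = []
    for track in tracks:
        count = sum(1 for f in track.frames if first_frame <= f <= last_frame)
        if count > 0:
            overlapping.append((count, track))
    if not overlapping:
        return None
    # The face present in the most frames is assumed to be the dominant one.
    return max(overlapping, key=lambda item: item[0])[1]
```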
2.4 Search
The procedure to decide which speaker was present in each shot consisted of the following steps, applied to each shot:

• In order to detect whether the shot includes speech, speech detection was performed: perceptual linear prediction coefficients plus pitch features were extracted from the time interval defined by the shot, an i-vector was extracted, and a logistic regression approach was used to classify the segment as speech or non-speech. Non-speech segments were straightforwardly discarded. If speech was present in the shot, SAD was performed, an i-vector was extracted, and this shot i-vector was compared with the enrolment i-vectors by dot scoring. The speaker that achieved the highest score was assigned to the shot.

• The faces detected by the face tracker within the shot were identified, and the one that appeared in the most frames was chosen as the most representative face of the shot. An i-vector was extracted and the same decision procedure described for the speech data was applied.

• Once a decision was made for both the speech and the face data, the following fusion approach was implemented: a shot is assigned to a person if the face and speech detectors returned the same name and the sum of their scores was greater than a threshold (a sketch of this scoring and fusion rule is given after this list).
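The following minimal Python sketch illustrates the dot scoring against the enrolment i-vectors and the fusion rule above. It assumes i-vectors are NumPy arrays; the dictionary layout, helper names and threshold handling are illustrative assumptions rather than the actual implementation.

```python
import numpy as np


def dot_score(shot_ivector, enrol_ivectors):
    """Dot scoring of a shot i-vector against enrolment i-vectors.

    enrol_ivectors maps a person name to that person's enrolment i-vector;
    returns the best-matching name and its score.
    """
    scores = {name: float(np.dot(shot_ivector, ivec))
              for name, ivec in enrol_ivectors.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]


def fuse(speech_decision, face_decision, threshold):
    """Fusion rule of Section 2.4: label a shot only when both modalities agree
    on the name and the sum of their scores exceeds a threshold.

    Each decision is a (name, score) tuple, or None if the modality produced
    no decision (e.g. a shot classified as non-speech); returns the assigned
    name, or None when the shot is left unlabelled.
    """
    if speech_decision is None or face_decision is None:
        return None
    speech_name, speech_score = speech_decision
    face_name, face_score = face_decision
    if speech_name == face_name and speech_score + face_score > threshold:
        return speech_name
    return None
```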
3. RESULTS AND DISCUSSION

Table 1: Results achieved on the whole test data and on each partition.

             All                       3-24                      DW                        INA
      MAP@1  MAP@10  MAP@100    MAP@1  MAP@10  MAP@100    MAP@1  MAP@10  MAP@100    MAP@1  MAP@10  MAP@100
p     0.315  0.236   0.211      0.538  0.394   0.366      0.242  0.185   0.185      0.358  0.265   0.208
c1    0.293  0.182   0.168      0.487  0.338   0.314      0.242  0.157   0.157      0.314  0.178   0.146
c2    0.245  0.199   0.177      0.333  0.303   0.286      0.116  0.088   0.088      0.302  0.170   0.132
b     0.363  0.273   0.247      0.667  0.477   0.462      0.251  0.186   0.186      0.440  0.341   0.276

Table 1 shows the results achieved with the audio+video fusion system (p), the audio-only system (c1), the video-only system (c2) and the baseline provided by the organisers (b). The main conclusions that can be extracted from the table are: (1) the audio and video systems are complementary, since their combination improves on the individual results; (2) the audio results are better than the video results, especially on the DW database; and (3) the worst results were obtained on the DW database, while the best ones were achieved on the 3-24 database. The reason why the 3-24 results are, in general, better might be the small number of queries in the evaluation data corresponding to this database (only 15 queries out of 693), which makes the results not statistically significant. In the case of the DW database, 606 queries were evaluated; this, combined with the fact that the OCR approach used in this system did not find person names in 612 out of 757 files of the database, led to poor results on the DW data.

The aim of this system was to assess a novel approach for person discovery that is not based on speaker and face diarisation, unlike most state-of-the-art strategies. The achieved results are promising, and the experiments performed in this evaluation allowed the detection of the main weak points of the system, which will be improved in the future:

• The quality of the OCR output had a huge impact on the results, since it is the starting point of the whole enrolment stage, so OCR errors degrade the performance of the whole system. A simple approach based on natural language processing for filtering the OCR output, in order to remove everything that was not a person name, was assessed in this framework without success, but further experiments on this topic will be carried out in the future.

• All face-based steps relied on the baseline approach for face tracking, and its output was fed to the feature extraction module; however, only the information about face presence was used, not the bounding boxes where the faces appeared. This probably led to inconsistencies in the feature extraction stage and, therefore, in the face enrolment procedure. This issue will be addressed in order to improve the quality of the face-based approach.

Acknowledgements.
This research was funded by the Spanish Government under the project TEC2015-65345-P, by the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and the 'AtlantTIC Project' CN2012/160, and by the European Regional Development Fund (ERDF).

4. REFERENCES
[1] A. Anjos, L. E. Shafey, R. Wallace, M. Günther, C. McCool, and S. Marcel. Bob: a free signal processing and machine learning toolbox for researchers. In 20th ACM Conference on Multimedia Systems (ACMMM), 2012.
[2] H. Bredin, C. Barras, and C. Guinaudeau. Multimodal Person Discovery in Broadcast TV at MediaEval 2016. In Proceedings of the MediaEval 2016 Workshop, 2016.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893, 2005.
[4] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference (BMVC), 2014.
[5] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 2010.
[6] M. India, D. Varas, V. Vilaplana, J. Morros, and J. Hernando. UPC system for the 2015 MediaEval multimodal person discovery in broadcast TV task. In Proceedings of the MediaEval 2015 Workshop, 2015.
[7] N. Le, D. Wu, S. Meignier, and J.-M. Odobez. EUMSSI team at the MediaEval person discovery challenge. In Proceedings of the MediaEval 2015 Workshop, 2015.
[8] P. Lopez-Otero, R. Barros, L. Docio-Fernandez, E. Gonzalez-Agulla, J. Alba-Castro, and C. Garcia-Mateo. GTM-UVigo systems for person discovery task at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, 2015.
[9] C. McCool and S. Marcel. Parts-based face verification using local frequency bands. In Proceedings of the IEEE/IAPR International Conference on Biometrics, 2009.
[10] F. Nishi, N. Inoue, and K. Shinoda. Combining audio features and visual i-vector @ MediaEval 2015 multimodal person discovery in broadcast TV. In Proceedings of the MediaEval 2015 Workshop, 2015.
[11] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2012.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE Signal Processing Society, 2011.
[13] X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing, 19(6):1635–1650, 2010.