PERCOLATTE: A Multimodal Person Discovery System in
TV Broadcast for the MediaEval 2015 Evaluation Campaign

Meriem Bendris¹, Delphine Charlet², Gregory Senay³, MinYoung Kim³, Benoit Favre¹,
Mickael Rouvier¹, Frederic Bechet¹, Géraldine Damnati²
¹ Aix Marseille Université, ² OrangeLabs, ³ Panasonic Silicon Valley Lab


ABSTRACT
This paper describes the PERCOLATTE participation in the MediaEval 2015 task
“Multimodal Person Discovery in Broadcast TV”, which requires developing algorithms
for unsupervised talking-face identification in broadcast news. The proposed approach
relies on two identity propagation strategies, both based on document chaptering and
on restricted propagation rules for overlaid names. The primary submission shows a
10-point improvement in Mean Average Precision over the baseline on the INA corpus.

Figure 1: The PERCOLATTE pipeline. Our modules are outlined in blue.

1.   INTRODUCTION
   Identifying people in TV broadcasts has received a lot of attention in the literature
over the last decade. Current trends aim to combine traditional techniques with
high-level information such as prior knowledge of document structure. Indeed, TV
programs often have a regular structure organized in homogeneous sequences. The REPERE
Challenge, which ended in 2014, aimed at developing multimodal algorithms for people
identification in TV broadcasts. Our PERCOLATOR system, based on scene understanding
features, ranked first on the main task in 2014 [2]. The MediaEval “Multimodal Person
Discovery in Broadcast TV” task focuses on unsupervised talking-face identification [7]
for search engine applications. One novelty of this task is the metadata made available
by the organizers, which allows broader participation.
   This paper describes the PERCOLATTE system submitted to MediaEval 2015. The system
relies on the enrichment of broadcast news with video structure features such as shot
classification (studio/report) and speaker role recognition. Two identification
strategies were developed: the primary one is based on chapter-restricted identity
propagation to shot clusters, and the secondary one is based on speaker identification
and rule-based speaker-face mapping. Figure 1 shows the pipeline of the PERCOLATTE
system. Note that no face-related processing (detection/identification) is used in our
approach.

2.   TOOLS
   The MediaEval 2015 organizers made available different baseline mono-modal tools. In
our system, we used the provided Overlaid Person Names (OPN) [6] system. In addition, we
used the automatic named entities [4] and the speaking-face mapping to set the
identification scores.

2.1     List of names
   The Audiovisual National Institute (INA) collects broadcast news and enriches it with
metadata such as summaries, identities of journalists, etc. We collected the metadata¹
from December 2004 to December 2009 and automatically extracted a list of journalists
and anchors.

2.2     Overlaid anchor name detection
   Anchor names were not detected by the provided OCR system. We developed an anchor
name detector relying on a Levenshtein-based mapping between OCR results² (on frames
rescaled by a factor of 2) and the list of names described previously.

2.3     Speaker clustering
   The speaker clustering follows the approach described in [1]. First, speech segments
are grouped using BIC clustering. Then, the resulting clusters are modelled with GMMs in
order to compare voices more accurately using a Cross-Likelihood Ratio (CLR) criterion
in a second agglomerative clustering. At each iteration, Viterbi decoding is performed
to re-segment the speech data into speaker turns given the new clusters.

2.4     Speaker role classification
   We used a simplified version of the speaker role classification approach described
in [3]. First, the anchor is taken to be the speaker cluster that speaks the most and
most regularly. Then, a binary reporter/other classification is performed. As no speech
transcript was available in this work, the classification relies only on an acoustic
GMM classifier.

2.5     Speaker identification
   Speaker turns are identified by propagating each OPN to the speaker turn that
maximises temporal overlap and to its cluster within the same chapter.

¹ Available on http://www.ina.fr
² https://github.com/meriembendris/ADNVideo

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany
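
As a rough illustration of the Levenshtein-based mapping mentioned in Section 2.2, the sketch below matches a noisy OCR string against a list of known names. It is not the ADNVideo implementation: the function names, the lower-casing step, the `max_ratio` tolerance and the example names are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_anchor_name(ocr_text, known_names, max_ratio=0.25):
    """Return the known name closest to the OCR string, or None.

    max_ratio is a hypothetical tolerance: the edit distance must not exceed
    this fraction of the length of the best candidate name.
    """
    text = ocr_text.strip().lower()
    best_name, best_dist = None, None
    for name in known_names:
        dist = levenshtein(text, name.lower())
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    if best_name is not None and best_dist <= max_ratio * len(best_name):
        return best_name
    return None


# Made-up names, for illustration only.
names = ["Jean Dupont", "Marie Martin"]
print(match_anchor_name("Jean Dup0nt", names))    # OCR confused "o" with "0"
print(match_anchor_name("Breaking news", names))  # unrelated overlay text
```

Normalising the distance by the name length keeps the tolerance comparable between short and long names; a fixed absolute threshold would be an equally plausible choice.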
2.6     Shot boundary detection
   Two systems were used, based on RGB histogram peaks [10] and on HSV histogram peaks
over a sliding window². As the evaluation script requires the provided shot
segmentation, a mapping of shot boundaries was necessary.

2.7     Shot similarity and clustering
   In order to measure the similarity between shots, three features were extracted: RGB
histograms, HOG features on resized frames (128×64) and a DNN-based frame representation
(image embeddings). For the DNN-based features, we used the AlexNet DNN [5] to extract
feature vectors at the 3rd fully-connected layer (1000-dimensional vectors). Then, shots
were grouped using a cosine-based distance and Integer Linear Programming clustering
(described in [9]).

2.8     Shot classification and chaptering
   The shot classifier is trained on external data (8 broadcast news shows, 4914 shots).
Four labels were annotated: studio, report, mixed and other. First, HOG features on
resized frames (128×64) were extracted for each shot. Then, a Liblinear³ classifier was
trained on three quarters of the data. The system reached 99.43% accuracy on the
remaining quarter. Finally, successive shots sharing the same label were grouped into
chapters.

3.     TALKING FACE IDENTIFICATION
   Participants were asked to provide identified talking faces within shots, with their
confidence scores and evidence justifying their assertions. Two strategies were
developed.

3.1     The primary strategy
   The primary strategy relies on the fact that report chapters are independent in
broadcast news. The strategy is based on a restricted OPN propagation to shot clusters
within the same chapter. Precisely, we followed these rules:

     • Propagate OPN to overlapping shots and their shot clusters sharing the speaker
       cluster within a chapter.
     • Propagate anchor name to overlapping “studio” shots and their shot clusters
       without chapter restrictions.
     • Propagate anchor name if the speaker role is an anchor.

   For each identified talking face, the score was initialized with the provided OPN
score and incrementally increased following these events: OPN shot overlap, provided
talking-face score > 0.8, and OPN pronounced around the shot (±5 s).

3.2     The secondary strategy
   The secondary strategy is based on speaker identification followed by rule-based
speaker-face mapping. This mapping relies on simple rules based on prior knowledge about
broadcast news. Precisely, we considered a speaker visible when the name appears on the
screen (OPN), on studio shots, and on report shots when the role is not reporter. In
this strategy, no scoring function was developed (score = 1).

3.3     The evidence
   To ensure that identities were detected only in an unsupervised way, and to help
collaborative annotation of the test set, participants were asked to select one shot per
name proving his/her identity. For each name, we selected the provided OPN shot that
maximizes the OCR result score.

4.    EVALUATION
   Systems were evaluated using the Mean Average Precision (MAP) metric and the official
C and EwMAP metrics described in [7]. Two submission deadlines were set: July 1st and
July 8th. Between our two sets of submissions, the only difference concerns the shot
boundary mapping. On July 1st, the mapping was based on shots overlapping by more than
0.5 s (a rather crude strategy), while for the July 8th submissions it was based on an
overlap coverage above 50%. Four runs were submitted:

     • Primary: primary strategy with DNN- and HOG-based shot clustering.
     • Primary DNNOnly: primary strategy with DNN-based shot clustering.
     • Primary RGBOnly: primary strategy with RGB-based shot clustering.
     • Secondary: secondary strategy based on speaker identification and rule-based
       speaker-face mapping.

   Table 1 shows the results of the PERCOLATTE runs. The secondary strategy, which
follows principles similar to those of the baseline [8], shows a MAP improvement of
about 8 points. Indeed, chapter-restricted propagation, in addition to simple rule-based
speaker-face mapping based on shot classification and speaker roles, made it possible to
detect fewer talking faces with higher precision. The primary strategy using DNN- and
HOG-based shot clustering obtains the best MAP of 88.45%. This shows the consistency of
the chapter-constrained propagation strategy in broadcast news. Contrastive runs with
different features for shot clustering did not show significant differences. Anchor
names were detected in 93% of shows. However, the primary run without anchor-specific
modules reaches a MAP of 88.31%.

     Run                              EwMAP      MAP        C
     Baseline                         78.35      78.64      92.71
     Secondary on July 1st            85.89      86.12      97.68
     Secondary on July 8th            86.40      86.61      97.68
     Primary DNNOnly on July 1st      81.41      81.67      97.63
     Primary DNNOnly on July 8th      87.75      88.01      97.63
     Primary RGBOnly on July 1st      81.02      81.28      97.63
     Primary RGBOnly on July 8th      87.33      87.60      97.63
     Primary on July 1st              81.70      81.96      97.63
     Primary on July 8th              88.19      88.45      97.63

     Table 1: Performance of the PERCOLATTE 2015 runs.

5.    CONCLUSIONS
   In this paper, we described the PERCOLATTE strategies for talking-face
identification. The system, which uses no face-related processing, is based on
chapter-restricted propagation of overlaid names. A significant improvement over the
baseline is achieved on the INA corpus by the primary strategy (+10 points of MAP).
Results show that in structured programs, easy-to-establish features such as shot
classification and prior knowledge about broadcast news significantly improve
talking-face identification.

Acknowledgment. This work has been carried out thanks to the support of the A*MIDEX
project (no. ANR-11-IDEX-0001-02) funded by the “Investissements d’Avenir” French
Government program, managed by the French National Research Agency (ANR).

³ http://www.csie.ntu.edu.tw/~cjlin/liblinear/
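
The two shot boundary mapping rules compared in Section 4 (overlap longer than 0.5 s for the July 1st runs, overlap coverage above 50% for the July 8th runs) can be sketched as follows. This is only an illustration under stated assumptions: shots are represented as (start, end) pairs in seconds, coverage is taken to be the overlap duration divided by the duration of the provided (reference) shot, and all function names are hypothetical.

```python
def overlap(a, b):
    """Duration in seconds shared by two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def map_by_duration(detected, reference, min_overlap=0.5):
    """July 1st style rule: indices of reference shots that share
    more than min_overlap seconds with the detected shot."""
    return [i for i, ref in enumerate(reference)
            if overlap(detected, ref) > min_overlap]


def map_by_coverage(detected, reference, min_coverage=0.5):
    """July 8th style rule: indices of reference shots covered at
    more than min_coverage by the detected shot.

    Assumption: coverage = overlap duration / reference shot duration.
    """
    mapped = []
    for i, (start, end) in enumerate(reference):
        duration = end - start
        if duration > 0 and overlap(detected, (start, end)) / duration > min_coverage:
            mapped.append(i)
    return mapped


# Made-up timings: one detected shot against three provided shots.
reference_shots = [(0.0, 4.0), (4.0, 10.0), (10.0, 12.0)]
detected_shot = (3.2, 10.3)
print(map_by_duration(detected_shot, reference_shots))   # [0, 1]
print(map_by_coverage(detected_shot, reference_shots))   # [1]
```

In this toy case the duration rule also attaches the detected shot to the first provided shot, which it only touches for 0.8 s out of 4 s, whereas the coverage rule keeps only the shot it mostly spans. This is one plausible reading of why the 0.5 s rule is described as crude.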
6.    REFERENCES
 [1] C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain. Multi-stage
     speaker diarization of broadcast news. IEEE Transactions on
     Audio, Speech and Language Processing, 2006.
 [2] F. Bechet, M. Bendris, D. Charlet, G. Damnati, B. Favre,
     M. Rouvier, R. Auguste, B. Bigot, R. Dufour, C. Fredouille,
     G. Linares, G. Senay, P. Tirilly, and J. Martinet. Multimodal
     understanding for person recognition in video broadcasts. In
     Interspeech, Singapore, 2014.
 [3] G. Damnati and D. Charlet. Multi-view approach for speaker
     turn role labeling in TV broadcast news shows. In
     INTERSPEECH, pages 1285–1288. ISCA, 2011.
 [4] M. Dinarelli and S. Rosset. Models cascade for tree-structured
     named entity detection. In Proceedings of 5th International
     Joint Conference on Natural Language Processing, pages
     1269–1278, Chiang Mai, Thailand, November 2011. Asian
     Federation of Natural Language Processing.
 [5] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
     classification with deep convolutional neural networks. In
     F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors,
     Advances in Neural Information Processing Systems 25, pages
     1097–1105. Curran Associates, Inc., 2012.
 [6] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From
     text detection in videos to person identification. In Multimedia
     and Expo (ICME), 2012 IEEE International Conference on,
     2012.
 [7] J. Poignant, H. Bredin, and C. Barras. Multimodal person
     discovery in broadcast TV at MediaEval 2015. In MediaEval,
     2015.
 [8] J. Poignant, H. Bredin, V.-B. Le, L. Besacier, C. Barras, and
     G. Quénot. Unsupervised Speaker Identification using Overlaid
     Texts in TV Broadcast. In Interspeech 2012 - Conference of
     the International Speech Communication Association,
     Portland, OR, United States, 2012. Poster Session: Speaker
     Recognition III.
 [9] M. Rouvier and S. Meignier. A global optimization framework
     for speaker diarization. In Speaker Odyssey, 2012.
[10] H. Zhang, R. Hu, and L. Song. A shot boundary detection
     method based on color feature. In Computer Science and
     Network Technology (ICCSNT), 2011 International
     Conference on, 2011.