<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PERCOLATTE: A Multimodal Person Discovery System in TV Broadcast for the MediaEval 2015 Evaluation Campaign</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Meriem Bendris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Delphine Charlet</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gregory Senay</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>MinYoung Kim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benoit Favre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mickael Rouvier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frederic Bechet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Géraldine Damnati</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aix Marseille Université</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Panasonic Silicon Valley Lab</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the PERCOLATTE participation in the MediaEval 2015 task "Multimodal Person Discovery in Broadcast TV", which requires developing algorithms for unsupervised talking-face identification in broadcast news. The proposed approach relies on two identity propagation strategies, both based on document chaptering and restricted overlaid-name propagation rules. The primary submission shows a 10% improvement in Mean Average Precision over the baseline on the INA corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Identifying people in TV broadcasts has received considerable
attention in the literature over the last decade. Current trends
aim to combine traditional techniques with high-level
information such as prior knowledge of document structure.
Indeed, TV programs often have a regular structure organized
in homogeneous sequences. The REPERE Challenge, which
ended in 2014, aimed at developing multimodal algorithms
for people identification in TV broadcasts. Our
PERCOLATOR system, based on scene understanding features, ranked
first on the main task in 2014 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The MediaEval
"Multimodal Person Discovery in Broadcast TV" task focuses on
unsupervised talking-face identification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for search engine
applications. One novelty of this task is the metadata made
available by the organizers, enabling broader participation.
      </p>
      <p>This paper describes the PERCOLATTE system
submitted to MediaEval 2015. The system relies on the
enrichment of broadcast news with video structure features
such as shot classification (studio/report) and speaker role
recognition. Two identification strategies were developed:
the primary is based on chapter-restricted identity
propagation to shot clusters, and the secondary is based on speaker
identification and rule-based speaker-face mapping. Figure
1 shows the pipeline of the PERCOLATTE system. Notice
that no face-related processing (detection/identification) is
used in our approach.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TOOLS</title>
      <p>
        The MediaEval 2015 organizers made available different
baseline mono-modal tools. In our system, we used the
provided Overlaid Person Names (OPN) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] system. In
addition, we used the automatic named entities [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the
speaking-face mapping to fix the identification scores.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 List of names</title>
      <p>The French National Audiovisual Institute (INA) collects and
enriches broadcast news with metadata such as summaries,
the identity of journalists, etc. We collected this metadata1 from
December 2004 to December 2009 and automatically extracted
a list of journalists and anchors.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Overlaid anchor name detection</title>
      <p>Anchor names were not detected by the provided OCR
system. We therefore developed an anchor name detector relying
on a Levenshtein-based mapping of the OCR results2 (on 2
rescaled frames) against the list of names described previously.</p>
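      <p>A minimal sketch of such a Levenshtein-based mapping is given below (illustrative Python: the sample names and the 0.3 distance ratio are our assumptions, not the system's actual parameters):</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def match_anchor_name(ocr_text, known_names, max_ratio=0.3):
    """Map a (possibly noisy) OCR string to the closest known name;
    reject the match when the edit distance exceeds `max_ratio`
    times the length of the candidate name."""
    text = ocr_text.lower().strip()
    best_d, best = min((levenshtein(text, name.lower()), name)
                       for name in known_names)
    if best_d > max_ratio * len(best):
        return None
    return best
```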
    </sec>
    <sec id="sec-5">
      <title>2.3 Speaker clustering</title>
      <p>
        The speaker clustering follows the approach described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
First, speech segments are grouped using BIC clustering.
Then, the obtained clusters are modelled with GMMs in order
to compare voices more accurately with the Cross-Likelihood
Ratio (CLR) criterion in a second agglomerative clustering. At
each iteration, Viterbi decoding is performed to re-segment
the speech data into speaker turns given the new clusters.
      </p>
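      <p>The agglomerative stages can be sketched as follows (a toy Python sketch in which Euclidean distance between cluster means stands in for the BIC and CLR criteria; the stopping threshold is illustrative):</p>

```python
def mean_vector(cluster):
    """Mean feature vector over all frames of a cluster
    (a cluster is a list of segments; a segment a list of frame vectors)."""
    frames = [f for seg in cluster for f in seg]
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(dim)]

def mean_distance(c1, c2):
    """Toy stand-in for the BIC/CLR merging criteria: Euclidean
    distance between the mean vectors of two clusters."""
    m1, m2 = mean_vector(c1), mean_vector(c2)
    return sum((a - b) ** 2 for a, b in zip(m1, m2)) ** 0.5

def agglomerative_clustering(segments, distance, threshold):
    """Bottom-up clustering: repeatedly merge the closest pair of
    clusters while their distance stays below `threshold`."""
    clusters = [[seg] for seg in segments]
    while len(clusters) > 1:
        d, i, j = min((distance(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d >= threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```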
    </sec>
    <sec id="sec-6">
      <title>2.4 Speaker role classification</title>
      <p>
        We used a simplified version of the speaker role
classification approach described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. First, the anchor is identified as the
speaker cluster that speaks the most and most regularly. Then,
a binary reporter/other classification is performed. As no
speech transcript was available, the classification in this work
relies only on an acoustic GMM classifier.
      </p>
    </sec>
    <sec id="sec-7">
      <title>2.5 Speaker identification</title>
      <p>Speaker turns are identified by propagating each OPN to the
speaker turn that maximizes temporal overlap, and to
its cluster within the same chapter.
1Available on http://www.ina.fr
2https://github.com/meriembendris/ADNVideo</p>
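      <p>The overlap-maximizing propagation can be sketched as follows (illustrative Python; the function and variable names are ours):</p>

```python
def overlap(a, b):
    """Length of the temporal intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def propagate_opn(opn, turns):
    """Assign an overlaid person name to the speaker turn with maximal
    temporal overlap; `opn` is (start, end, name), `turns` a list of
    (start, end) intervals. Returns (turn_index, name), or None when
    the OPN overlaps no turn at all."""
    start, end, name = opn
    scores = [overlap((start, end), turn) for turn in turns]
    if max(scores, default=0.0) == 0.0:
        return None
    return scores.index(max(scores)), name
```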
    </sec>
    <sec id="sec-8">
      <title>2.6 Shot boundary detection</title>
      <p>
        Two shot boundary detection systems were used, based on RGB histogram peaks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and on HSV histogram peaks over a sliding window2. As the
evaluation script needs the provided shot segmentation, a
mapping to the provided shot boundaries was necessary.
      </p>
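      <p>The histogram-peak principle behind both detectors can be sketched as follows (a minimal Python sketch assuming grayscale frames and an illustrative threshold; the actual systems work on RGB and HSV histograms):</p>

```python
def histogram(frame, bins=16):
    """Normalized intensity histogram of a frame given as a flat
    list of pixel values in [0, 1]."""
    h = [0] * bins
    for p in frame:
        h[min(int(p * bins), bins - 1)] += 1
    return [count / len(frame) for count in h]

def shot_boundaries(frames, threshold=0.5):
    """Declare a shot cut before frame i when the L1 distance between
    the histograms of frames i-1 and i exceeds `threshold`."""
    hists = [histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i])) > threshold]
```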
    </sec>
    <sec id="sec-9">
      <title>2.7 Shot similarity and clustering</title>
      <p>
        In order to measure the similarity between shots, three
features were extracted: RGB histograms, HOG features
on resized frames (128x64) and a DNN-based frame
representation (image embeddings). For the DNN-based features, we
used the AlexNet DNN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to extract feature vectors at the
3rd fully-connected layer (1000-dimensional vectors). Then,
shots were grouped using cosine distance and Integer
Linear Programming clustering (described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
      </p>
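      <p>The paper clusters shots with an ILP formulation [9]; a greedy cosine-similarity pass, shown below only as an illustrative approximation (the 0.7 similarity threshold is assumed), conveys the idea:</p>

```python
def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def cluster_shots(features, min_sim=0.7):
    """Greedy single-pass grouping: attach each shot to the most similar
    cluster representative when similarity reaches `min_sim`,
    else open a new cluster. Returns one cluster label per shot."""
    reps, labels = [], []
    for f in features:
        sims = [cosine_similarity(f, r) for r in reps]
        if sims and max(sims) >= min_sim:
            labels.append(sims.index(max(sims)))
        else:
            labels.append(len(reps))
            reps.append(list(f))
    return labels
```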
    </sec>
    <sec id="sec-10">
      <title>2.8 Shot classification and chaptering</title>
      <p>The shot classifier is trained on external data (8
broadcast news shows, 4914 shots). Four labels were annotated: studio,
report, mixed and other. First, HOG features on resized
frames (128x64) were extracted for each shot. Then, a
Liblinear3 classifier was trained on three quarters of the data. The
system reached 99.43% accuracy on the remaining
quarter. Finally, successive shots sharing the same label were
grouped into chapters.</p>
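      <p>The chaptering step, grouping successive shots sharing the same label, can be sketched as:</p>

```python
from itertools import groupby

def chapters(shot_labels):
    """Group successive shots sharing the same class label into chapters;
    returns (label, start, end) triples with `end` exclusive."""
    out, i = [], 0
    for label, run in groupby(shot_labels):
        n = len(list(run))
        out.append((label, i, i + n))
        i += n
    return out
```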
    </sec>
    <sec id="sec-11">
      <title>3. TALKING FACE IDENTIFICATION</title>
      <p>Participants were asked to provide identified talking faces
within shots, with confidence scores and evidence
justifying their assertions. Two strategies were developed.</p>
    </sec>
    <sec id="sec-12">
      <title>3.1 The primary strategy</title>
      <p>The primary strategy relies on the fact that report
chapters are independent in broadcast news. The strategy is
based on a restricted OPN propagation to shot clusters within
the same chapter. Precisely, we followed these rules:
propagate an OPN to overlapping shots and to their shot
clusters sharing the same speaker cluster within a chapter;
propagate the anchor name to overlapping "studio" shots
and their shot clusters, without chapter restriction.</p>
      <p>Propagate anchor name if the speaker role is an anchor.</p>
      <p>For each identified talking face, the score was initialized with
the provided OPN score and incrementally increased
following these events: OPN-shot overlap, provided talking-face
score &gt; 0.8, and OPN pronounced around the shot (5s).</p>
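      <p>This scoring scheme can be sketched as follows (illustrative Python; the paper does not specify the size of the increment, so the 0.05 bonus and the cap at 1.0 are assumptions):</p>

```python
def talking_face_score(opn_score, opn_overlaps_shot,
                       provided_tf_score, opn_spoken_nearby,
                       bonus=0.05):
    """Initialize the confidence with the OPN score, then bump it once
    per supporting event; the 0.05 increment and the cap at 1.0 are
    illustrative choices, not values from the paper."""
    score = opn_score
    if opn_overlaps_shot:          # the OPN temporally overlaps the shot
        score += bonus
    if provided_tf_score > 0.8:    # provided talking-face score above 0.8
        score += bonus
    if opn_spoken_nearby:          # name pronounced within 5 s of the shot
        score += bonus
    return min(score, 1.0)
```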
    </sec>
    <sec id="sec-13">
      <title>3.2 The secondary strategy</title>
      <p>The secondary strategy is based on speaker identification
followed by a rule-based speaker-face mapping. This mapping
relies on simple rules based on prior knowledge about
broadcast news. Precisely, we considered a speaker visible when
the name appears on the screen (OPN), on studio shots, and
on report shots when the role is not reporter. In this
strategy, no score function was developed (score = 1).</p>
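      <p>These visibility rules can be sketched as:</p>

```python
def speaker_visible(has_opn, shot_type, speaker_role):
    """Secondary-strategy visibility rules: a speaker is considered
    visible when their name is overlaid (OPN), on studio shots, and
    on report shots unless their role is reporter."""
    if has_opn:
        return True
    if shot_type == "studio":
        return True
    if shot_type == "report" and speaker_role != "reporter":
        return True
    return False
```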
    </sec>
    <sec id="sec-14">
      <title>3.3 The evidence</title>
      <p>To ensure that identities were detected in a fully
unsupervised way, and to help the collaborative annotation of the
test set, participants were asked to select one shot per name
proving his/her identity. For each name, we selected the
provided OPN shot that maximizes the OCR result score.
3http://www.csie.ntu.edu.tw/~cjlin/liblinear/
</p>
    </sec>
    <sec id="sec-15">
      <title>4. RESULTS</title>
      <p>
        Systems were evaluated using the Mean Average
Precision (MAP) metric and the official C and EwMAP metrics
described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Two submission deadlines were fixed: July
1st and 8th. The only difference between our submissions concerns
the shot boundary mapping. Indeed, for July 1st the
mapping was based on shots overlapping by more than 0.5s (a rather crude
strategy), while it was based on an overlap coverage above 50%
for the July 8th submissions. Four runs were submitted:
Primary: primary strategy with DNN- and
HOG-based shot clustering.
      </p>
      <p>Primary DNNOnly: primary strategy with
DNN-based shot clustering.</p>
      <p>Primary RGBOnly: primary strategy with
RGB-based shot clustering.</p>
      <p>Secondary: secondary strategy based on speaker
identification and rule-based speaker-face mapping.</p>
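      <p>The two shot-boundary mapping criteria that differ between the deadlines can be sketched as follows (illustrative Python; function and parameter names are ours):</p>

```python
def map_boundary(detected, provided_shots, min_seconds=0.5, min_coverage=None):
    """Map a detected shot (start, end) onto the provided shot list:
    keep a provided shot when the raw overlap exceeds `min_seconds`
    (the July 1st criterion) or, when `min_coverage` is given, when the
    overlap covers at least that fraction of the provided shot (July 8th)."""
    matches = []
    for idx, (s, e) in enumerate(provided_shots):
        ov = max(0.0, min(detected[1], e) - max(detected[0], s))
        if min_coverage is not None:
            if ov / (e - s) >= min_coverage:
                matches.append(idx)
        elif ov > min_seconds:
            matches.append(idx)
    return matches
```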
      <p>
        Table 1 shows the results of the PERCOLATTE runs. The
secondary strategy, which follows principles similar to the
baseline [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], shows a MAP improvement of 8%. Indeed,
chapter-restricted propagation, combined with simple rule-based
speaker-face mapping based on shot classification and speaker roles,
detects fewer talking faces but with higher precision.
The primary strategy using DNN- and HOG-based shot
clustering obtains the best MAP of 88.45%. This shows the
consistency of the chapter-constrained propagation strategy in
broadcast news. Contrastive runs with different features for
shot clustering did not show significant differences. Anchor
names were detected in 93% of the shows. However, the
primary run without the anchor-specific modules still achieves a
MAP of 88.31%.
      </p>
      <p>Table 1: Results of the baseline and of the Secondary,
Primary DNNOnly, Primary RGBOnly and Primary runs, for the
July 1st and July 8th submission deadlines.</p>
      <p>Acknowledgment. This work has been carried out thanks to the
support of the A*MIDEX project (no. ANR-11-IDEX-0001-02) funded
by the "Investissements d'Avenir" French Government program,
managed by the French National Research Agency (ANR).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          <article-title>. Multi-stage speaker diarization of broadcast news</article-title>
          .
          <source>IEEE Transactions on Audio, Speech and Language Processing</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bechet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Linares, G. Senay,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirilly</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          .
          <article-title>Multimodal understanding for person recognition in video broadcasts</article-title>
          .
          <source>In Interspeech, Singapore</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          <article-title>. Multi-view approach for speaker turn role labeling in TV broadcast news shows</article-title>
          .
          <source>INTERSPEECH</source>
          , pages 1285-1288. ISCA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dinarelli</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          .
          <article-title>Models cascade for tree-structured named entity detection</article-title>
          .
          <source>In Proceedings of 5th International Joint Conference on Natural Language Processing</source>
          , pages 1269-1278, Chiang Mai, Thailand,
          <year>November 2011</year>
          . Asian Federation of Natural Language Processing
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          . In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>25</volume>
          , pages 1097-1105. Curran Associates, Inc.,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          . In
          <source>IEEE International Conference on Multimedia and Expo (ICME)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal person discovery in broadcast TV at MediaEval 2015</article-title>
          . In MediaEval,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast</article-title>
          . In Interspeech 2012 - Conference of the International Speech Communication Association, Portland, OR, United States,
          <year>2012</year>
          . Poster Session: Speaker Recognition III
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          .
          <article-title>A global optimization framework for speaker diarization</article-title>
          .
          <source>In Speaker Odyssey</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>A shot boundary detection method based on color feature</article-title>
          .
          <source>International Conference on Computer Science and Network Technology (ICCSNT)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>