Multimodal Person Discovery in Broadcast TV at MediaEval 2015

Johann Poignant, Hervé Bredin, Claude Barras
LIMSI - CNRS - Rue John Von Neumann, Orsay, France.
firstname.lastname@limsi.fr

ABSTRACT

We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2015 benchmarking initiative. Participants were asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people was not known a priori and their names had to be discovered in an unsupervised way from the media content, using text overlay or speech transcripts. The task was evaluated using information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.

1. MOTIVATION

TV archives maintained by national institutions such as the French INA, the Netherlands Institute for Sound & Vision, or the British Broadcasting Corporation are rapidly growing in size. The need for applications that make these archives searchable has led researchers to devote concerted effort to developing technologies that create indexes.

Indexes that represent the location and identity of people in the archive are indispensable for searching it. Human nature leads people to be very interested in other people. However, when content is created or broadcast, it is not always possible to predict which people will be the most important to find in the future. For this reason, it cannot be assumed that biometric models will always be available at indexing time. For some people, such a model may not be available in advance, simply because they are not (yet) famous. In such cases, it is also possible that archivists annotating content by hand do not even know the name of the person. The goal of this task is to address the challenge of indexing people in the archive under real-world conditions, i.e. when there is no pre-set list of people to index.

Canseco et al. [8, 9] pioneered approaches relying on pronounced names instead of biometric models for speaker identification [13, 19, 22, 30]. However, due to relatively high speech transcription and named entity detection error rates, these audio-only approaches did not achieve good enough identification performance. Similarly, for face recognition, initial visual-only approaches based on overlaid title box transcriptions were very dependent on the quality of overlaid name transcription [18, 29, 32, 33].

Started in 2011, the REPERE challenge aimed at supporting research on multimodal person recognition [3, 20] to overcome the limitations of monomodal approaches. Its main goal was to answer the two questions "who speaks when?" and "who appears when?" using any available source of information (including pre-existing biometric models and person names extracted from text overlay and speech transcripts). To assess technology progress, annual evaluations were organized in 2012, 2013 and 2014. Thanks to this challenge and the associated multimodal corpus [16], significant progress was achieved in both supervised and unsupervised multimodal person recognition [1, 2, 4, 5, 6, 7, 14, 15, 23, 25, 26, 27, 28]. The REPERE challenge came to an end in 2014 and this task can be seen as a follow-up campaign, with a strong focus on unsupervised person recognition.

2. DEFINITION OF THE TASK

Participants were provided with a collection of TV broadcast recordings pre-segmented into shots. Each shot s ∈ S had to be automatically tagged with the names of the people both speaking and appearing at the same time during the shot: this tagging algorithm is denoted by L : S → P(N) in the rest of the paper. The main novelty of the task is that the list of persons was not provided a priori, and person biometric models (neither voice nor face) could not be trained on external data. The only way to identify a person was to find their name in the audio stream (e.g. using speech transcription – ASR) or in the visual stream (e.g. using optical character recognition – OCR) and to associate it with the correct person. This made the task completely unsupervised (i.e. using algorithms not relying on pre-existing labels or biometric models).

Because person names were detected and transcribed automatically, they could contain transcription errors to a certain extent (more on that in Section 5). In the following, we denote by 𝒩 the set of all possible person names in the universe, correctly formatted as firstname_lastname, while N is the set of hypothesized names.

Figure 1: For each shot, participants had to return the names of every speaking face. Each name had to be backed up by an evidence.
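To make the expected output concrete, a run can be viewed as an implementation of the tagging function L : S → P(N): a mapping from shots to hypothesized names, each with a confidence score used later for ranking. The snippet below is only a minimal sketch; the shot identifiers, names and confidence convention are hypothetical and do not correspond to the official submission format of the task.

```python
# Minimal sketch of a run, i.e. the tagging function L : S -> P(N).
# Shot identifiers, names and the per-tag confidence score are
# hypothetical; this is NOT the official submission format of the task.
from typing import Dict

# shot id -> {hypothesized firstname_lastname -> confidence in [0, 1]}
Run = Dict[str, Dict[str, float]]

run: Run = {
    "video_x.shot_0001": {"john_doe": 0.92},
    "video_x.shot_0002": {},  # nobody both speaks and appears in this shot
    "video_x.shot_0003": {"john_doe": 0.61, "jane_doe": 0.57},
}
```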
To ensure that participants followed this strict "no biometric supervision" constraint, each hypothesized name n ∈ N had to be backed up by a carefully selected and unique shot proving that the person actually holds this name n: we call this an evidence and denote it by E : N → S. In real-world conditions, this evidence would help a human annotator double-check the automatically generated index, even for people they did not know beforehand.

Two types of evidence were allowed: an image evidence is a shot during which a person is visible while their name is written on screen; an audio evidence is a shot during which a person is visible while their name is pronounced at least once in a [shot start time − 5s, shot end time + 5s] neighborhood. For instance, in Figure 1, shot #1 is an image evidence for Mr A (because his name and his face are visible simultaneously on screen) while shot #3 is an audio evidence for Mrs B (because her name is pronounced less than 5 seconds before or after her face is visible on screen).
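The two evidence rules can be made concrete with a minimal sketch, assuming hypothetical Shot and NameOccurrence structures built from the OCR and ASR outputs; only the person-visibility requirement and the ±5 second neighborhood come from the task definition above.

```python
# Minimal sketch of the two evidence rules of Section 2, assuming
# hypothetical data structures for shots and time-stamped name occurrences.
from dataclasses import dataclass
from typing import List

@dataclass
class Shot:
    start: float          # shot start time (seconds)
    end: float            # shot end time (seconds)
    visible_person: bool  # the hypothesized person is visible in this shot

@dataclass
class NameOccurrence:
    name: str    # normalized as firstname_lastname
    start: float # start time of the written / pronounced name (seconds)
    end: float   # end time (seconds)

def is_image_evidence(shot: Shot, name: str, written: List[NameOccurrence]) -> bool:
    """The person is visible while their name is written on screen."""
    return shot.visible_person and any(
        o.name == name and o.start < shot.end and o.end > shot.start
        for o in written
    )

def is_audio_evidence(shot: Shot, name: str, pronounced: List[NameOccurrence]) -> bool:
    """The person is visible and their name is pronounced at least once
    within a [shot start - 5s, shot end + 5s] neighborhood."""
    return shot.visible_person and any(
        o.name == name and o.start < shot.end + 5.0 and o.end > shot.start - 5.0
        for o in pronounced
    )
```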
3. DATASETS

The REPERE corpus – distributed by ELDA – served as development set. It is composed of various TV shows (news, politics and people) from two French TV channels, for a total of 137 hours. A subset of 50 hours is manually annotated. Audio annotations are dense and provide speech transcripts and identity-labeled speech turns. Video annotations are sparse (one image every 10 seconds) and provide overlaid text transcripts and identity-labeled face segmentation. Both speech and overlaid text transcripts are tagged with named entities.

The test set – distributed by INA – contains 106 hours of video, corresponding to 172 editions of the evening broadcast news "Le 20 heures" of the French public channel "France 2", from January 1st 2007 to June 30th 2007.

As the test set came completely free of any annotation, it was annotated a posteriori based on participants' submissions. In the following, task groundtruths are denoted by the function 𝓛 : S → P(𝒩) that maps each shot s to the set of names of every speaking face it contains, and the function 𝓔 : S → P(𝒩) that maps each shot s to the set of person names for which it actually is an evidence.

4. BASELINE AND METADATA

This task targeted researchers from several communities including multimedia, computer vision, speech and natural language processing. Though the task was multimodal by design and necessitated expertise in various domains, the technological barrier to entry was lowered by the provision of a baseline system, described in Figure 2 and available as open-source software (http://github.com/MediaEvalPersonDiscoveryTask). For instance, a researcher from the speech processing community could focus their research efforts on improving speaker diarization and automatic speech transcription, while still relying on the provided face detection and tracking results to participate in the task.

Figure 2: Multimodal baseline pipeline. Output of greyed-out modules is provided to the participants.

The audio stream was segmented into speech turns, while faces were detected and tracked in the visual stream. Speech turns (resp. face tracks) were then compared and clustered based on MFCC features and the Bayesian Information Criterion [10] (resp. HOG features [11] and Logistic Discriminant Metric Learning [17] on facial landmarks [31]). The approach proposed in [27] was also used to compute a probabilistic mapping between co-occurring faces and speech turns. Written (resp. pronounced) person names were automatically extracted from the visual stream (resp. the audio stream) using the open-source LOOV optical character recognition system [24] (resp. automatic speech recognition [21, 12]), followed by named entity detection. The fusion module was a two-step algorithm: propagation of written names onto speaker clusters [26], followed by propagation of speaker names onto co-occurring speaking faces.
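As a rough illustration of this two-step fusion, the sketch below propagates written names onto speaker clusters by simple co-occurrence voting, then tags each shot with the names of the speaker clusters associated with its speaking faces. The data structures, the voting scheme and the assumption that speaking faces have already been mapped to speaker clusters are simplifications introduced here; the actual open-source baseline and the method of [26] are more elaborate.

```python
# Schematic sketch of the two-step fusion:
#  1. propagate written names onto speaker clusters,
#  2. propagate speaker-cluster names onto shots via their speaking faces.
# Data structures and the majority-vote scheme are illustrative assumptions.
from collections import Counter, defaultdict
from typing import Dict, List, Set

def name_speaker_clusters(
    written_names_per_turn: Dict[str, List[str]],  # speech turn id -> names written during the turn
    cluster_of_turn: Dict[str, str],                # speech turn id -> speaker cluster id
) -> Dict[str, str]:
    """Step 1: assign to each speaker cluster the written name that
    co-occurs most often with its speech turns (majority vote)."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for turn, names in written_names_per_turn.items():
        votes[cluster_of_turn[turn]].update(names)
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items() if counter}

def tag_shots(
    speaking_face_clusters_per_shot: Dict[str, Set[str]],  # shot id -> speaker clusters of its speaking faces
    cluster_name: Dict[str, str],                           # output of step 1
) -> Dict[str, Set[str]]:
    """Step 2: tag each shot with the names of the speaker clusters
    associated with its speaking faces (unnamed clusters are skipped)."""
    return {
        shot: {cluster_name[c] for c in clusters if c in cluster_name}
        for shot, clusters in speaking_face_clusters_per_shot.items()
    }
```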
5. EVALUATION METRIC

This information retrieval task was evaluated using a variant of Mean Average Precision (MAP) that took the quality of the evidences into account. For each query q ∈ Q ⊂ 𝒩 (firstname_lastname), the hypothesized person name n_q with the highest Levenshtein ratio ρ to the query q is selected (ρ : 𝒩 × N → [0, 1]), allowing approximate name transcription:

    n_q = argmax_{n ∈ N} ρ(q, n)   and   ρ_q = ρ(q, n_q)

Average precision AP(q) is then computed classically based on relevant and returned shots:

    relevant(q) = {s ∈ S | q ∈ 𝓛(s)}
    returned(q) = {s ∈ S | n_q ∈ L(s)}, sorted by confidence

A proposed evidence is correct if the name n_q is close enough to the query q and if the shot E(n_q) actually is an evidence for q:

    C(q) = 1 if ρ_q > 0.95 and q ∈ 𝓔(E(n_q)),  0 otherwise

To ensure that participants provide correct evidences for every hypothesized name n ∈ N, standard MAP is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:

    EwMAP = (1 / |Q|) Σ_{q ∈ Q} C(q) · AP(q)
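The metric can be made concrete with a short sketch. The data structures, the tie-breaking and the exact normalization of the Levenshtein ratio ρ are assumptions introduced here for illustration (the official scorer may define them differently); only the 0.95 threshold, the classical AP computation over relevant(q) and the C(q)·AP(q) weighting follow the definitions above.

```python
# Minimal sketch of EwMAP, assuming hypothetical data structures.
# rho() uses one plausible normalization of the Levenshtein distance;
# the official scorer may normalize or break ties differently.
from typing import List, Set

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def rho(a: str, b: str) -> float:
    """Name similarity in [0, 1] derived from the edit distance."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def average_precision(returned: List[str], relevant: Set[str]) -> float:
    """Classical AP over a ranked list of shots."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, shot in enumerate(returned, 1):
        if shot in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant)

def ewmap(queries, hypotheses, evidence, gt_names, gt_evidence) -> float:
    """queries: list of firstname_lastname queries (Q)
    hypotheses: hypothesized name -> list of (shot, confidence) tags
    evidence:   hypothesized name -> evidence shot proposed by the participant
    gt_names:   shot -> set of true speaking-face names (groundtruth L)
    gt_evidence: shot -> set of names it truly is an evidence for (groundtruth E)"""
    total = 0.0
    for q in queries:
        if not hypotheses:
            continue  # no hypothesized name at all: AP(q) counts as 0
        n_q = max(hypotheses, key=lambda n: rho(q, n))
        rho_q = rho(q, n_q)
        relevant = {s for s, names in gt_names.items() if q in names}
        returned = [s for s, _ in sorted(hypotheses[n_q], key=lambda x: -x[1])]
        correct = 1.0 if rho_q > 0.95 and q in gt_evidence.get(evidence.get(n_q, ""), set()) else 0.0
        total += correct * average_precision(returned, relevant)
    return total / len(queries) if queries else 0.0
```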
Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-CHRI-0006-01. The open-source CAMOMILE collaborative annotation platform (http://github.com/camomile-project) was used extensively throughout the progress of the task: from the run submission script to the automated leaderboard, including the a posteriori collaborative annotation of the test corpus. We thank ELDA and INA for supporting the task by distributing the development and test datasets.

6. REFERENCES

[1] F. Bechet, M. Bendris, D. Charlet, G. Damnati, B. Favre, M. Rouvier, R. Auguste, B. Bigot, R. Dufour, C. Fredouille, G. Linarès, J. Martinet, G. Senay, and P. Tirilly. Multimodal Understanding for Person Recognition in Video Broadcasts. In INTERSPEECH, 2014.
[2] M. Bendris, B. Favre, D. Charlet, G. Damnati, R. Auguste, J. Martinet, and G. Senay. Unsupervised Face Identification in TV Content using Audio-Visual Sources. In CBMI, 2013.
[3] G. Bernard, O. Galibert, and J. Kahn. The First Official REPERE Evaluation. In SLAM-INTERSPEECH, 2013.
[4] H. Bredin, A. Laurent, A. Sarkar, V.-B. Le, S. Rosset, and C. Barras. Person Instance Graphs for Named Speaker Identification in TV Broadcast. In Odyssey, 2014.
[5] H. Bredin and J. Poignant. Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. In INTERSPEECH, 2013.
[6] H. Bredin, J. Poignant, G. Fortier, M. Tapaswi, V.-B. Le, A. Sarkar, C. Barras, S. Rosset, A. Roy, Q. Yang, H. Gao, A. Mignon, J. Verbeek, L. Besacier, G. Quénot, H. K. Ekenel, and R. Stiefelhagen. QCompere at REPERE 2013. In SLAM-INTERSPEECH, 2013.
[7] H. Bredin, A. Roy, V.-B. Le, and C. Barras. Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast. IJMIR, 2014.
[8] L. Canseco, L. Lamel, and J.-L. Gauvain. A Comparative Study Using Manual and Automatic Transcriptions for Diarization. In ASRU, 2005.
[9] L. Canseco-Rodriguez, L. Lamel, and J.-L. Gauvain. Speaker diarization from speech transcripts. In INTERSPEECH, 2004.
[10] S. Chen and P. Gopalakrishnan. Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion. In DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[11] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[12] M. Dinarelli and S. Rosset. Models Cascade for Tree-Structured Named Entity Detection. In IJCNLP, 2011.
[13] Y. Estève, S. Meignier, P. Deléglise, and J. Mauclair. Extracting true speaker identities from transcriptions. In INTERSPEECH, 2007.
[14] B. Favre, G. Damnati, F. Béchet, M. Bendris, D. Charlet, R. Auguste, S. Ayache, B. Bigot, A. Delteil, R. Dufour, C. Fredouille, G. Linares, J. Martinet, G. Senay, and P. Tirilly. PERCOLI: a person identification system for the 2013 REPERE challenge. In SLAM-INTERSPEECH, 2013.
[15] P. Gay, G. Dupuy, C. Lailler, J.-M. Odobez, S. Meignier, and P. Deléglise. Comparison of Two Methods for Unsupervised Person Identification in TV Shows. In CBMI, 2014.
[16] A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard. The REPERE Corpus: a Multimodal Corpus for Person Recognition. In LREC, 2012.
[17] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Face recognition from caption-based supervision. IJCV, 96(1), 2012.
[18] R. Houghton. Named Faces: Putting Names to Faces. IEEE Intelligent Systems, 14, 1999.
[19] V. Jousse, S. Petit-Renaud, S. Meignier, Y. Estève, and C. Jacquin. Automatic named identification of speakers using diarization and ASR systems. In ICASSP, 2009.
[20] J. Kahn, O. Galibert, L. Quintard, M. Carré, A. Giraudel, and P. Joly. A presentation of the REPERE challenge. In CBMI, 2012.
[21] L. Lamel, S. Courcinous, J. Despres, J. Gauvain, Y. Josse, K. Kilgour, F. Kraft, V.-B. Le, H. Ney, M. Nussbaum-Thom, I. Oparin, T. Schlippe, R. Schlüter, T. Schultz, T. F. da Silva, S. Stüker, M. Sundermeyer, B. Vieru, N. Vu, A. Waibel, and C. Woehrling. Speech Recognition for Machine Translation in Quaero. In IWSLT, 2011.
[22] J. Mauclair, S. Meignier, and Y. Estève. Speaker diarization: about whom the speaker is talking? In Odyssey, 2006.
[23] J. Poignant, L. Besacier, and G. Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE/ACM ASLP, 23(1), 2015.
[24] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In ICME, 2012.
[25] J. Poignant, H. Bredin, L. Besacier, G. Quénot, and C. Barras. Towards a better integration of written names for unsupervised speakers identification in videos. In SLAM-INTERSPEECH, 2013.
[26] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In INTERSPEECH, 2012.
[27] J. Poignant, G. Fortier, L. Besacier, and G. Quénot. Naming multi-modal clusters to identify persons in TV broadcast. MTAP, 2015.
[28] M. Rouvier, B. Favre, M. Bendris, D. Charlet, and G. Damnati. Scene understanding for identifying persons in TV shows: beyond face authentication. In CBMI, 2014.
[29] S. Satoh, Y. Nakamura, and T. Kanade. Name-It: Naming and Detecting Faces in News Videos. IEEE Multimedia, 6, 1999.
[30] S. E. Tranter. Who really spoke when? Finding speaker turns and identities in broadcast news audio. In ICASSP, 2006.
[31] M. Uřičář, V. Franc, and V. Hlaváč. Detector of facial landmarks learned by the structured output SVM. In VISAPP, volume 1, 2012.
[32] J. Yang and A. G. Hauptmann. Naming every individual in news video monologues. In ACM Multimedia, 2004.
[33] J. Yang, R. Yan, and A. G. Hauptmann. Multiple instance learning for labeling faces in broadcasting news video. In ACM Multimedia, 2005.