Multimodal Person Discovery in Broadcast TV at MediaEval 2016

Hervé Bredin, Claude Barras, Camille Guinaudeau
LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay, F-91405 Orsay, France.
firstname.lastname@limsi.fr

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2016 benchmarking initiative. Participants are asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people is not known a priori and their names have to be discovered in an unsupervised way from the media content, using text overlay or speech transcripts for the primary runs. The task is evaluated using information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.

1. MOTIVATION
TV archives maintained by national institutions such as the French INA, the Netherlands Institute for Sound & Vision, or the British Broadcasting Corporation are rapidly growing in size. The need for applications that make these archives searchable has led researchers to devote concerted effort to developing technologies that create indexes.

Indexes that represent the location and identity of people in the archive are indispensable for searching archives. Human nature leads people to be very interested in other people. However, when the content is created or broadcast, it is not always possible to predict which people will be the most important to find in the future, and biometric models may not yet be available at indexing time. The goal of this task is thus to address the challenge of indexing people in the archive under real-world conditions, i.e. when there is no pre-set list of people to index.

Started in 2011, the REPERE challenge aimed at supporting research on multimodal person recognition [3, 16]. Its main goal was to answer the two questions "who speaks when?" and "who appears when?" using any available source of information (including pre-existing biometric models and person names extracted from text overlay and speech transcripts). Thanks to this challenge and the associated multimodal corpus [13], significant progress was achieved in both supervised and unsupervised multimodal person recognition [1, 2, 4, 5, 6, 7, 11, 12, 17, 20, 21, 22, 24]. After the end of the REPERE challenge in 2014, the first edition of the "Multimodal Person Discovery in Broadcast TV" task was organized in 2015 [19]. This year's task is a follow-up of last year's edition.

2. DEFINITION OF THE TASK
Participants are provided with a collection of TV broadcast recordings pre-segmented into shots. Each shot s ∈ S has to be automatically tagged with the names of people both speaking and appearing at the same time during the shot.

Figure 1: For each shot, participants have to return the names of every speaking face. Each name has to be backed up by an evidence.

As last year, the list of persons is not provided a priori, and person biometric models (neither voice nor face) cannot be trained on external data in the primary runs. The only way to identify a person is by finding their name n ∈ N in the audio stream (e.g., using automatic speech transcription, ASR) or in the visual stream (e.g., using optical character recognition, OCR) and associating it with the correct person. This makes the task completely unsupervised (i.e. using algorithms that do not rely on pre-existing labels or biometric models). The main novelty of this year's task is that participants may use their contrastive runs to try brave new ideas that may rely on any external data, including the textual metadata provided with the test set.

Because person names are detected and transcribed automatically, they may contain transcription errors to a certain extent (more on that in Section 5). In the following, we denote by 𝒩 the set of all possible person names in the universe, correctly formatted as firstname_lastname, while N is the set of hypothesized names.
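To make the expected naming convention concrete, the following is a minimal sketch of how a raw name detected by OCR or ASR could be mapped to the firstname_lastname format; the function name and the exact normalization rules (lowercasing, accent stripping) are assumptions for illustration, not the official task preprocessing.

```python
import re
import unicodedata

def normalize_person_name(raw_name):
    """Map a raw detected name to the firstname_lastname convention.
    The rules below (accent stripping, lowercasing) are illustrative
    assumptions, not the official task preprocessing."""
    # Strip diacritics so that e.g. "Hervé" becomes "Herve".
    ascii_name = unicodedata.normalize("NFKD", raw_name).encode("ascii", "ignore").decode()
    # Lowercase, split on whitespace and join tokens with underscores.
    tokens = [t for t in re.split(r"\s+", ascii_name.strip().lower()) if t]
    return "_".join(tokens)

print(normalize_person_name("Hervé Bredin"))  # -> herve_bredin
```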
3. DATASETS
The 2015 test corpus serves as development set for this year's task. It contains 106 hours of video, corresponding to 172 editions of the evening broadcast news "Le 20 heures" of the French public channel "France 2", from January 1st, 2007 to June 30th, 2007. This development set is associated with a posteriori annotations based on last year's participants' submissions.

The test set is divided into three datasets: INA, DW and 3/24. The INA dataset contains a full week of broadcast for 3 TV channels and 3 radio channels in French. Only a subset (made of 2 TV channels for a total duration of 90 hours) needs to be processed. However, participants can process the rest of it if they think it might lead to improved results. Moreover, this dataset is associated with manual metadata provided by INA in the shape of CSV files. The DW dataset [14] is composed of videos downloaded from the Deutsche Welle website, in English and German, for a total duration of 50 hours. This dataset is also associated with metadata that can be used in contrastive runs. The last dataset contains 13 hours of broadcast from the 3/24 Catalan TV news channel.

As the test set comes completely free of any annotation, it will be annotated a posteriori based on participants' submissions. In order to ease this annotation process, participants are asked to justify their assertions. To this end, each hypothesized name n ∈ N has to be backed up by a carefully selected and unique shot proving that the person actually holds this name n: we call this an evidence. In real-world conditions, this evidence would help a human annotator double-check the automatically generated index, even for people they did not know beforehand.

Two types of evidence are allowed: an image evidence is a time in a video when a person is visible while his/her name is written on screen; an audio evidence is a time when the name of a person is pronounced, provided that this person is visible in a [time−5s, time+5s] neighborhood. For instance, in Figure 1, shot #1 contains an image evidence for Mr A (because his name and his face are visible simultaneously on screen) while shot #3 contains an audio evidence for Mrs B (because her name is pronounced less than 5 seconds before or after her face is visible on screen).
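As an illustration of the audio-evidence rule above, here is a minimal sketch of the ±5-second check; the data structures (a pronunciation timestamp in seconds and a list of (start, end) intervals during which the face is visible) are assumptions, and the official annotation tooling may implement this differently.

```python
def is_audio_evidence(name_time, visible_intervals, window=5.0):
    """Return True if the person whose name is pronounced at `name_time`
    (in seconds) is visible somewhere in [name_time - window, name_time + window].
    `visible_intervals` is an assumed list of (start, end) pairs in seconds."""
    lo, hi = name_time - window, name_time + window
    return any(start <= hi and end >= lo for start, end in visible_intervals)

# Toy example in the spirit of Figure 1: "Mrs B" is pronounced at t=120s
# and her face is visible from 123s to 130s, i.e. within 5 seconds.
print(is_audio_evidence(120.0, [(123.0, 130.0)]))  # True
```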
4. BASELINE AND METADATA
This task targets researchers from several communities including multimedia, computer vision, speech and natural language processing. Though the task is multimodal by design and necessitates expertise in various domains, the technological barrier to entry is lowered by the provision of a baseline system, partially available as open-source software. For instance, a researcher from the speech processing community can focus their research efforts on improving speaker diarization and automatic speech transcription, while still being able to rely on the provided face detection and tracking results to participate in the task. Figure 2 summarizes the available modules.

Figure 2: Multimodal baseline pipeline.

4.1 Video processing
Face tracking-by-detection is applied within each shot using a detector based on histograms of oriented gradients [9] and the correlation tracker proposed by Danelljan et al. [10]. Each face track is then described by its average FaceNet embedding [25] and compared with all the others using the Euclidean distance. Finally, average-link hierarchical agglomerative clustering is applied. Source code for this module is available in pyannote-video (http://pyannote.github.io).

Optical character recognition followed by name detection is contributed by IDIAP [8] and UPC. UPC detection was performed using LOOV [18]. Text results were then filtered using first and last names gathered from the internet and a hand-crafted list of negative words. Due to the large diversity of the test corpus, optical character recognition results are much noisier than the ones provided in 2015.

4.2 Audio processing
Speaker diarization and speech transcription for French, German and English are contributed by LIUM [23, 15]. Pronounced person names are automatically extracted from the audio stream using a large list of names gathered from the Wikipedia website.
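The face clustering step of the video pipeline can be sketched as follows. This is only an illustration of average-link agglomerative clustering over hypothetical FaceNet track embeddings using scipy, not the actual pyannote-video implementation, and the distance threshold is an arbitrary placeholder.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_face_tracks(track_embeddings, threshold=1.0):
    """Group face tracks by identity.
    `track_embeddings` is assumed to be an (n_tracks, 128) array whose rows are
    the average FaceNet embeddings of each face track; `threshold` is an
    arbitrary placeholder, not the value tuned for the official baseline."""
    distances = pdist(track_embeddings, metric="euclidean")  # condensed pairwise distances
    dendrogram = linkage(distances, method="average")        # average-link agglomerative clustering
    return fcluster(dendrogram, t=threshold, criterion="distance")  # one cluster label per track

# Toy usage with 4 random "embeddings": returns an array of 4 cluster labels.
print(cluster_face_tracks(np.random.rand(4, 128)))
```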
4.3 Multimodal fusion baseline
Three variants of the name propagation technique proposed in [21] are provided. Baseline 1 tags each speaker cluster with the most co-occurring written name. Baseline 2 tags each face cluster with the most co-occurring written name. Baseline 3 is the temporal intersection of both. These fusion techniques are available as open-source software (http://github.com/MediaEvalPersonDiscoveryTask).

5. EVALUATION METRIC
Because of the limited resources dedicated to collaborative annotation, the test set cannot be fully annotated. Therefore, the task is evaluated indirectly as an information retrieval task, using the following principle.

For each query q ∈ Q ⊂ 𝒩 (firstname_lastname), returned shots are first sorted by the edit distance between the hypothesized person name and the query q, and then by confidence scores. Average precision AP(q) is then computed classically, based on the list of relevant shots (according to the groundtruth) and the sorted list of shots. Finally, Mean Average Precision is computed as:

MAP = (1 / |Q|) · Σ_{q ∈ Q} AP(q)
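To make this protocol concrete, here is a minimal sketch of the per-query ranking and the MAP computation; the input structure (one list per query of (hypothesized_name, confidence, is_relevant) tuples), the use of plain Levenshtein distance, and the AP normalization are assumptions, and the official scoring tool may differ in its details.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def average_precision(ranked_relevance):
    """AP over a ranked list of booleans (True = relevant shot)."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, 1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(results):
    """`results` maps each query q to a list of returned shots, each shot being a
    (hypothesized_name, confidence, is_relevant) tuple -- an assumed structure.
    Shots are sorted by edit distance to the query, then by decreasing confidence."""
    ap = []
    for query, shots in results.items():
        ranked = sorted(shots, key=lambda s: (edit_distance(s[0], query), -s[1]))
        ap.append(average_precision([is_rel for _, _, is_rel in ranked]))
    return sum(ap) / len(ap) if ap else 0.0

# Toy example with a single query and three returned shots.
print(mean_average_precision({
    "herve_bredin": [("herve_bredin", 0.9, True),
                     ("herve_breden", 0.8, True),
                     ("claude_barras", 0.7, False)],
}))
```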
Acknowledgment
This work was supported by the French National Agency for Research under grants ANR-12-CHRI-0006-01 and ANR-14-CE24-0024. The open-source CAMOMILE collaborative annotation platform (http://github.com/camomile-project) was used extensively throughout the progress of the task: from the run submission script to the automated leaderboard, including the a posteriori collaborative annotation of the test corpus. The task builds on Johann Poignant's involvement in the 2015 task organization. Xavier Trimolet helped design and develop the 2016 annotation interface. We also thank INA, LIUM, UPC and IDIAP for providing datasets and baseline modules.

6. REFERENCES
[1] F. Bechet, M. Bendris, D. Charlet, G. Damnati, B. Favre, M. Rouvier, R. Auguste, B. Bigot, R. Dufour, C. Fredouille, G. Linarès, J. Martinet, G. Senay, and P. Tirilly. Multimodal Understanding for Person Recognition in Video Broadcasts. In INTERSPEECH, 2014.
[2] M. Bendris, B. Favre, D. Charlet, G. Damnati, R. Auguste, J. Martinet, and G. Senay. Unsupervised Face Identification in TV Content using Audio-Visual Sources. In CBMI, 2013.
[3] G. Bernard, O. Galibert, and J. Kahn. The First Official REPERE Evaluation. In SLAM-INTERSPEECH, 2013.
[4] H. Bredin, A. Laurent, A. Sarkar, V.-B. Le, S. Rosset, and C. Barras. Person Instance Graphs for Named Speaker Identification in TV Broadcast. In Odyssey, 2014.
[5] H. Bredin and J. Poignant. Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast. In INTERSPEECH, 2013.
[6] H. Bredin, J. Poignant, G. Fortier, M. Tapaswi, V.-B. Le, A. Sarkar, C. Barras, S. Rosset, A. Roy, Q. Yang, H. Gao, A. Mignon, J. Verbeek, L. Besacier, G. Quénot, H. K. Ekenel, and R. Stiefelhagen. QCompere at REPERE 2013. In SLAM-INTERSPEECH, 2013.
[7] H. Bredin, A. Roy, V.-B. Le, and C. Barras. Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast. IJMIR, 2014.
[8] D. Chen and J.-M. Odobez. Video text recognition using sequential monte carlo and error voting methods. Pattern Recognition Letters, 26(9):1386–1403, 2005.
[9] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 886–893, June 2005.
[10] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg. Accurate Scale Estimation for Robust Visual Tracking. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[11] B. Favre, G. Damnati, F. Béchet, M. Bendris, D. Charlet, R. Auguste, S. Ayache, B. Bigot, A. Delteil, R. Dufour, C. Fredouille, G. Linares, J. Martinet, G. Senay, and P. Tirilly. PERCOLI: a person identification system for the 2013 REPERE challenge. In SLAM-INTERSPEECH, 2013.
[12] P. Gay, G. Dupuy, C. Lailler, J.-M. Odobez, S. Meignier, and P. Deléglise. Comparison of Two Methods for Unsupervised Person Identification in TV Shows. In CBMI, 2014.
[13] A. Giraudel, M. Carré, V. Mapelli, J. Kahn, O. Galibert, and L. Quintard. The REPERE Corpus: a Multimodal Corpus for Person Recognition. In LREC, 2012.
[14] J. Grivolla, M. Melero, T. Badia, C. Cabulea, Y. Esteve, E. Herder, J.-M. Odobez, S. Preuss, and R. Marin. EUMSSI: a Platform for Multimodal Analysis and Recommendation using UIMA. In International Conference on Computational Linguistics (Coling), 2014.
[15] V. Gupta, P. Deléglise, G. Boulianne, Y. Estève, S. Meignier, and A. Rousseau. CRIM and LIUM approaches for multi-genre broadcast media transcription. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 681–686. IEEE, 2015.
[16] J. Kahn, O. Galibert, L. Quintard, M. Carré, A. Giraudel, and P. Joly. A presentation of the REPERE challenge. In CBMI, 2012.
[17] J. Poignant, L. Besacier, and G. Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE/ACM ASLP, 23(1), 2015.
[18] J. Poignant, L. Besacier, G. Quénot, and F. Thollard. From text detection in videos to person identification. In ICME, 2012.
[19] J. Poignant, H. Bredin, and C. Barras. Multimodal Person Discovery in Broadcast TV at MediaEval 2015. In MediaEval 2015, 2015.
[20] J. Poignant, H. Bredin, L. Besacier, G. Quénot, and C. Barras. Towards a better integration of written names for unsupervised speakers identification in videos. In SLAM-INTERSPEECH, 2013.
[21] J. Poignant, H. Bredin, V. Le, L. Besacier, C. Barras, and G. Quénot. Unsupervised speaker identification using overlaid texts in TV broadcast. In INTERSPEECH, 2012.
[22] J. Poignant, G. Fortier, L. Besacier, and G. Quénot. Naming multi-modal clusters to identify persons in TV broadcast. MTAP, 2015.
[23] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin, and S. Meignier. An open-source state-of-the-art toolbox for broadcast news diarization. In INTERSPEECH, Lyon (France), 25-29 Aug. 2013.
[24] M. Rouvier, B. Favre, M. Bendris, D. Charlet, and G. Damnati. Scene understanding for identifying persons in TV shows: beyond face authentication. In CBMI, 2014.
[25] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: a Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.