<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Person Discovery in Broadcast TV at MediaEval 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johann Poignant</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hervé Bredin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claude Barras</string-name>
        </contrib>
      </contrib-group>
      <aff>LIMSI - CNRS, Rue John Von Neumann, Orsay, France. firstname.lastname@limsi.fr</aff>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2015 benchmarking initiative. Participants were asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people was not known a priori, and their names had to be discovered in an unsupervised way from the media content, using text overlay or speech transcripts. The task was evaluated using information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>TV archives maintained by national institutions such as
the French INA, the Netherlands Institute for Sound &amp;
Vision, or the British Broadcasting Corporation are rapidly
growing in size. The need for applications that make these
archives searchable has led researchers to devote concerted
effort to developing technologies that create indexes.</p>
      <p>Indexes that represent the location and identity of
people in the archive are indispensable for searching archives,
since people are naturally very interested in other
people. However, when the content is created or broadcast,
it is not always possible to predict which people will be the
most important to find in the future. For this reason, it is
not possible to assume that biometric models will always be
available at indexing time. For some people, such a model
may not be available in advance, simply because they are not
(yet) famous. In such cases, it is also possible that archivists
annotating content by hand do not even know the name of
the person. The goal of this task is to address the challenge
of indexing people in the archive, under real-world
conditions (i.e. when there is no pre-set list of people to index).</p>
      <p>
        Canseco et al. [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] pioneered approaches relying on
pronounced names instead of biometric models for speaker
identification [
        <xref ref-type="bibr" rid="ref13 ref19 ref22 ref30">13, 19, 22, 30</xref>
        ]. However, due to relatively high
speech transcription and named entity detection error rates,
these audio-only approaches did not achieve sufficient
identification performance. Similarly, for face recognition,
initial visual-only approaches based on overlaid title box
transcriptions were very dependent on the quality of overlaid
name transcription [
        <xref ref-type="bibr" rid="ref18 ref29 ref32 ref33">18, 29, 32, 33</xref>
        ].
      </p>
      <p>
        Started in 2011, the REPERE challenge aimed at
supporting research on multimodal person recognition [
        <xref ref-type="bibr" rid="ref20 ref3">3, 20</xref>
        ]
to overcome the limitations of monomodal approaches. Its
main goal was to answer the two questions "who speaks
when?" and "who appears when?" using any available source
of information (including pre-existing biometric models and
person names extracted from text overlay and speech
transcripts). To assess the technology progress, annual
evaluations were organized in 2012, 2013 and 2014. Thanks to this
challenge and the associated multimodal corpus [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
significant progress was achieved in both supervised and
unsupervised multimodal person recognition [
        <xref ref-type="bibr" rid="ref1 ref14 ref15 ref2 ref23 ref25 ref26 ref27 ref28 ref4 ref5 ref6 ref7">1, 2, 4, 5, 6, 7, 14, 15,
23, 25, 26, 27, 28</xref>
        ]. The REPERE challenge came to an end
in 2014, and this task can be seen as a follow-up campaign
with a strong focus on unsupervised person recognition.
      </p>
    </sec>
    <sec id="sec-2">
      <title>DEFINITION OF THE TASK</title>
      <p>Participants were provided with a collection of TV
broadcast recordings pre-segmented into shots. Each shot s ∈ S
had to be automatically tagged with the names of people
both speaking and appearing at the same time during the
shot: this tagging algorithm is denoted by L : S → P(N) in
the rest of the paper. The main novelty of the task is that
the list of persons was not provided a priori, and person
biometric models (neither voice nor face) could not be trained
on external data. The only way to identify a person was by
finding their name n ∈ N in the audio stream (e.g. using automatic
speech recognition, ASR) or the visual stream (e.g. using optical
character recognition, OCR) and associating it with the
correct person. This made the task completely unsupervised
(i.e. the algorithms could not rely on pre-existing labels or
biometric models).</p>
      <p>Because person names were detected and transcribed
automatically, they could contain transcription errors to a
certain extent (more on that in Section 5). In the
following, we denote by 𝒩 the set of all possible person names in
the universe, correctly formatted as firstname_lastname,
while N ⊆ 𝒩 is the set of hypothesized names.</p>
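      <p>As an illustration, the firstname_lastname formatting can be obtained with a short normalization step, a minimal sketch (the helper name is ours, not part of the task definition):</p>
      <p>
```python
def normalize_name(raw_name):
    """Format a transcribed person name as firstname_lastname."""
    # lowercase, collapse whitespace, join tokens with underscores
    return "_".join(raw_name.strip().lower().split())

# e.g. normalize_name("Claude  Barras") returns "claude_barras"
```
      </p>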
      <p>[Figure 1: example of the task input and output across shots #1 to #4.
Legend: speech transcript, text overlay, speaking face, evidence.]</p>
      <p>To ensure that participants followed this strict "no
biometric supervision" constraint, each hypothesized name n ∈ N
had to be backed up by a carefully selected and unique shot
proving that the person actually holds this name n: we
call this an evidence and denote it by E : N → S. In
real-world conditions, this evidence would help a human
annotator double-check the automatically generated index, even
for people they did not know beforehand.</p>
      <p>Two types of evidence were allowed: an image evidence
is a shot during which a person is visible and their name is
written on screen; an audio evidence is a shot during which a
person is visible and their name is pronounced at least once
within a [shot start time − 5s, shot end time + 5s]
neighborhood. For instance, in Figure 1, shot #1 is an image
evidence for Mr A (because his name and his face are visible
simultaneously on screen) while shot #3 is an audio
evidence for Mrs B (because her name is pronounced less than
5 seconds before or after her face is visible on screen).</p>
    </sec>
    <sec id="sec-3">
      <title>DATASETS</title>
      <p>The REPERE corpus, distributed by ELDA, served
as the development set. It is composed of various TV shows
(covering news, politics and celebrities) from two French TV
channels, for a total of 137 hours. A subset of 50 hours is
manually annotated. Audio annotations are dense and
provide speech transcripts and identity-labeled speech turns.
Video annotations are sparse (one image every 10 seconds)
and provide overlaid text transcripts and identity-labeled
face segmentation. Both speech and overlaid text transcripts
are tagged with named entities. The test set, distributed
by INA, contains 106 hours of video, corresponding to 172
editions of the evening broadcast news "Le 20 heures" of the
French public channel "France 2", from January 1st, 2007 to
June 30th, 2007.</p>
      <p>As the test set came completely free of any annotation, it
was annotated a posteriori based on participants'
submissions. In the following, the task groundtruth is denoted by
the function L : S → P(N) that maps each shot s to the set
of names of every speaking face it contains, and the function
E : S → P(N) that maps each shot s to the set of person
names for which it actually is an evidence.</p>
    </sec>
    <sec id="sec-4">
      <title>BASELINE AND METADATA</title>
      <p>This task targeted researchers from several communities,
including multimedia, computer vision, speech and natural
language processing. Though the task was multimodal by
design and necessitated expertise in various domains, the
technological barrier to entry was lowered by the
provision of a baseline system, described in Figure 2 and available
as open-source software1. For instance, a researcher from
the speech processing community could focus their research
efforts on improving speaker diarization and automatic speech
transcription, while still being able to rely on the provided face
detection and tracking results to participate in the task.</p>
      <p>
        The audio stream was segmented into speech turns, while
faces were detected and tracked in the visual stream. Speech
turns (resp. face tracks) were then compared and
clustered based on MFCC and the Bayesian Information
Criterion [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (resp. HOG [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Logistic Discriminant
Metric Learning [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] on facial landmarks [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]). The approach
proposed in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] was also used to compute a probabilistic
mapping between co-occurring faces and speech turns.
Written (resp. pronounced) person names were automatically
extracted from the visual stream (resp. the audio stream)
using open source LOOV Optical Character Recognition [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
(resp. Automatic Speech Recognition [
        <xref ref-type="bibr" rid="ref12 ref21">12, 21</xref>
        ]) followed by
Named Entity detection (NE). The fusion module was a
two-step algorithm: propagation of written names onto speaker
clusters [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] followed by propagation of speaker names onto
co-occurring speaking faces.
      </p>
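      <p>The first fusion step, propagating written names onto speaker clusters, can be approximated by assigning each cluster the overlaid name whose on-screen time overlaps its speech turns the most. This is a minimal sketch under our own data layout, not the actual baseline code:</p>
      <p>
```python
def propagate_written_names(speaker_turns, written_names):
    """Tag each speaker cluster with its most co-occurring written name.

    speaker_turns: dict mapping cluster id to a list of (start, end) speech turns
    written_names: list of (start, end, name) overlaid-name segments
    Returns a dict mapping cluster id to the best name, or None if no overlap.
    """
    labels = {}
    for cluster, turns in speaker_turns.items():
        scores = {}
        for t_start, t_end in turns:
            for n_start, n_end, name in written_names:
                # duration during which this name is on screen while the cluster speaks
                overlap = max(0.0, min(t_end, n_end) - max(t_start, n_start))
                scores[name] = scores.get(name, 0.0) + overlap
        best = max(scores, key=scores.get) if scores else None
        labels[cluster] = best if scores.get(best, 0.0) > 0.0 else None
    return labels
```
      </p>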
    </sec>
    <sec id="sec-5">
      <title>EVALUATION METRIC</title>
      <p>This information retrieval task was evaluated using a
variant of Mean Average Precision (MAP) that took the
quality of evidences into account. For each query q ∈ Q ⊆ 𝒩
(formatted as firstname_lastname), the hypothesized person name n(q)
with the highest Levenshtein ratio ρ to the query q is
selected (ρ : 𝒩 × N → [0, 1]), allowing approximate name
transcription:</p>
      <p>n(q) = argmax over n ∈ N of ρ(q, n), with ρ(q) = ρ(q, n(q))</p>
      <p>Average precision AP(q) is then computed classically based
on relevant and returned shots:</p>
      <p>relevant(q) = {s ∈ S | q ∈ L(s)}</p>
      <p>returned(q) = {s ∈ S | n(q) ∈ L(s)}, sorted by confidence</p>
      <p>A proposed evidence is correct if the name n(q) is close enough to
the query q and if the shot E(n(q)) actually is an evidence for q:</p>
      <p>C(q) = 1 if ρ(q) &gt; 0.95 and q ∈ E(E(n(q))), and C(q) = 0 otherwise</p>
      <p>To ensure that participants provide a correct evidence for every
hypothesized name n ∈ N, the standard MAP is altered into
EwMAP (Evidence-weighted Mean Average Precision), the
official metric for the task:</p>
      <p>EwMAP = (1 / |Q|) · Σ over q ∈ Q of C(q) · AP(q)</p>
      <p>Acknowledgment. This work was supported by the French
National Agency for Research under grant ANR-12-CHRI-0006-01.
The open source CAMOMILE collaborative
annotation platform2 was used extensively throughout the progress
of the task: from the run submission script to the automated
leaderboard, including a posteriori collaborative annotation
of the test corpus. We thank ELDA and INA for supporting
the task by distributing the development and test datasets.</p>
      <p>1 http://github.com/MediaEvalPersonDiscoveryTask</p>
      <p>2 http://github.com/camomile-project</p>
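      <p>The metric can be prototyped in a few lines. The sketch below follows the definitions above but substitutes Python's difflib similarity ratio for the Levenshtein ratio ρ and assumes our own data layout, so it is an illustration rather than the official scorer:</p>
      <p>
```python
from difflib import SequenceMatcher

def name_ratio(a, b):
    # stand-in for the Levenshtein ratio rho, valued in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

def average_precision(returned, relevant):
    # classic AP over the ranked list of returned shots
    hits, precisions = 0, []
    for rank, shot in enumerate(returned, start=1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def ewmap(queries, hypotheses, gt_labels, hyp_evidence, gt_evidence):
    """Evidence-weighted MAP.

    hypotheses: dict shot id to list of (name, confidence) tags
    gt_labels: groundtruth L, dict shot id to set of names
    hyp_evidence: hypothesized evidence E, dict name to shot id
    gt_evidence: groundtruth evidence, dict shot id to set of names
    """
    total = 0.0
    for q in queries:
        # closest hypothesized name to the query
        n_q = max(hyp_evidence, key=lambda n: name_ratio(q, n))
        rho_q = name_ratio(q, n_q)
        relevant = {s for s, names in gt_labels.items() if q in names}
        ranked = sorted(((conf, s) for s, tags in hypotheses.items()
                         for name, conf in tags if name == n_q), reverse=True)
        returned = [s for conf, s in ranked]
        correct = 1.0 if rho_q > 0.95 and q in gt_evidence.get(hyp_evidence[n_q], set()) else 0.0
        total += correct * average_precision(returned, relevant)
    return total / len(queries)
```
      </p>
      <p>Note how a wrong evidence shot zeroes out a query's contribution even when its ranked shot list is perfect, which is exactly the incentive the EwMAP weighting is designed to create.</p>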
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bechet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Linares,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , G. Senay, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirilly</surname>
          </string-name>
          .
          <article-title>Multimodal Understanding for Person Recognition in Video Broadcasts</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          , G. Damnati,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Senay</surname>
          </string-name>
          .
          <article-title>Unsupervised Face Identification in TV Content using Audio-Visual Sources</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          .
          <article-title>The First Official REPERE Evaluation</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Person Instance Graphs for Named Speaker Identification in TV Broadcast</article-title>
          .
          <source>In Odyssey</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          .
          <article-title>Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          , G. Fortier,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tapaswi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mignon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Ekenel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Stiefelhagen</surname>
          </string-name>
          .
          <article-title>QCompere at REPERE 2013</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast</article-title>
          .
          <source>In IJMIR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Canseco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          .
          <article-title>A Comparative Study Using Manual and Automatic Transcriptions for Diarization</article-title>
          .
          <source>In ASRU</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Canseco-Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          .
          <article-title>Speaker diarization from speech transcripts</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Gopalakrishnan</surname>
          </string-name>
          .
          <article-title>Speaker, Environment And Channel Change Detection And Clustering Via The Bayesian Information Criterion</article-title>
          .
          <source>In DARPA Broadcast News Trans. and Under. Workshop</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dinarelli</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          .
          <article-title>Models Cascade for Tree-Structured Named Entity Detection</article-title>
          .
          <source>In IJCNLP</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deleglise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mauclair</surname>
          </string-name>
          .
          <article-title>Extracting true speaker identities from transcriptions</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bechet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ayache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Delteil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Linares,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , G. Senay, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirilly</surname>
          </string-name>
          .
          <article-title>PERCOLI: a person identification system for the 2013 REPERE challenge</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          , G. Dupuy,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lailler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Odobez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Deleglise</surname>
          </string-name>
          .
          <article-title>Comparison of Two Methods for Unsupervised Person Identification in TV Shows</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Giraudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mapelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Quintard</surname>
          </string-name>
          .
          <article-title>The REPERE Corpus : a Multimodal Corpus for Person Recognition</article-title>
          .
          <source>In LREC</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Guillaumin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mensink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Face recognition from caption-based supervision</article-title>
          .
          <source>IJCV</source>
          ,
          <volume>96</volume>
          (
          <issue>1</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Houghton</surname>
          </string-name>
          .
          <article-title>Named Faces: Putting Names to Faces</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>14</volume>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V.</given-names>
            <surname>Jousse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petit-Renaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Jacquin</surname>
          </string-name>
          .
          <article-title>Automatic named identification of speakers using diarization and ASR systems</article-title>
          .
          <source>In ICASSP</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Quintard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giraudel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <article-title>A presentation of the REPERE challenge</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Courcinous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Despres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Josse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kilgour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kraft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nussbaum-Thom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Oparin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schlippe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schlüter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schultz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stüker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundermeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vieru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Waibel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Woehrling</surname>
          </string-name>
          .
          <article-title>Speech Recognition for Machine Translation in Quaero</article-title>
          .
          <source>In IWSLT</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mauclair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          .
          <article-title>Speaker diarization: about whom the speaker is talking?</article-title>
          .
          <source>In Odyssey</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised Speaker Identification in TV Broadcast Based on Written Names</article-title>
          .
          <source>IEEE/ACM ASLP</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          .
          <source>In ICME</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Towards a better integration of written names for unsupervised speakers identification in videos</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification using overlaid texts in TV broadcast</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Naming multi-modal clusters to identify persons in TV broadcast</article-title>
          .
          <source>MTAP</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          .
          <article-title>Scene understanding for identifying persons in TV shows: beyond face authentication</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satoh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Kanade</surname>
          </string-name>
          .
          <article-title>Name-It: Naming and Detecting Faces in News Videos</article-title>
          .
          <source>IEEE Multimedia</source>
          ,
          <volume>6</volume>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Tranter</surname>
          </string-name>
          .
          <article-title>Who Really Spoke When? Finding Speaker Turns and Identities in Broadcast News Audio</article-title>
          .
          <source>In ICASSP</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Uricar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Franc</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Hlavac</surname>
          </string-name>
          .
          <article-title>Detector of facial landmarks learned by the structured output SVM</article-title>
          .
          <source>In VISAPP</source>
          , volume
          <volume>1</volume>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Hauptmann</surname>
          </string-name>
          .
          <article-title>Naming every individual in news video monologues</article-title>
          .
          <source>In ACM Multimedia</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Hauptmann</surname>
          </string-name>
          .
          <article-title>Multiple instance learning for labeling faces in broadcasting news video</article-title>
          .
          <source>In ACM Multimedia</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>