<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Person Discovery in Broadcast TV at MediaEval 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hervé Bredin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claude Barras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camille Guinaudeau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay</institution>
          ,
          <addr-line>F-91405 Orsay</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>We describe the "Multimodal Person Discovery in Broadcast TV" task of the MediaEval 2016 benchmarking initiative. Participants are asked to return the names of people who can be both seen and heard in every shot of a collection of videos. The list of people is not known a priori and their names have to be discovered in an unsupervised way from the media content, using text overlay or speech transcripts for the primary runs. The task is evaluated using information retrieval metrics, based on an a posteriori collaborative annotation of the test corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>MOTIVATION</title>
      <p>TV archives maintained by national institutions such as
the French INA, the Netherlands Institute for Sound &amp;
Vision, or the British Broadcasting Corporation are rapidly
growing in size. The need for applications that make these
archives searchable has led researchers to devote concerted
effort to developing technologies that create indexes.</p>
      <p>Indexes that represent the location and identity of people
in the archive are indispensable for searching archives.
Human nature leads people to be very interested in other
people. However, when the content is created or broadcast,
it is not always possible to predict which people will be the
most important to find in the future, and biometric models
may not yet be available at indexing time. The goal of this
task is thus to address the challenge of indexing people in
the archive under real-world conditions, i.e. when there is
no pre-set list of people to index.</p>
      <p>
        Started in 2011, the REPERE challenge aimed at
supporting research on multimodal person recognition [
        <xref ref-type="bibr" rid="ref16 ref3">3, 16</xref>
        ].
Its main goal was to answer the two questions "who speaks
when?" and "who appears when?" using any available source
of information (including pre-existing biometric models and
person names extracted from text overlay and speech
transcripts). Thanks to this challenge and the associated
multimodal corpus [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], significant progress was achieved in
both supervised and unsupervised multimodal person
recognition [
        <xref ref-type="bibr" rid="ref1 ref11 ref12 ref17 ref2 ref20 ref21 ref22 ref24 ref4 ref5 ref6 ref7">1, 2, 4, 5, 6, 7, 11, 12, 17, 20, 21, 22, 24</xref>
        ]. After the end
of the REPERE challenge in 2014, the first edition of the
"Multimodal Person Discovery in Broadcast TV" task was
organized in 2015 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This year's task is a follow-up of last
year's edition.
      </p>
    </sec>
    <sec id="sec-2">
      <title>DEFINITION OF THE TASK</title>
      <p>Participants are provided with a collection of TV
broadcast recordings pre-segmented into shots. Each shot s ∈ S
has to be automatically tagged with the names of people
both speaking and appearing at the same time during the
shot.</p>
      <p>As last year, the list of persons is not provided a priori,
and person biometric models (neither voice nor face) cannot
be trained on external data in the primary runs. The only
way to identify a person is by finding their name n ∈ 𝒩 in
the audio (e.g., using speech transcription, ASR) or visual
(e.g., using optical character recognition, OCR) streams
and associating it with the correct person. This makes the
task completely unsupervised (i.e. using algorithms not
relying on pre-existing labels or biometric models). The main
novelty of this year's task is that participants may use their
contrastive runs to try brave new ideas that may rely on any
external data, including textual metadata provided with the
test set.</p>
      <p>Because person names are detected and transcribed
automatically, they may contain transcription errors to a certain
extent (more on that later in Section 5). In the following, we
denote by 𝒩 the set of all possible person names in the
universe, correctly formatted as firstname_lastname, while
N is the set of hypothesized names.</p>
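      <p>Since hypothesized names must follow the firstname_lastname
convention, submissions typically include a normalization step. The
sketch below is a minimal Python example of such a step; the exact
normalization rules are not prescribed by the task, so this helper
(including its handling of accents) is only an assumption:</p>
      <preformat>
import re
import unicodedata

def normalize_name(raw: str) -> str:
    """Hypothetical helper: format a detected name as firstname_lastname."""
    # Strip accents so that accented characters map to plain ASCII.
    ascii_name = unicodedata.normalize("NFKD", raw)
    ascii_name = ascii_name.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then collapse every non-alphanumeric run into one underscore.
    tokens = re.split(r"[^a-z0-9]+", ascii_name.lower())
    return "_".join(t for t in tokens if t)

assert normalize_name("Hervé Bredin") == "herve_bredin"
      </preformat>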
      <p>[Figure 1: example of task input and output over four shots
involving two persons, Mr A and Mrs B. Legend: speech transcript,
text overlay, speaking face, evidence; shots #1 to #4.]</p>
    </sec>
    <sec id="sec-3">
      <title>DATASETS</title>
      <p>The 2015 test corpus serves as development set for this
year's task. It contains 106 hours of video, corresponding to
172 editions of the evening news broadcast "Le 20 heures" of the
French public channel "France 2", from January 1st, 2007 to
June 30th, 2007. This development set is associated with a
posteriori annotations based on last year's participant
submissions.</p>
      <p>
        The test set is divided into three datasets: INA, DW and
3-24. The INA dataset contains a full week of broadcast
for 3 TV channels and 3 radio channels in French. Only a
subset (made of 2 TV channels, for a total duration
of 90 hours) needs to be processed, but participants
may process the rest of it if they think it might lead to
improved results. Moreover, this dataset comes with
manual metadata provided by INA as CSV files.
The DW dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is composed of videos downloaded from
the Deutsche Welle website, in English and German, for a total
duration of 50 hours. This dataset is also associated with
metadata that can be used in contrastive runs. The last
dataset contains 13 hours of broadcast from the Catalan
TV news channel 3/24.
      </p>
      <p>As the test set comes completely free of any annotation,
it will be annotated a posteriori based on participants'
submissions. In order to ease this annotation process,
participants are asked to justify their assertions. To this end,
each hypothesized name n ∈ N has to be backed up by a
carefully selected and unique shot proving that the
person actually holds this name n: we call this an evidence.
In real-world conditions, this evidence would help a human
annotator double-check the automatically-generated index,
even for people they did not know beforehand.</p>
      <p>Two types of evidence are allowed: an image evidence is a
time in a video when a person is visible while his/her name
is written on screen; an audio evidence is a time when the
name of a person is pronounced, provided that this person is
visible within a [time - 5s; time + 5s] neighborhood. For instance,
in Figure 1, shot #1 contains an image evidence for Mr A
(because his name and his face are visible simultaneously on
screen) while shot #3 contains an audio evidence for Mrs B
(because her name is pronounced less than 5 seconds before
or after her face is visible on screen).</p>
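      <p>A minimal Python sketch of the audio-evidence rule defined
above. The interval-based input format is an assumption made for
illustration; the 5-second margin is the one from the definition:</p>
      <preformat>
def is_audio_evidence(name_time, face_intervals, margin=5.0):
    """Return True if a name pronounced at name_time (in seconds) counts
    as audio evidence, i.e. the person's face is visible somewhere within
    the [name_time - margin, name_time + margin] neighborhood.

    face_intervals: hypothetical list of (start, end) times, in seconds,
    during which the person's face is visible on screen.
    """
    return any(start - margin &lt;= name_time &lt;= end + margin
               for start, end in face_intervals)

# Shot #3 in Figure 1: Mrs B's name is pronounced shortly before her face
# appears, so the following (illustrative) timings qualify as evidence.
assert is_audio_evidence(42.0, [(45.0, 49.0)])
      </preformat>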
    </sec>
    <sec id="sec-4">
      <title>BASELINE AND METADATA</title>
      <p>This task targets researchers from several communities,
including multimedia, computer vision, speech and natural
language processing. Though the task is multimodal by
design and necessitates expertise in various domains, the
technological barrier to entry is lowered by the provision of a
baseline system, partially available as open-source software.</p>
      <p>
        For instance, a researcher from the speech processing
community can focus their research efforts on improving speaker
diarization and automatic speech transcription, while still
being able to rely on the provided face detection and tracking
results to participate in the task. Figure 2 summarizes the
available modules.
      </p>
      <p>
        Face tracking-by-detection is applied within each shot
using a detector based on histograms of oriented gradients [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and the correlation tracker proposed by Danelljan et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Each face track is then described by its average FaceNet
embedding and compared with all the others using the Euclidean
distance [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Finally, average-link hierarchical
agglomerative clustering is applied. Source code for this module is
available in pyannote-video (http://pyannote.github.io).
      </p>
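      <p>A minimal sketch of this clustering step, assuming
precomputed 128-dimensional average FaceNet embeddings and using
SciPy in place of the actual pyannote-video implementation (the
distance threshold is a tunable assumption):</p>
      <preformat>
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_face_tracks(track_embeddings, threshold=1.0):
    """Group face tracks by identity with average-link agglomerative
    clustering over Euclidean distances between average embeddings."""
    # Build the average-link dendrogram from the embedding matrix ...
    Z = linkage(track_embeddings, method="average", metric="euclidean")
    # ... then cut it at the chosen distance threshold to obtain labels.
    return fcluster(Z, t=threshold, criterion="distance")

# Usage with random placeholder embeddings, one row per face track:
labels = cluster_face_tracks(np.random.rand(10, 128))
      </preformat>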
      <p>
        Optical character recognition followed by name detection
is contributed by IDIAP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and UPC. UPC detection was
performed using LOOV [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Text results were then filtered
using first and last names gathered from the Internet and a
hand-crafted list of negative words. Due to the large
diversity of the test corpus, optical character recognition results
are much noisier than the ones provided in 2015.
      </p>
      <p>
        Three variants of the name propagation technique
proposed in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] are provided. Baseline 1 tags each speaker
cluster with the most co-occurring written name. Baseline 2 tags
each face cluster with the most co-occurring written name.
Baseline 3 is the temporal intersection of both. These
fusion techniques are available as open-source software
(http://github.com/MediaEvalPersonDiscoveryTask).
      </p>
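      <p>The sketch below illustrates Baselines 1 and 2, which share the
same logic over different clusterings. The pair-based input format is
a hypothetical simplification of the actual baseline code:</p>
      <preformat>
from collections import Counter

def tag_clusters(cluster_of, cooccurrences):
    """Tag each (speaker or face) cluster with the written name that
    co-occurs with it most often.

    cluster_of: dict mapping a track id to its cluster id.
    cooccurrences: list of (track_id, written_name) pairs, one per
        co-occurrence of a written name with a track.
    """
    counts = {}  # cluster id -> Counter of co-occurring written names
    for track, name in cooccurrences:
        counts.setdefault(cluster_of[track], Counter())[name] += 1
    return {cluster: counter.most_common(1)[0][0]
            for cluster, counter in counts.items()}
      </preformat>
      <p>Baseline 3 then keeps, for each shot, only the names on which
the speaker-based and face-based taggings temporally agree.</p>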
    </sec>
    <sec id="sec-5">
      <title>EVALUATION METRIC</title>
      <p>Because of the limited resources dedicated to collaborative
annotation, the test set cannot be fully annotated. Therefore,
the task is evaluated indirectly as an information retrieval
task, using the following principle.</p>
      <p>For each query q ∈ Q ⊂ 𝒩 (firstname_lastname),
returned shots are first sorted by the edit distance between the
hypothesized person name and the query q, and then by
confidence scores. Average precision AP(q) is then computed
classically based on the list of relevant shots (according to
the groundtruth) and the sorted list of shots. Finally, Mean
Average Precision is computed as follows:</p>
      <p>MAP = (1 / |Q|) ∑_{q ∈ Q} AP(q)</p>
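      <p>A Python sketch of this scoring procedure. The per-query
submission format (shot, hypothesized name, confidence) and the use
of the editdistance package are assumptions for illustration; the
official scorer may differ in details such as tie-breaking:</p>
      <preformat>
import editdistance  # assumption: any Levenshtein implementation works

def average_precision(ranked_shots, relevant):
    """Classical average precision of a ranked shot list for one query."""
    hits, score = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(queries, submissions, groundtruth):
    """MAP = (1/|Q|) * sum of AP(q) over all queries q."""
    total = 0.0
    for q in queries:
        # Sort by edit distance to the query, then by decreasing confidence.
        ranked = sorted(submissions[q],
                        key=lambda x: (editdistance.eval(x[1], q), -x[2]))
        total += average_precision([shot for shot, _, _ in ranked],
                                   groundtruth[q])
    return total / len(queries)
      </preformat>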
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>This work was supported by the French National Agency
for Research under grants ANR-12-CHRI-0006-01 and
ANR-14-CE24-0024. The open-source CAMOMILE collaborative
annotation platform (http://github.com/camomile-project) was
used extensively throughout the
progress of the task: from the run submission script to the
automated leaderboard, including a posteriori collaborative
annotation of the test corpus. The task builds on Johann
Poignant's involvement in the 2015 task organization. Xavier
Trimolet helped design and develop the 2016 annotation
interface. We also thank INA, LIUM, UPC and IDIAP for
providing datasets and baseline modules.</p>
    </sec>
    <sec id="sec-7">
      <title>Video processing</title>
      <p>
        Face tracking-by-detection is applied within each shot
using a detector based on histogram of oriented gradients [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
1http://pyannote.github.io
2http://github.com/MediaEvalPersonDiscoveryTask
3http://github.com/camomile-project
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bechet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Linares,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , G. Senay, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirilly</surname>
          </string-name>
          .
          <article-title>Multimodal Understanding for Person Recognition in Video Broadcasts</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          , G. Damnati,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Senay</surname>
          </string-name>
          .
          <article-title>Unsupervised Face Identification in TV Content using Audio-Visual Sources</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          .
          <article-title>The First Official REPERE Evaluation</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          , V.
          <string-name>
            <surname>-B. Le</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Rosset</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Person Instance Graphs for Named Speaker Identification in TV Broadcast</article-title>
          . In Odyssey,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          .
          <article-title>Integer Linear Programming for Speaker Diarization and Cross-Modal Identification in TV Broadcast</article-title>
          . In INTERSPEECH,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          , G. Fortier,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tapaswi</surname>
          </string-name>
          , V.
          <string-name>
            <surname>-B. Le</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Barras</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Rosset</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mignon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Verbeek</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Besacier</surname>
            , G. Quenot,
            <given-names>H. K.</given-names>
          </string-name>
          <string-name>
            <surname>Ekenel</surname>
            , and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Stiefelhagen</surname>
          </string-name>
          . QCompere at REPERE 2013. In SLAM-INTERSPEECH,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          , V.
          <string-name>
            <surname>-B. Le</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast</article-title>
          .
          <source>In IJMIR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <surname>J.-M. Odobez</surname>
          </string-name>
          .
          <article-title>Video text recognition using sequential monte carlo and error voting methods</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>26</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1386</fpage>
          -
          <lpage>1403</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <article-title>Histograms of Oriented Gradients for Human Detection</article-title>
          .
          <source>In IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>886</fpage>
          -
          <lpage>893</lpage>
          , June
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Danelljan</surname>
          </string-name>
          , G. Hager, F. Shahbaz
          <string-name>
            <surname>Khan</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Felsberg</surname>
          </string-name>
          .
          <article-title>Accurate Scale Estimation for Robust Visual Tracking</article-title>
          .
          <source>In Proceedings of the British Machine Vision Conference</source>
          . BMVA Press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bechet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Auguste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ayache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Delteil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Linares,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinet</surname>
          </string-name>
          , G. Senay, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirilly</surname>
          </string-name>
          .
          <article-title>PERCOLI: a person identification system for the 2013 REPERE challenge</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          , G. Dupuy,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lailler</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Odobez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Meignier</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Deleglise</surname>
          </string-name>
          .
          <article-title>Comparison of Two Methods for Unsupervised Person Identification in TV Shows</article-title>
          . In CBMI,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Giraudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mapelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Quintard</surname>
          </string-name>
          .
          <article-title>The REPERE Corpus : a Multimodal Corpus for Person Recognition</article-title>
          .
          <source>In LREC</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Grivolla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Badia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cabulea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Herder</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Odobez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Preuss</surname>
            , and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Marin</surname>
          </string-name>
          .
          <article-title>EUMSSI: a Platform for Multimodal Analysis and Recommendation using UIMA</article-title>
          .
          <source>In International Conference on Computational Linguistics (Coling)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deleglise</surname>
          </string-name>
          , G. Boulianne,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          .
          <article-title>CRIM and LIUM approaches for multi-genre broadcast media transcription</article-title>
          .
          <source>In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          , pages
          <fpage>681</fpage>
          -
          <lpage>686</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Quintard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giraudel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <article-title>A presentation of the REPERE challenge</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised Speaker Identification in TV Broadcast Based on Written Names</article-title>
          . IEEE/ACM ASLP,
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          .
          <source>In ICME</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal Person Discovery in Broadcast TV at MediaEval 2015</article-title>
          .
          <source>In MediaEval</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Towards a better integration of written names for unsupervised speakers identification in videos</article-title>
          .
          <source>In SLAM-INTERSPEECH</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification using overlaid texts in TV broadcast</article-title>
          .
          <source>In INTERSPEECH</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Naming multi-modal clusters to identify persons in TV broadcast</article-title>
          .
          <source>MTAP</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          , G. Dupuy,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          , E. Khoury,
          <string-name>
            <given-names>T.</given-names>
            <surname>Merlin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          .
          <article-title>An open-source state-of-the-art toolbox for broadcast news diarization</article-title>
          . In Interspeech, Lyon (France),
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          Aug.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charlet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Damnati</surname>
          </string-name>
          .
          <article-title>Scene understanding for identifying persons in TV shows: beyond face authentication</article-title>
          .
          <source>In CBMI</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Philbin</surname>
          </string-name>
          .
          <article-title>FaceNet: a Unified Embedding for Face Recognition and Clustering</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>