<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EUMSSI team at the MediaEval Person Discovery Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nam Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Di Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sylvain Meignier</string-name>
          <email>sylvain.meignier@univ-lemans.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-Marc Odobez</string-name>
          <email>odobez@idiap.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>École Polytechnique Fédérale de Lausanne</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LIUM, University of Maine</institution>
          ,
          <addr-line>Le Mans</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>We present the results of the EUMSSI team's participation in the Multimodal Person Discovery task at the MediaEval 2015 challenge. The goal is to identify all people who simultaneously appear and speak in a video corpus, which implicitly involves both the audio and the visual streams. We emphasize improving each modality separately and benchmarking them to analyze their pros and cons.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Nowadays, viewers, journalists, and archivists have access
to a vast amount of multimedia data. The need for browsing
and retrieval tools for these archives has led researchers to
devote effort to developing technologies that create
searchable indices [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In this view, as humans are very interested
in other people while consuming multimedia content,
algorithms that index the identities of people and retrieve their
respective quotations are indispensable for searching archives.
This practical need leads to research problems on how to
identify the presence of people in videos and answer 'who appears
when?' or 'who speaks when?'.
      </p>
      <p>In particular, in the MediaEval Person Discovery task,
the goal is the following. Given the raw TV broadcasts,
each shot must be automatically tagged with the name(s) of
people who can be both seen as well as heard in the shot.
The list of people is not known a priori and their names
must be discovered in an unsupervised way from video text
overlay or speech transcripts. This situation corresponds to
cases where, at the moment content is created or broadcast,
some of the appearing people are relatively unknown but
may later become a trending topic on social networks or
search engines. In addition, to ensure high-quality indexes,
algorithms should also help human annotators double-check
these indexes by providing evidence of the claimed
identity (especially for people who are not yet famous).</p>
    </sec>
    <sec id="sec-2">
      <title>2. PROPOSED SYSTEM</title>
      <p>
        The participation of the EUMSSI team was intended to enable the
assessment of the different modules developed by the authors
in the past [
        <xref ref-type="bibr" rid="ref11 ref17 ref4 ref7 ref8">11, 7, 8, 17, 4</xref>
        ]. In this view, starting from the
baseline provided by the organizers, the goal was to replace
baseline components with the team's components, whenever
they had been made compatible and their processing speed
was sufficient for the data provided in the challenge,
and to test their performance to understand their advantages.
      </p>
      <p>The system, as illustrated in Fig. 1, consists of two main
stages. The first stage detects and clusters speakers, faces,
and overlaid person names, including extracting Named
Entities (NE). The second stage associates speakers with faces using
co-occurrence statistics, and the overlaid person names are
propagated to the speakers, or faces, in order to give the
identities of the persons in the show.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Speaker diarization</title>
      <p>
        The speaker diarization system ("who speaks when?") is
based on the LIUM Speaker Diarization system [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], freely
distributed (www-lium.univ-lemans.fr/en/content/liumspkdiarization).
This system achieved the best or second-best
results in the speaker diarization task of the REPERE
French broadcast evaluation campaigns of 2012 and 2013 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The diarization system is first composed of an acoustic
Bayesian Information Criterion (BIC)-based segmentation
followed by a BIC-based hierarchical clustering. Each
cluster represents a speaker and is modeled with a full-covariance
Gaussian. A Viterbi decoding re-segments the signal
using GMMs with 8 diagonal components, learned by
EM-ML for each cluster. Segmentation, clustering, and decoding
are performed with 12 MFCC+E features, computed at a 10 ms
frame rate. Music and jingle regions are removed using a
Viterbi decoding with 8 GMMs (trained on French
broadcast news data) for music, jingle, silence, and speech (with
wide/narrow-band variants for the last two, and clean,
noised, or musical-background variants for wide-band speech).</p>
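      <p>For illustration, the BIC merge test that drives the segmentation and hierarchical clustering steps can be sketched as follows. This is our own minimal sketch of the standard delta-BIC formula for two segments modeled by full-covariance Gaussians; the penalty weight lam and all variable names are illustrative assumptions, not the exact LIUM tuning. A negative value favors merging the two segments into one speaker cluster.</p>
      <preformat>
```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC between two feature segments (frames x dims), each
    modeled by a full-covariance Gaussian.  Negative values favor
    merging the two segments into a single cluster.  `lam` is an
    illustrative penalty weight, not the tuned system value."""
    n1, d = x1.shape
    n2, _ = x2.shape
    x = np.vstack([x1, x2])
    # log-determinant of the covariance under each hypothesis
    ld = np.linalg.slogdet(np.cov(x.T, bias=True))[1]
    ld1 = np.linalg.slogdet(np.cov(x1.T, bias=True))[1]
    ld2 = np.linalg.slogdet(np.cov(x2.T, bias=True))[1]
    # model-complexity penalty: mean vector plus full covariance parameters
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return 0.5 * ((n1 + n2) * ld - n1 * ld1 - n2 * ld2) - penalty
```
      </preformat>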
      <p>
        In the above steps, the features were used unnormalized in
order to preserve information about the background
environment, which may help differentiate between speakers. At
this point, however, each cluster contains the voice of only
one speaker, but several clusters can be related to the same
speaker. The background environment contribution must
be removed from each cluster GMM, through feature
gaussianization. Finally, the system is completed with a clustering
method based on the i-vector paradigm and Integer Linear
Programming (ILP). This new clustering method is fully
described in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The ILP clustering along with
i-vector speaker models gives better results than the usual
hierarchical agglomerative clustering based on GMMs and
cross-likelihood distances [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
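      <p>Under our reading of the ILP formulation of [17] (the notation here is ours, as a hedged sketch rather than the authors' exact statement): with binary variables x(g,s) indicating that segment s is assigned to cluster center g, d(g,s) the distance between the corresponding i-vectors, and a distance threshold, the clustering solves</p>
      <preformat>
```latex
\min \sum_{g} x_{g,g} + \frac{1}{\delta} \sum_{g}\sum_{s} d(g,s)\, x_{g,s}
\quad \text{s.t.} \quad
\sum_{g} x_{g,s} = 1 \;\; \forall s, \qquad
x_{g,s} \le x_{g,g}, \qquad
d(g,s)\, x_{g,s} \le \delta ,
```
      </preformat>
      <p>where the first term minimizes the number of selected cluster centers and the second keeps each segment close to its assigned center.</p>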
    </sec>
    <sec id="sec-4">
      <title>2.2 Face diarization</title>
      <p>Given the video shots, the face diarization process consists
of (i) face detection, detecting the faces appearing within each
shot; (ii) face tracking, extending detections into continuous
tracks within each shot; and (iii) face clustering, grouping
all tracks with the same identity into clusters.</p>
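      <p>The three steps can be wired together as a minimal sketch (the detect, track, and cluster callbacks are hypothetical placeholders for the components described below):</p>
      <preformat>
```python
def face_diarization(shots, detect, track, cluster):
    """Chain the three face-diarization steps: per-shot detection,
    within-shot tracking, then cross-shot identity clustering."""
    tracks = []
    for shot in shots:
        detections = detect(shot)         # (i) faces within the shot
        tracks.extend(track(detections))  # (ii) continuous within-shot tracks
    return cluster(tracks)                # (iii) identity clusters across shots
```
      </preformat>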
      <p>
        Face detection. Detecting faces in broadcast media can
be challenging due to the wide range of media content. Faces
can appear in widely different situations with varied
illumination and noise, such as in the studio, during live coverage, or
during political debates. To overcome these challenges, we
employ the deformable part-based model (DPM) [
        <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
        ], which
can detect faces at multiple poses and variations. Because
the main disadvantage of DPM is its long running time, the face
detector is only applied twice per second.
      </p>
      <p>
        Face tracking. The goal of this step is to create continuous
face tracks within one video shot, which raises the need for
associating individual detections. Because of long gaps between
detected faces, we exploit long-term connectivity using
CRF-based multi-target tracking [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This framework relies on
the unsupervised learning of time-sensitive association costs
for different features. First, similarities between detections
are computed based on low-level features (color histogram,
position, motion, SURF keypoint descriptors), which can be
computed quickly. Then, for each feature type, the
corresponding pairwise factor of the CRF is defined as the probability of
the similarity measurements between pairs of detections under
the two distinct hypotheses that they correspond to the same
label or not. By optimizing a graph labeling posterior, we
assign the same label to detections belonging to the same
face, and different labels to different faces.
      </p>
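      <p>A minimal sketch of such a time-sensitive pairwise factor follows. The Gaussian hypothesis densities and the gap-indexed model table are our own illustrative assumptions, not the learned distributions of [10]; a positive cost supports giving the two detections the same label.</p>
      <preformat>
```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def association_cost(sim, gap, models):
    """Hypothetical time-sensitive pairwise cost: the log-likelihood ratio
    of a feature similarity under the same-face vs different-face
    hypotheses, with hypothesis distributions indexed by the temporal gap
    between the two detections."""
    # fall back to the largest modeled gap not exceeding `gap`
    key = max(k for k in models if k <= gap)
    mu_s, sd_s, mu_d, sd_d = models[key]
    return math.log(gaussian_pdf(sim, mu_s, sd_s)) - math.log(gaussian_pdf(sim, mu_d, sd_d))
```
      </preformat>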
      <p>
        Face clustering. Given the face tracks across all video
shots, we hierarchically merge face tracks using
matching and biometric similarity measures [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Matching cluster
similarity is calculated from the average of the distances
between the sparse keypoints of two clusters. Meanwhile,
biometric model-based similarity measures how likely densely extracted
features from one cluster are to belong to the model of
the other cluster, as compared to the likelihood given by the
statistical model, and vice versa. Face tracks are first
clustered using only feature-based matching, yielding clusters
with sufficient data to adapt the biometric models. Then,
model-based similarity is combined with matching similarity
to merge clusters until stopping criteria are met. Similarly to
speaker diarization, face diarization produces face segments
during which distinct identities appear.
      </p>
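      <p>The greedy merging loop described above can be sketched as follows. This is our own simplification: the similarity callback stands in for the matching and combined model-based measures, and a single threshold plays the role of the stopping criteria.</p>
      <preformat>
```python
def agglomerate(tracks, similarity, threshold):
    """Greedy agglomerative clustering of face tracks: repeatedly merge
    the most similar pair of clusters, stopping once the best pairwise
    similarity drops below the threshold."""
    clusters = [[t] for t in tracks]
    while len(clusters) > 1:
        pairs = [(similarity(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best < threshold:
            break
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters
```
      </preformat>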
    </sec>
    <sec id="sec-5">
      <title>2.3 Person Naming</title>
      <p>
        Identity candidate retrieval. Overlaid person names (OPNs) can be more
reliably extracted using Optical Character Recognition (OCR)
techniques [
        <xref ref-type="bibr" rid="ref13 ref2">2, 13</xref>
        ] than from automatic speech transcripts.
Therefore, we only exploit named entities detected from the OCR output
by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as potential identity candidates.
      </p>
      <p>
        Direct one-to-one tagging. As mentioned earlier, our
goal is to benchmark improvements of each modality in the
system. Hence, we assume that the temporal
clusters of the diarization processes are trustworthy. In this
work, we use a simple one-to-one naming method provided
by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] which finds the mapping between clusters and named
entities.</p>
      <p>Table 1: Results on the REPERE test 2 development data.</p>
      <preformat>
Method    EwMAP  MAP    C
Baseline  78.35  78.64  92.71
FaceDia   83.04  83.33  90.77
SpkDia    89.75  90.14  97.05
SpkFace   89.53  89.90  96.52
      </preformat>
      <p>We evaluated 3 methods: SpkDia, FaceDia, and SpkFace.
In SpkDia (primary submission), we apply naming based
on audio information only (this is equivalent to assumption
that all speakers which are associated with a name are
visible and speaking). This is our primary submission for the
challenge. Second, in FaceDia, we apply naming based on
visual information only, and assume that all visible faces
(which are associated with a name) are talking. Third, in
SpkFace, we apply naming based on audio information only,
but validate if there exists visible faces during the speech
segments (if not, the segment is discarded). Because our
approaches are monomodal and fully unsupervised, we did
not use the information provided by leaderboard to improve
performance.</p>
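      <p>For illustration, a much-simplified stand-in for the one-to-one naming of [15] (the co-occurrence-duration input and the greedy selection order are our own assumptions): each OPN is mapped to the cluster it co-occurs with longest, and no name or cluster is used twice.</p>
      <preformat>
```python
def one_to_one_naming(cooccurrence):
    """Greedy one-to-one tagging: map each overlaid name to the
    diarization cluster it co-occurs with longest, using each name and
    each cluster at most once.  `cooccurrence` maps (name, cluster)
    pairs to co-occurrence durations."""
    pairs = sorted(((dur, name, cl) for (name, cl), dur in cooccurrence.items()),
                   reverse=True)
    mapping, used = {}, set()
    for dur, name, cl in pairs:
        if name not in used and cl not in mapping:
            mapping[cl] = name
            used.add(name)
    return mapping
```
      </preformat>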
      <p>
        The results using the challenge performance measures are
reported in Tab. 1 for the REPERE test 2 data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as the
initial development data, and in Tab. 2 for the challenge
testing part of the INA dataset. SpkDia is the most robust and
performs the best even without any face information, which
might be explained by two points. First, there is usually
only one speaker at a time, and not much noise, in the
challenge data. Meanwhile, face diarization can be difficult due
to multiple faces, facial variation, missed detections, etc.
Hence, speech clusters tend to be more reliable than face
clusters. Second, when a speaker is not visible, it is often
the anchor of the show, who is counted as one query equally
to those appearing for a short duration. Therefore, SpkDia
is not penalized much by the visibility of speakers. We can
observe this effect more in the last column of Tab. 2, which
shows the number of person presences with names predicted
by each scheme. Using faces to filter 1/3 of the speech segments
does not help to increase precision, because these segments
correspond to a small number of repetitive speakers. Also,
though face diarization gives only 1/3 of the possible names,
these names are precise person-wise. This interesting fact
may provide an outlook on combining the two modalities.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4. FUTURE WORKS</title>
      <p>We have presented our system for the MediaEval challenge.
The testing results serve as our basis for improving each
component. We are working on speeding up the tracking
process as well as investigating alternative face
representations, such as total variability modeling. On the other hand,
the current system has not taken full advantage of both audio
and visual streams, which we plan to address in future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          <article-title>. Multi-stage speaker diarization of broadcast news</article-title>
          .
          <volume>14</volume>
          (
          <issue>5</issue>
          ):
          <volume>1505</volume>
          –
          <fpage>1512</fpage>
          ,
          Feb
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <surname>J.-M. Odobez</surname>
          </string-name>
          .
          <article-title>Video text recognition using sequential monte carlo and error voting methods</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>26</volume>
          (
          <issue>9</issue>
          ):
          <volume>1386</volume>
          –
          <fpage>1403</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dinarelli</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          .
          <article-title>Models cascade for tree-structured named entity detection</article-title>
          .
          <source>In IJCNLP</source>
          , pages
          <volume>1269</volume>
          –
          <fpage>1278</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dupuy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deleglise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          .
          <article-title>Recent improvements towards ILP-based clustering for broadcast news speaker diarization</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Felzenszwalb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McAllester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          .
          <article-title>Object detection with discriminatively trained part-based models</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>32</volume>
          (
          <issue>9</issue>
          ):
          <volume>1627</volume>
          –
          <fpage>1645</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          and
          <string-name>
            <surname>J. Kahn.</surname>
          </string-name>
          <article-title>The first official REPERE evaluation</article-title>
          . In Interspeech satellite workshop on Speech,
          <article-title>Language and Audio in Multimedia (SLAM), Marseille</article-title>
          , France,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Khoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Odobez</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Deleglise. A Conditional Random</surname>
          </string-name>
          <article-title>Field approach for Audio-Visual people diarization</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP</source>
          <year>2014</year>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Khoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Odobez</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Deleglise</surname>
          </string-name>
          .
          <article-title>Face identification from overlaid texts using Local Face Recurrent Patterns and CRF models</article-title>
          .
          <source>In IEEE International Conference on Image Processing (ICIP)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Giraudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mapelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          , and
          <string-name>
            <surname>L. Quintard.</surname>
          </string-name>
          <article-title>The repere corpus : a multimodal corpus for person recognition</article-title>
          . In N. C. C. Chair),
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. U.</given-names>
            <surname>Dogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          , and S. Piperidis, editors,
          <source>Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)</source>
          , Istanbul, Turkey, may
          <year>2012</year>
          .
          <article-title>European Language Resources Association (ELRA).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Heili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lopez-Mendez</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Odobez</surname>
          </string-name>
          .
          <article-title>Exploiting long-term connectivity and visual motion in crf-based multi-person tracking</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          ,
          <volume>23</volume>
          (
          <issue>7</issue>
          ):
          <volume>3040</volume>
          –
          <fpage>3056</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Khoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          , and J.
          <string-name>
            <surname>-M. Odobez</surname>
          </string-name>
          .
          <article-title>Fusing matching and biometric similarity measures for face diarization in video</article-title>
          .
          <source>In Proceedings of the 3rd ACM conference on International conference on multimedia retrieval</source>
          , pages
          <volume>97</volume>
          –
          <fpage>104</fpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Benenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pedersoli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          .
          <article-title>Face detection without bells and whistles</article-title>
          .
          <source>In ECCV</source>
          , pages
          <volume>720</volume>
          –
          <fpage>735</fpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          , G. Quenot, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Thollard</surname>
          </string-name>
          .
          <article-title>From text detection in videos to person identification</article-title>
          .
          <source>In 2012 IEEE International Conference on Multimedia and Expo (ICME)</source>
          , pages
          <fpage>854</fpage>
          –
          <fpage>859</fpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          .
          <article-title>Multimodal person discovery in broadcast TV at MediaEval 2015</article-title>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Poignant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bredin</surname>
          </string-name>
          , V.
          <string-name>
            <surname>-B. Le</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Besacier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Barras</surname>
            , and
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quenot</surname>
          </string-name>
          .
          <article-title>Unsupervised speaker identification using overlaid texts in TV broadcast</article-title>
          .
          <source>In Interspeech, page 4p</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          , G. Dupuy,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gay</surname>
          </string-name>
          , E. Khoury,
          <string-name>
            <given-names>T.</given-names>
            <surname>Merlin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          .
          <article-title>An open-source state-of-the-art toolbox for broadcast news diarization</article-title>
          . In Interspeech, Lyon (France),
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          Aug.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          .
          <article-title>A global optimization framework for speaker diarization</article-title>
          .
          <source>In Odyssey Workshop</source>
          , Singapore,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>