<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An active learning method for speaker identity annotation in audio recordings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Broux Pierre-Alexandre</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Doukhan David</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petitrenaud Simon</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meignier Sylvain</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carrive Jean</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Manual annotation of speech is an expensive and time-consuming process. In this paper, we aim to assist an annotator performing speaker diarization, in the context of annotating a large volume of archives. We propose a method that reduces the number of human interventions by correcting the diarization as the annotator's corrections are made. Experiments are carried out on French broadcast TV shows drawn from the ANR-REPERE evaluation campaign. Our method is mainly evaluated in terms of KSR (Keystroke Saving Rate), and it reduces the number of actions needed to correct a speaker diarization output by 6.8% in absolute value.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The work presented in this paper was carried out to meet the needs
of the French national audiovisual institute<sup>3</sup> (INA). INA is a public
institution in charge of the digitization, preservation, distribution
and dissemination of the French audiovisual heritage. Annotations
related to speaker identity, together with speech transcription, serve
several use cases. The temporal localization of speaker interventions can
be used to enhance navigation within a media document [
        <xref ref-type="bibr" rid="ref12 ref22">12, 22</xref>
        ]. It may
also be used to perform complex queries within media databases [
        <xref ref-type="bibr" rid="ref11 ref19 ref5">5,
11, 19</xref>
        ].
      </p>
      <p>
        This article focuses on the design of human-assisted speaker
diarization systems. Speaker diarization consists of
estimating "who spoke when" in an audio stream [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This media structuring
process is an efficient pre-processing step, for instance to help
segment a news broadcast into anchor speech and reports before manual
documentation. Speaker diarization algorithms are
generally based on unsupervised machine learning methods [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], which
estimate the number of speakers and split the audio stream
into labelled speech segments assigned to hypothesized speakers.
Speaker identity and temporal localization are known to be
relevant information for accessing and exploiting speech recordings
[
        <xref ref-type="bibr" rid="ref20 ref5">5, 20</xref>
        ]. However, the accuracy of state-of-the-art automatic speaker
recognition methods is still too low for them to be embedded into INA’s
archiving or media enhancement applications, and human
intervention is required to obtain an optimal description of a speech archive.
      </p>
      <p>
        Manual annotation of speech is a very expensive process. Nine
hours are required to manually annotate
one hour of spontaneous speech (speech transcription and speaker
identity). Previous studies have shown that the speech annotation
process may be sped up by using the output of automatic speech
recognition (ASR) systems together with speech turn annotations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The
resulting annotation task consists of correcting the output of
automatic systems, instead of doing the whole annotation manually.
      </p>
      <p>1 Computer science laboratory of the University of Maine (LIUM - EA 4023),
Le Mans, France.
2 French National Audiovisual Institute (INA), Paris, France.
3 http://www.ina.fr</p>
      <p>
        The model proposed in this paper is an active-learning extension
of this paradigm, applied to the speaker diarization task.
Annotator corrections are used in real time to update the estimates of the
speaker diarization system. The aim of this update strategy is to lower
the number of manual corrections to be made, and hence the time
spent interacting with the system. The quality of the
annotations obtained through this process should be maximal, with respect
to human abilities on speaker recognition tasks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>The paper is organized as follows: Section 2 presents the
human-assisted speaker diarization system. Section 3 presents the corpus and the
metrics, and Section 4 analyzes the results. Section 5 concludes
with a discussion of possible directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Human-assisted speaker diarization system</title>
      <p>The proposed speaker diarization prototype is designed to interact in
real time with a human user, who is in charge of correcting the predictions
of the system. The goal is to produce high-quality
diarization annotations at minimal human cost. Such a system could
be used to speed up the annotation of any speech corpus
requiring temporal speaker information.</p>
    </sec>
    <sec id="sec-3">
      <title>System overview</title>
      <p>In the following description, we assume that an easy-to-use interface
is provided to the user, and that the speech segments are presented
together with the speech transcription. We also assume that the
feedback of the user is limited to three actions:
        <list list-type="order">
          <list-item><p>Validation, when the speech segment has a correct speaker
label;</p></list-item>
          <list-item><p>Speaker label modification, when the speech segment has an
incorrect speaker label;</p></list-item>
          <list-item><p>Speaker label creation, for speakers encountered for the first
time in the recording.</p></list-item>
        </list>
      </p>
      <p>Actions such as splitting a speech segment or modifying segment
boundaries are outside the scope of this paper.</p>
      <p>The annotated speech segments of the whole recording
are presented to the annotator in chronological order. This choice was
made in order to ease the manual speaker recognition task, on the
assumption that the media chronology provides the annotator with a
better understanding of the speech material. The annotator has to
correct or validate the predictions of the diarization system. Our
working assumption is that a correction requires more time from the annotator
than a validation.</p>
      <p>Figure 1 describes the proposed active-learning system. Each
annotator correction triggers a real-time
re-estimation of the labels of the remaining speech segments to be
presented. This method aims at improving the quality of the
subsequent diarization predictions, resulting in fewer
corrections for the annotator and thus less time
spent on manual correction. The system is composed of three main
steps, which are detailed in the next sections. The last two steps
are repeated until all the segments have been checked. Let us give a brief
description of these stages:
        <list list-type="simple">
          <list-item><p>Initialization: an initial diarization is performed with a
fully automatic speaker diarization system. This step can be
time-consuming and is performed offline.</p></list-item>
          <list-item><p>User input: the annotator checks each segment, and validates or
corrects the speaker label before inspecting the next segment.</p></list-item>
          <list-item><p>Real-time reassignment: each annotator modification
triggers a re-evaluation of the speaker labels of the
next speech segments to be presented. The computations performed
during this step should be fast enough to allow real-time
interaction with a human user.</p></list-item>
        </list>
      </p>
      <p>[Figure 1: overview of the proposed active-learning system. Input: ground truth segmentation. Stage 1, initialization: speaker diarization (BIC clustering). Stage 2, user input (validate or correct speaker label). Stage 3, real-time reassignment (re-evaluation of speaker label). Output: correct speaker diarization.]</p>
      <p>
        The speaker diarization system is inspired by the system described
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It was developed for the transcription and diarization tasks,
with the goal of minimizing both word error rate and speaker error
rate. It rests upon a segmentation and a hierarchical agglomerative
clustering. Furthermore, this system uses MFCC features as audio
descriptors [
        <xref ref-type="bibr" rid="ref17 ref2 ref7">2, 7, 17</xref>
        ].
      </p>
      <p>The system is composed of a segmentation step followed by a
clustering step. Speaker diarization needs to produce homogeneous
speech segments. Errors such as having two distinct clusters (i.e.,
detected speakers) corresponding to the same real speaker can be
easily corrected by merging both clusters. In this article, we focus
on the clustering step, and the segmentation step relies on a
perfect manual segmentation (ground truth).</p>
      <p>The clustering algorithm is a hierarchical
agglomerative clustering. The initial set of clusters is composed of one segment
per cluster. Each cluster is modeled by a Gaussian with a full
covariance matrix. The BIC measure (cf. equation 1) is used both to
select the candidate clusters to merge and to stop the merging
process. The two closest clusters i and j are merged at each iteration
until BIC(i, j) &gt; 0.</p>
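      <p>Let |Σ<sub>i</sub>|, |Σ<sub>j</sub>| and |Σ| be the determinants of the covariance matrices of the Gaussians associated
with the clusters i, j and i + j, and let λ be a parameter to be tuned. The
penalty factor P (eq. 2) depends on d, the dimension of the features,
as well as on n<sub>i</sub> and n<sub>j</sub>, the total lengths of cluster i and
cluster j respectively. The BIC(i, j) measure between the clusters
i and j is then defined as follows:</p>
      <p>
        <disp-formula id="eq1"><label>(1)</label><tex-math>\mathrm{BIC}(i, j) = \frac{n_i + n_j}{2} \log |\Sigma| - \frac{n_i}{2} \log |\Sigma_i| - \frac{n_j}{2} \log |\Sigma_j| - \lambda P</tex-math></disp-formula>
        with
        <disp-formula id="eq2"><label>(2)</label><tex-math>P = \frac{1}{2} \left( d + \frac{d (d + 1)}{2} \right) \log (n_i + n_j)</tex-math></disp-formula>
      </p>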
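      <p>As an illustration of equations 1 and 2, the following Python sketch computes BIC(i, j) for two sets of feature frames and runs the greedy merging loop described above. It is a minimal sketch under stated assumptions, not necessarily the implementation used in the experiments: the function names, the use of NumPy, the maximum-likelihood covariance estimate, the standard BIC penalty for a full-covariance Gaussian and the default value λ = 1.0 are all choices made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def log_det_cov(frames):
    """Log-determinant of the full covariance matrix of an (n, d) frame array."""
    cov = np.cov(frames, rowvar=False, bias=True)  # maximum-likelihood estimate
    _, logdet = np.linalg.slogdet(np.atleast_2d(cov))
    return logdet


def delta_bic(frames_i, frames_j, lam=1.0):
    """BIC(i, j) as in equation 1; a negative value favours merging i and j."""
    n_i, n_j = len(frames_i), len(frames_j)
    d = frames_i.shape[1]
    merged = np.vstack([frames_i, frames_j])
    # Penalty of equation 2 (assumed: standard full-covariance BIC penalty).
    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(n_i + n_j)
    return (0.5 * (n_i + n_j) * log_det_cov(merged)
            - 0.5 * n_i * log_det_cov(frames_i)
            - 0.5 * n_j * log_det_cov(frames_j)
            - lam * penalty)


def bic_clustering(segments, lam=1.0):
    """Greedy agglomerative clustering: merge the closest pair until BIC > 0.

    `segments` is a list of (n_k, d) arrays, one per speech segment.
    Returns a list of clusters, each a list of segment indices.
    """
    clusters = [[k] for k in range(len(segments))]  # one segment per cluster
    data = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = delta_bic(data[a], data[b], lam)
                if best is None or score < best[0]:
                    best = (score, a, b)
        score, a, b = best
        if score > 0:  # stopping criterion: no pair is worth merging
            break
        clusters[a] += clusters.pop(b)
        data[a] = np.vstack([data[a], data.pop(b)])
    return clusters
```
]]></preformat>
      <p>In this sketch a larger λ strengthens the penalty and therefore leads to more merges and fewer clusters. The exhaustive pair search is quadratic in the number of clusters at each iteration, which is acceptable for an illustration but would probably need caching of pairwise scores on long recordings.</p>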
      <p>
        This speaker diarization system is the first stage of most
state-of-the-art systems for TV or radio recordings, such as those based on GMMs
or i-vectors [
        <xref ref-type="bibr" rid="ref1 ref8">1, 8</xref>
        ]. GMMs and i-vectors are both statistical models
that represent audio data. The generated clusters have a high purity
(i.e. each cluster contains mostly one speaker) and the system is
fast.
      </p>
      <p>User input consists of validating, or correcting, the speaker labels
estimated by the diarization system. In the proposed active-learning
strategy, each correction, defined as a mismatch
between the speakers C<sub>i</sub> and C<sub>j</sub>, triggers the computation of new speaker
models, trained on the validated speech segments. The resulting
models are based on a single Gaussian, which is fast to compute, and are
assumed to be more accurate than the models inferred during the
initialization. These simple speaker models are then used to re-estimate
the BIC distance to the remaining speech segments involved in
the last mismatch (segments attributed to C<sub>i</sub> and C<sub>j</sub> only).</p>
      <p>An illustration of these interactions is provided in figure 2. In this
example, four speakers (A, B, C, D) have been inferred through the
automatic initialization step. The user has manually validated the
first four speech segments (S1...S4) before reporting a speaker label
modification for segment S5, tagged as speaker B instead of speaker
A. In response, the active-learning system
creates speaker models for the mismatching speakers only (A and B).
These models are used to re-estimate the labels of the remaining
segments tagged with A or B (segments S7, S8 and S11), which may lead
to a speaker label modification (segment S11). Remaining speech
segments tagged with other labels (C and D) are not re-estimated.
The modified diarization is updated before the annotator moves to
the next segment S6. The process iterates until the last segment is
reached.</p>
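      <p>The reassignment step illustrated above can be sketched in Python as follows. This is only a hedged illustration: the function and parameter names are hypothetical, the speaker models are represented by pooling the frames of the validated segments (which amounts to a single Gaussian per speaker), and the decision rule, relabelling each unchecked segment with the speaker whose model yields the lowest BIC value, is an assumption about how the re-estimated BIC distances are used.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def _log_det_cov(frames):
    """Log-determinant of the full covariance of an (n, d) frame array."""
    cov = np.cov(frames, rowvar=False, bias=True)
    return np.linalg.slogdet(np.atleast_2d(cov))[1]


def bic(frames_i, frames_j, lam=1.0):
    """BIC measure between two sets of frames (lower means closer)."""
    n_i, n_j = len(frames_i), len(frames_j)
    d = frames_i.shape[1]
    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(n_i + n_j)
    return (0.5 * (n_i + n_j) * _log_det_cov(np.vstack([frames_i, frames_j]))
            - 0.5 * n_i * _log_det_cov(frames_i)
            - 0.5 * n_j * _log_det_cov(frames_j)
            - lam * penalty)


def reassign(features, labels, checked, spk_a, spk_b, lam=1.0):
    """Relabel unchecked segments tagged spk_a or spk_b after a correction.

    features: list of (n_k, d) arrays, one per segment, in temporal order
    labels:   current speaker labels (validated labels for checked segments)
    checked:  set of indices already validated or corrected by the annotator
    Returns a new label list; segments with other labels are left untouched.
    """
    models = {}
    for spk in (spk_a, spk_b):
        frames = [features[k] for k in checked if labels[k] == spk]
        if frames:
            models[spk] = np.vstack(frames)  # data of the single-Gaussian model
    if len(models) < 2:
        return list(labels)  # not enough validated data to compare speakers
    new_labels = list(labels)
    for k, label in enumerate(labels):
        if k in checked or label not in models:
            continue
        # Assumed decision rule: keep the speaker with the lowest BIC value.
        new_labels[k] = min(models, key=lambda s: bic(models[s], features[k], lam))
    return new_labels
```
]]></preformat>
      <p>In the example of figure 2, such a function would be called with the speakers A and B once segment S5 has been corrected, and only the unchecked segments currently tagged A or B would be candidates for a label change.</p>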
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
    </sec>
    <sec id="sec-5">
      <title>Corpus</title>
      <p>Experiments were performed on TV recordings drawn from the
corpora of the ANR-REPERE challenge<sup>4</sup>. ANR-REPERE is a
challenge organized by the LNE (French national laboratory of
metrology and testing) and ELDA (Evaluations and Language resources
Distribution Agency) from 2010 to 2014. It addresses
the multimedia recognition of people in television documents:
the aim is to find, at each instant of a television show, the identities of the people who speak, along with the
quoted and written names. The shows were
recorded from two French digital terrestrial television channels (BFM and LCP).</p>
      <p>The ANR-REPERE project started in 2010, and evaluations
were held in 2013 and 2014. In this paper, we merge the 2013
and 2014 evaluation corpora to build the corpus called
REPERE in the following sections. Table 1 gives some statistics
about this corpus. The durations reported in table 1 show that only a
part of the data is annotated and evaluated.</p>
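      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Statistics about the REPERE corpus.</p></caption>
        <table>
          <thead>
            <tr><th>Statistics</th><th>REPERE</th></tr>
          </thead>
          <tbody>
            <tr><td>Show number</td><td>15</td></tr>
            <tr><td>Recording number</td><td>90</td></tr>
            <tr><td>Recording time</td><td>34h30</td></tr>
            <tr><td>Annotation time</td><td>13h11</td></tr>
            <tr><td>Speaker number</td><td>571</td></tr>
          </tbody>
        </table>
      </table-wrap>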
      <p>
        Current diarization systems are less effective on spontaneous
speech, mainly present in debates, than on prepared speech from
news [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We have chosen this corpus because of the variety of its
shows. The corpus is balanced between prepared and spontaneous
speech and is composed of street interviews, debates and news shows.
      </p>
      <p>It is common to accept a ±250 millisecond tolerance on segment
boundaries for recordings with prepared speech and far less for
recordings with spontaneous speech. Since we use a
reference segmentation for the segmentation step, segmentation errors
should not occur. Therefore, we do not use any tolerance
on segment boundaries.</p>
      <p>
        Most diarization systems are not able to detect overlapping
speech zones [
        <xref ref-type="bibr" rid="ref16 ref24 ref4">4, 16, 24</xref>
        ]. In the experiments described below,
we remove overlapping speech from the evaluation and consider it as a
non-speech area. Figure 3 shows the segment durations after the
removal of overlapping speech.
      </p>
      <p>4 http://www.defi-repere.fr/</p>
      <p>
        The metric used to measure performance in the speaker diarization
task is the Diarization Error Rate (DER) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. DER was introduced
by NIST as the fraction of speaking time which is not attributed
to the correct speaker, using the best matching between reference
and hypothesis speaker labels. The scoring tool is available in the
sidekit/s4d toolkit [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        In order to evaluate the impact of a reassignment, we use the
percentage of pure clusters with respect to the total number of clusters.
We also use the well-known purity as defined in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which is the ratio
between the number of frames of the dominant speaker in a cluster
and the total number of frames in this cluster. This measure is used
to evaluate the purity of hypothesis clusters according to the
assignment provided by reference clusters. To evaluate the actions
performed by a human, we simply use counters, which are
reported as percentages in this paper.
      </p>
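      <p>The two purity indicators can be sketched in a few lines of Python. The function name and its inputs (one reference label, one hypothesis label and one duration in frames per segment) are assumptions made for this example, and the overall purity is computed here as a frame-weighted figure over all clusters, which is one possible way of aggregating the per-cluster ratio defined above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def cluster_purity(ref_labels, hyp_labels, durations):
    """Return (frame-weighted purity, percentage of 100% pure clusters).

    ref_labels: reference speaker of each segment
    hyp_labels: hypothesis cluster of each segment
    durations:  number of frames of each segment
    """
    frames = defaultdict(Counter)  # hypothesis cluster -> frames per speaker
    for ref, hyp, dur in zip(ref_labels, hyp_labels, durations):
        frames[hyp][ref] += dur
    # Frames belonging to the dominant speaker of each cluster.
    dominant = sum(max(count.values()) for count in frames.values())
    total = sum(sum(count.values()) for count in frames.values())
    # A cluster is pure when it contains a single reference speaker.
    pure = sum(1 for count in frames.values() if len(count) == 1)
    return dominant / total, 100.0 * pure / len(frames)
```
]]></preformat>
      <p>For instance, with two hypothesis clusters, one holding 20 frames of speaker A and 5 frames of speaker B and the other holding 15 frames of speaker B only, the purity is 35/40 = 0.875 while only 50% of the clusters are pure.</p>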
      <p>
        <bold>Keystroke Saving Rate.</bold>
The DER and the purity measure the quality of a diarization. Evaluating
the user input is difficult, as the proposed metric needs
to be as reproducible and objective as possible [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In our
case, the human interactions are simulated.
      </p>
      <p>
        The proposed method is inspired by previous work on
computer-assisted transcription [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], in which the authors proposed
to evaluate the human interactions with the Keystroke Saving Rate
(KSR) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>The KSR was developed for AAC (Augmentative
and Alternative Communication) systems, which are used by people with
disabilities. It is computed from the number of keystrokes
made by the user to write a message. In our case, the keystrokes
correspond to the actions made by the annotator to
correct the diarization. To compute the KSR, we assume that the
annotator will always choose the best strategy to minimize the number of
actions.</p>
      <p>We suppose here that the annotator can perform two kinds of actions
on the current segment: reassigning it to another cluster
(reassignment) or assigning it to a new cluster (creation). The annotator
creates a new cluster when the first segment of a given speaker is
checked. The number of creations in the whole document, denoted
by n<sub>c</sub>, is constant: it depends neither on the reassignments nor on the threshold used in
equation 1. Similarly, the total number of segments reassigned
by the user is denoted by n<sub>r</sub> and the number of segments by n<sub>s</sub>. We
define the KSR as the ratio of the sum of the number of created
clusters n<sub>c</sub> and the number of reassigned segments n<sub>r</sub> to the number of segments n<sub>s</sub> in
the initial diarization (equation 3):</p>
      <p>
        <disp-formula id="eq3"><label>(3)</label><tex-math>\mathrm{KSR} = \frac{n_c + n_r}{n_s} \times 100</tex-math></disp-formula>
      </p>
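      <p>Equation 3 translates directly into code. The short Python function below is a sketch whose name and argument names are chosen for this example only; it assumes that the counts of creations, reassignments and segments have already been obtained from the simulated annotation session.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def keystroke_saving_rate(n_created, n_reassigned, n_segments):
    """KSR of equation 3, in percent: (n_c + n_r) / n_s * 100."""
    if n_segments <= 0:
        raise ValueError("n_segments must be a positive number of segments")
    return 100.0 * (n_created + n_reassigned) / n_segments
```
]]></preformat>
      <p>For example, a recording of 20 segments that requires 4 cluster creations and 6 reassignments gives a KSR of 50%.</p>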
      <p>A KSR equal to 0% corresponds to a perfect speaker diarization in
which each segment is assigned to its true speaker. In
this case, the annotator does not reassign any segment. Conversely,
a KSR equal to 100% corresponds to the worst speaker diarization, in
which each segment is assigned to the wrong speaker. In that case, the
annotator needs to change the assignment of all the segments if the
corrections are not gradually propagated to the rest of the document.</p>
      <p>
        The speaker diarization is based on hierarchical clustering where
each speaker is modeled by a gaussian with a full covariance
computed over acoustic features. The acoustic features are composed of
12 MFCCs with energy, and are not normalized (the background
channel helps to segment and cluster the speaker) [
        <xref ref-type="bibr" rid="ref17 ref2 ref7">2, 7, 17</xref>
        ].
      </p>
      <p>As mentioned previously, we use the ground truth segmentation
as input of the clustering algorithm (corresponding to the stage 1 in
figure 1) and the overlapping speaker segments are removed in the
ground truth.</p>
      <p>
        Figure 4 shows the DER of the speaker diarization for different
values of the threshold λ (cf. equation 1). The lowest DER is 9.9%, for a threshold
of 4.0. Compared to the literature [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], this DER is rather low, which is
mainly due to the ground truth segmentation: the segments contain the
voice of a single speaker, overlapping segments are removed, and
there are no missed speech and no false alarm speech segments.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Active-learning system</title>
      <p>In our experiments, the human annotator is simulated with the ground
truth speaker annotations. The main objective is to decrease the
number of actions performed by an annotator to obtain a perfect
diarization. To reach this goal, we compare the KSR obtained with or
without the human corrections taken into consideration (i.e. with or
without an active-learning reassignment) using various thresholds for
the speaker diarization.</p>
      <p>The real-time segment reassignment stage (stage 3 in figure 1) uses
the same parameters as the initial diarization (12 MFCCs with energy, full-covariance
Gaussians and the BIC measure) to label the unchecked segments.</p>
      <p>Figure 5 gives the KSR of the system with real-time reassignment
(stages 2 &amp; 3) and of the system without real-time
reassignment (stage 2 only). In both systems, the KSR decreases until λ = 3.5
and increases for higher values of λ. When λ is equal to 3.5, the KSR is 56.5%
without reassignment and 49.7% with reassignment.
About half of the segments are manually corrected
(49.7%), and 6.8% of the segments, in absolute value, are automatically reassigned to the correct
speaker after a user correction.</p>
      <p>In the most favorable case, when λ is 3.5, the DER is low (about
10%) and the average cluster purity is equal to 90%. At the same time,
only 60% of the clusters are 100% pure (cf. figure 4). The difference
between these indicators can be explained by the fact that, unlike
the DER, the KSR does not take into account the duration of the
segments. Most of the errors come from the short segments, which
are numerous (cf. figure 3).</p>
      <p>The KSR remains almost constant when λ is greater than 3.5 in the
system with reassignment, whereas the choice of λ is
more critical to minimize the number of actions in the system without
reassignment. Finally, one can notice that the system with
reassignment always obtains a lower KSR whatever the value of λ, except for
λ = 0, where the KSR is equal to 100% in both cases.</p>
      <p>After each user correction, the unchecked segments are clustered
again in the reassignment stage. The process is fast:
it takes less than 0.03 second in 95% of cases, so
this stage can be done in real time without
any impact on the user interface (figure 6).</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion &amp; prospects</title>
      <p>In this paper, we seek to help a human segment
and cluster the speakers in an audio or audio-visual document. The
proposed method takes the annotator's corrections into account
by modifying the allocation of the unchecked segments. This
computer-assisted method yields a
noticeable reduction in the number of required corrections. Not only is our
method effective, but the corrections are also computed quickly. Thanks
to its fast processing, it could be used in a real application
without impacting the responsiveness of the interface and without increasing
the workload of the annotator.</p>
      <p>Several improvements could build on this
preliminary work. First, we plan to minimize the number of user
actions by applying a constrained clustering to reassign all unchecked
segments and to create or delete clusters. Another improvement
would be to integrate the automatic segmentation into the correction
process.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research was partially supported by the European
Commission, as part of the Event Understanding through Multimodal Social
Stream Interpretation (EUMSSI) project (contract number
FP7-ICT-2013-10), in which the LIUM is involved.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Anguera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bozonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , G. Friedland, and
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , '
          <article-title>Speaker diarization: A review of recent research</article-title>
          ',
          <volume>20</volume>
          (
          <issue>2</issue>
          ),
          <fpage>356</fpage>
          -
          <lpage>370</lpage>
          , (
          <year>Feb 2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meignier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <article-title>'Multi-stage speaker diarization of broadcast news'</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech and Language Processing</source>
          ,
          <volume>14</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1505</fpage>
          -
          <lpage>1512</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Thierry</given-names>
            <surname>Bazillon</surname>
          </string-name>
          , Yannick Estève, and Daniel Luzzati, '
          <article-title>Transcription manuelle vs assistée de la parole préparée et spontanée'</article-title>
          ,
          <source>Revue TAL</source>
          , (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kofi</given-names>
            <surname>Boakye</surname>
          </string-name>
          , Beatriz Trueba-Hornero,
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , and Gerald Friedland, '
          <article-title>Overlapped speech detection for improved speaker diarization in multiparty meetings'</article-title>
          , in Acoustics,
          <source>Speech and Signal Processing</source>
          ,
          <year>2008</year>
          .
          <article-title>ICASSP 2008</article-title>
          . IEEE International Conference on, pp.
          <fpage>4353</fpage>
          -
          <lpage>4356</lpage>
          . IEEE, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Mbarek</given-names>
            <surname>Charhad</surname>
          </string-name>
          , Daniel Moraru, Stéphane Ayache, and Georges Quénot, '
          <article-title>Speaker identity indexing in audio-visual documents'</article-title>
          ,
          <source>in Content-Based Multimedia Indexing (CBMI2005)</source>
          , (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Dufour</surname>
          </string-name>
          , Vincent Jousse, Yannick Estève, Fréderic Béchet, and Georges Linarès, '
          <article-title>Spontaneous speech characterization and detection in large audio database</article-title>
          ', <source>SPECOM</source>, St. Petersburg, (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Grégor</given-names>
            <surname>Dupuy</surname>
          </string-name>
          ,
          <article-title>Les collections volumineuses de documents audiovisuels: segmentation et regroupement en locuteurs</article-title>
          ,
          <source>Ph.D. dissertation</source>
          , Université du Maine,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Grégor</given-names>
            <surname>Dupuy</surname>
          </string-name>
          , Sylvain Meignier, Paul Deléglise, and Yannick Esteve, '
          <article-title>Recent improvements on ilp-based clustering for broadcast news speaker diarization'</article-title>
          ,
          <source>in Proc. Odyssey Workshop</source>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jean-Luc</given-names>
            <surname>Gauvain</surname>
          </string-name>
          , Lori Lamel, and Gilles Adda, '
          <article-title>Partitioning and transcription of broadcast news data</article-title>
          .',
          <source>in ICSLP</source>
          , volume
          <volume>98</volume>
          , pp.
          <fpage>1335</fpage>
          -
          <lpage>1338</lpage>
          , (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Geoffrois</surname>
          </string-name>
          , '
          <article-title>Evaluating interactive system adaptation'</article-title>
          ,
          <source>in The International Conference on Language Resources and Evaluation</source>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jerry</given-names>
            <surname>Goldman</surname>
          </string-name>
          , Steve Renals, Steven Bird, Franciska De Jong, Marcello Federico, Carl Fleischhauer, Mark Kornbluh, Lori Lamel, Douglas W Oard, Claire
          <string-name>
            <surname>Stewart</surname>
          </string-name>
          , et al.,
          <article-title>'Accessing the spoken word'</article-title>
          ,
          <source>International Journal on Digital Libraries</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ),
          <fpage>287</fpage>
          -
          <lpage>298</lpage>
          , (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Hervé</surname>
          </string-name>
          , Pierre Letessier, Mathieu Derval, and Hakim Nabi, '
          <article-title>Amalia.js: An open-source metadata driven html5 multimedia player'</article-title>
          ,
          <source>in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM '15</source>
          , pp.
          <fpage>709</fpage>
          -
          <lpage>712</lpage>
          , New York, NY, USA, (
          <year>2015</year>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Juliette</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <article-title>Parole de locuteur: performance et confiance en identification biométrique vocale</article-title>
          ,
          <source>Ph.D. dissertation, Avignon</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Anthony</given-names>
            <surname>Larcher</surname>
          </string-name>
          , Kong Aik Lee, and Sylvain Meignier, '
          <article-title>An extensible speaker identification sidekit in python'</article-title>
          ,
          <source>in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pp.
          <fpage>5095</fpage>
          -
          <lpage>5099</lpage>
          . IEEE, (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Laurent</surname>
          </string-name>
          , Sylvain Meignier, Teva Merlin, and Paul Deléglise, '
          <article-title>Computer-assisted transcription of speech based on confusion network reordering'</article-title>
          ,
          <source>in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on</source>
          , pp.
          <fpage>4884</fpage>
          -
          <lpage>4887</lpage>
          . IEEE, (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Anguera Miro</surname>
          </string-name>
          , Simon Bozonnet, Nicholas Evans,
          <string-name>
            <given-names>Corinne</given-names>
            <surname>Fredouille</surname>
          </string-name>
          , Gerald Friedland, and Oriol Vinyals, '
          <article-title>Speaker diarization: A review of recent research'</article-title>
          ,
          <source>Audio, Speech, and Language Processing, IEEE Transactions on</source>
          ,
          <volume>20</volume>
          (
          <issue>2</issue>
          ),
          <fpage>356</fpage>
          -
          <lpage>370</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Lindasalwa</given-names>
            <surname>Muda</surname>
          </string-name>
          , Mumtaj Begam, and
          <string-name>
            <given-names>I</given-names>
            <surname>Elamvazuthi</surname>
          </string-name>
          , '
          <article-title>Voice recognition algorithms using mel frequency cepstral coefficient (mfcc) and dynamic time warping (dtw) techniques'</article-title>
          ,
          <source>arXiv preprint arXiv:1003.4083</source>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>NIST</surname>
          </string-name>
          .
          <article-title>The rich transcription spring 2003 (RT-03S) evaluation plan</article-title>
          . http://www.itl.nist.gov/iad/mig/tests/rt/2003-spring/docs/rt03-spring-eval-plan-v4.pdf
          ,
          <month>February</month>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Roeland</given-names>
            <surname>Ordelman</surname>
          </string-name>
          , Franciska De Jong, and Martha Larson, '
          <article-title>Enhanced multimedia content access and exploitation using semantic speech retrieval'</article-title>
          ,
          <source>in Semantic Computing, 2009. ICSC'09. IEEE International Conference on</source>
          , pp.
          <fpage>521</fpage>
          -
          <lpage>528</lpage>
          . IEEE, (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Julien</given-names>
            <surname>Pinquier</surname>
          </string-name>
          and
          <string-name>
            <given-names>Régine</given-names>
            <surname>André-Obrecht</surname>
          </string-name>
          , '
          <article-title>Audio indexing: primary components retrieval'</article-title>
          ,
          <source>Multimedia tools and applications</source>
          ,
          <volume>30</volume>
          (
          <issue>3</issue>
          ),
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Sue E</given-names>
            <surname>Tranter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Douglas A</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , '
          <article-title>An overview of automatic speaker diarization systems'</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          ,
          <volume>14</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1557</fpage>
          -
          <lpage>1565</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Félicien</given-names>
            <surname>Vallet</surname>
          </string-name>
          , Jim Uro, Jérémy Andriamakaoly, Hakim Nabi, Mathieu Derval, and Jean Carrive, '
          <article-title>Speech trax: A bottom to the top approach for speaker tracking and indexing in an archiving context'</article-title>
          ,
          <source>in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC)</source>
          .
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Matthew EJ</given-names>
            <surname>Wood</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Lewis</surname>
          </string-name>
          , '
          <article-title>Windmill-the use of a parsing algorithm to produce predictions for disabled persons'</article-title>
          ,
          <source>Proceedings-Institute of Acoustics</source>
          ,
          <volume>18</volume>
          ,
          <fpage>315</fpage>
          -
          <lpage>322</lpage>
          , (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Zelenák</surname>
          </string-name>
          and Javier Hernando, '
          <article-title>The detection of overlapping speech with prosodic features for speaker diarization</article-title>
          .',
          <source>in INTERSPEECH</source>
          , pp.
          <fpage>1041</fpage>
          -
          <lpage>1044</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>