=Paper= {{Paper |id=Vol-1801/paper5 |storemode=property |title=An Active Learning Method for Speaker Identity Annotation in Audio Recordings |pdfUrl=https://ceur-ws.org/Vol-1801/paper5.pdf |volume=Vol-1801 |authors=Pierre-Alexandre Broux,David Doukhan,Simon Petitrenaud,Sylvain Meignier,Jean Carrive |dblpUrl=https://dblp.org/rec/conf/ecai/BrouxDPMC16 }} ==An Active Learning Method for Speaker Identity Annotation in Audio Recordings== https://ceur-ws.org/Vol-1801/paper5.pdf
               An active learning method for speaker identity
                       annotation in audio recordings
                        Broux Pierre-Alexandre1,2 and Doukhan David1 and Petitrenaud Simon2
                                         and Meignier Sylvain1 and Carrive Jean2


1 Computer science laboratory of the University of Maine (LIUM - EA 4023), Le Mans, France
2 French National Audiovisual Institute (INA), Paris, France
3 http://www.ina.fr


Abstract. Given that manual annotation of speech is an expensive and long process, we attempt in this paper to assist an annotator in performing a speaker diarization. This assistance takes place in an annotation context for a large amount of archives. We propose a method which decreases the number of interventions required from a human annotator. This method corrects a diarization by taking the human interventions into account. The experiment is done using French broadcast TV shows drawn from the ANR-REPERE evaluation campaign. Our method is mainly evaluated in terms of KSR (Keystroke Saving Rate), and we reduce the number of actions needed to correct a speaker diarization output by 6.8% in absolute value.

1 Introduction

The work presented in this paper has been realized to meet the needs of the French national audiovisual institute^3 (INA). INA is a public institution in charge of the digitalization, preservation, distribution and dissemination of the French audiovisual heritage. Annotations related to speaker identity, together with speech transcription, meet several use-cases. Temporal localization of speaker interventions can be used to enhance the navigation within a media [12, 22]. It may also be used to perform complex queries within media databases [5, 11, 19].

This article focuses on the realization of human-assisted speaker diarization systems. Speaker diarization methods consist in estimating "who spoke when" in an audio stream [2]. This media structuring process is an efficient pre-processing step, for instance to help segment a broadcast news show into anchors and reports before manual documentation processes. Speaker diarization algorithms are generally based on unsupervised machine learning methods [21], in charge of estimating the number of speakers and splitting the audio stream into labelled speech segments assigned to hypothesized speakers. Speaker identity and temporal localization are known to be pertinent information for the access and exploitation of speech recordings [5, 20]. However, the accuracy of automatic state-of-the-art speaker recognition methods is still inadequate to be embedded into INA's archiving or media enhancement applications, and a human intervention is required to obtain an optimal description of a speech archive.

Manual annotation of speech is a very expensive process: nine hours are required to perform the manual annotation corresponding to one hour of spontaneous speech (speech transcription and speaker identity). Previous studies have shown that the speech annotation process may be sped up using the output of automatic speech recognition (ASR) systems together with speech turn annotations [3]. The resulting annotation task consists in correcting the output of automatic systems, instead of doing the whole annotation manually.

The model proposed in this paper is an active-learning extension of this paradigm, applied to the speaker diarization task. Annotator corrections are used in real-time to update the estimations of the speaker diarization system. The aim of this update strategy is to lower the amount of manual corrections to be done, which impacts the time spent interacting with the system. The quality of the annotations obtained through this process should be maximal, with respect to human abilities on speaker recognition tasks [13].

The paper is organized as follows: Section 2 presents the human-assisted speaker diarization system. Section 3 presents the corpus and the metrics, whereas Section 4 analyzes the results. Section 5 concludes with a discussion of possible directions for future work.

2 Human-assisted speaker diarization system

The proposed speaker diarization prototype is aimed at interacting in real-time with a human user, in charge of correcting the predictions of the system. This system is aimed at producing high-quality diarization annotations with a minimal human cost. Such a system could be used to speed up the annotation process of any speech corpus requiring temporal speaker information.

2.1 System overview

In the following description, we assume that an easy-to-use interface is provided to the user, and that the speech segments are presented together with the speech transcription. We also assume that the feedback of the user is limited to three actions:

1. The validation, when the speech segment has a correct speaker label;
2. The speaker label modification, when the speech segment has an incorrect speaker label;
3. The speaker label creation, for speakers encountered for the first time in the recording.

Actions such as speech segment splits or speech segment boundary modifications are not taken into account in the scope of this paper.

Annotated speech segments corresponding to the whole recording are presented to the annotator. The segment presentation order follows the temporal occurrence of the segments. This choice has been made in order to ease the manual speaker recognition task, with the assumption that the media chronology provides the annotator with a
better understanding of the speech material. The annotator has to correct, or validate, the predictions of the diarization system. Our working paradigm is that a correction requires more time from the annotator than a validation.

Figure 1 describes the proposed active-learning system. The system consists in associating each annotator correction with a real-time re-estimation of the labels of the remaining speech segments to be presented. This method is aimed at improving the quality of the next diarization predictions, resulting in a lower amount of corrections to be done by the annotator, thus lowering the time required for the manual correction. The system is composed of three main steps, which will be detailed in the next sections. The last two steps are repeated until all the segments are checked. Let us give a brief description of these stages:

Initialization: an initial diarization is performed with a fully-automatic speaker diarization system. This step can be time consuming and is performed offline.

User input: the annotator checks each segment, and validates or corrects the speaker label before inspecting the next segment.

Real-time reassignment: the annotator modifications are associated with a re-evaluation of the speaker labels corresponding to the next speech segments to be presented. The computations realized during this step should be fast enough to allow real-time interaction with a human user.

Figure 1: Active-learning system [flowchart: ground truth segmentation → Stage 1: Initialization: speaker diarization (BIC clustering) → Stage 2: User input (validate or correct speaker label) → Stage 3: Real-time reassignment (re-evaluation of speaker labels) → correct speaker diarization; stages 2 and 3 loop until all segments are checked]

2.2 Initialization: speaker diarization

The speaker diarization system is inspired by the system described in [2]. It was developed for the transcription and diarization tasks, with the goal of minimizing both word error rate and speaker error rate. It rests upon a segmentation and a hierarchical agglomerative clustering. Furthermore, this system uses MFCC features as audio descriptors [2, 7, 17].

The system is composed of a segmentation step followed by a clustering step. Speaker diarization needs to produce homogeneous speech segments. Errors such as having two distinct clusters (i.e., detected speakers) corresponding to the same real speaker can easily be corrected by merging both clusters. In this article, we focus the study on the clustering step, and the segmentation step is based on a perfect manual segmentation (ground truth).

The clustering algorithm is based upon a hierarchical agglomerative clustering. The initial set of clusters is composed of one segment per cluster. Each cluster is modeled by a Gaussian with a full covariance matrix. The BIC measure (cf. equation 1) is employed to select the candidate clusters to group, as well as to stop the merging process. The two closest clusters i and j are merged at each iteration until BIC(i, j) > 0.

Let |\Sigma_i|, |\Sigma_j| and |\Sigma| be the determinants of the Gaussians associated to the clusters i, j and i + j, and let \lambda be a parameter to set up. The penalty factor P (eq. 2) depends on d, the dimension of the features, as well as on n_i and n_j, referring to the total length of cluster i and cluster j respectively. The BIC(i, j) measure between the clusters i and j is then defined as follows:

    BIC(i, j) = \frac{n_i + n_j}{2} \log|\Sigma| - \frac{n_i}{2} \log|\Sigma_i| - \frac{n_j}{2} \log|\Sigma_j| - \lambda P,    (1)

    with P = \frac{1}{2} \left( d + \frac{d(d+1)}{2} \right) \log(n_i + n_j).    (2)

This speaker diarization system is the first stage of most state-of-the-art systems for TV or radio recordings, such as those based on GMMs or i-vectors [1, 8]. GMMs and i-vectors are both statistical models which represent audio data. The generated clusters have a high purity (i.e. each cluster contains mostly only one speaker) and the system is fast.

2.3 User input and real-time reassignment

User input consists in validating, or correcting, the speaker labels estimated by the diarization system. The proposed active-learning strategy consists in associating each correction, defined as a mismatch between the speakers C_i and C_j, with the computation of new speaker models, trained on the validated speech segments. The resulting models are based on a single Gaussian, which is fast to compute, and assumed to be more accurate than the models inferred during the initialization. These simple speaker models are then used to re-estimate the BIC distance with the remaining speech segments involved in the last mismatch (segments attributed to C_i and C_j only).

Figure 2: Example of user input and reassignment
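As an illustration, the BIC-based agglomerative clustering of section 2.2 can be sketched as follows. This is a minimal sketch under our own assumptions, not the authors' implementation (which relies on the sidekit/s4d toolkit): each segment is an array of MFCC frames, and all function and parameter names are ours.

```python
import numpy as np

def log_det_cov(frames):
    """Log-determinant of the full covariance matrix of a cluster
    (rows are feature frames, e.g. MFCCs)."""
    cov = np.cov(frames, rowvar=False, bias=True)
    return np.linalg.slogdet(cov)[1]

def delta_bic(ci, cj, lam=3.5):
    """Equation (1): positive when clusters i and j should stay apart,
    negative when merging them is preferable."""
    ni, d = ci.shape
    nj = cj.shape[0]
    merged = np.vstack([ci, cj])
    # Penalty P of equation (2), weighted by the threshold lambda.
    penalty = 0.5 * (d + d * (d + 1) / 2) * np.log(ni + nj)
    return ((ni + nj) / 2 * log_det_cov(merged)
            - ni / 2 * log_det_cov(ci)
            - nj / 2 * log_det_cov(cj)
            - lam * penalty)

def bic_clustering(segments, lam=3.5):
    """Hierarchical agglomerative clustering: start from one segment per
    cluster and merge the two closest clusters until BIC(i, j) > 0."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > 1:
        pairs = [(delta_bic(clusters[i], clusters[j], lam), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best > 0:  # stopping criterion of section 2.2
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

With well-separated speakers, segments drawn from the same speaker yield a negative delta_bic (merge), while segments from different speakers yield a large positive value (keep apart).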
An illustration of these interactions is provided in figure 2. In this example, four speakers (A, B, C, D) have been inferred through the automatic initialization step. The user has manually validated the first four speech segments (S1...S4) before reporting a speaker label modification for segment S5, tagged as speaker B instead of speaker A. The resulting action of the active-learning system consists in creating speaker models for the mismatching speakers only (A and B). These models are used to re-estimate the labels of the remaining segments tagged with A or B (segments S7, S8 and S11), and may lead to a speaker label modification (segment S11). The remaining speech segments tagged with other labels (C and D) are not re-estimated. The modified diarization is updated before the annotator moves to the next segment S6. The process iterates until the last segments are reached.

3 Evaluation

3.1 Corpus

Experiments were performed on TV recordings drawn from the corpora of the ANR-REPERE challenge^4. ANR-REPERE is a challenge organized by the LNE (French national laboratory of metrology and testing) and ELDA (Evaluations and Language resources Distribution Agency) over 2010-2014. This challenge is a project in the area of the multimedia recognition of people in television documents. The aim is to find the identities of the people who speak, along with the quoted and written names, at each instant in a television show. The data comes from two French digital terrestrial television channels (BFM and LCP).

The ANR-REPERE project started in 2010, and evaluations were set up in 2013 and 2014. In this paper, we merge the 2013 evaluation corpus and the 2014 evaluation corpus to build the corpus called REPERE in the sections below. Table 1 gives some statistics about this corpus. The durations reported in table 1 show that only a part of the data is annotated and evaluated.

    Statistics         REPERE
    Show number        15
    Recording number   90
    Recording time     34h30
    Annotation time    13h11
    Speaker number     571

Table 1: 2013-2014 news and debate TV recordings from the REPERE corpus.

Current diarization systems are less efficient on the spontaneous speech mainly present in debates than on the prepared speech of news programs [6]. We have chosen this corpus because of the variety of its shows. The corpus is balanced between prepared and spontaneous speech, and composed of street interviews, debates and news shows.

It is common to accept a ±250 millisecond tolerance on segment boundaries for recordings with prepared speech, and far less for recordings with spontaneous speech. Since we use a reference segmentation for the segmentation step, we normally have no segmentation errors. Therefore, we do not use any tolerance on segment boundaries.

Most diarization systems are not able to detect overlapping speech zones [4, 16, 24]. In the experiments described below, we remove overlapping speech from the evaluation and consider it as a non-speech area. Figure 3 shows the segment durations after the overlapping speech deletion.

Figure 3: Segment duration of the REPERE corpus [histogram: quantity (%) against segment duration in seconds, log10 scale]

4 http://www.defi-repere.fr/

3.2 Metrics

3.2.1 Diarization

The metric used to measure performance in the speaker diarization task is the Diarization Error Rate (DER) [18]. The DER was introduced by NIST as the fraction of speaking time which is not attributed to the correct speaker, using the best matching between reference and hypothesis speaker labels. The scoring tool is available in the sidekit/s4d toolkit [14].

In order to evaluate the impact of a reassignment, we use the percentage of pure clusters with respect to the total number of clusters. We also use the well-known purity as defined in [9], which is the ratio between the number of frames of the dominating speaker in a cluster and the total number of frames in this cluster. This measure is used in order to evaluate the purity of hypothesis clusters according to the assignment provided by the reference clusters. To evaluate the actions applied by a human, we simply use counters. These counters will be expressed as percentages in this paper.

3.2.2 Keystroke Saving Rate

The DER and the purity measure the quality of a diarization. The evaluation of the user input is difficult, as the proposed metric needs to be as reproducible and objective as possible [10]. In our case, the human interactions are simulated.

The proposed method is inspired by a previous work on computer-assisted transcription [15]. In that paper, the authors proposed to evaluate the human interactions with the Keystroke Saving Rate (KSR) [23].

The KSR was developed for AAC (Augmentative and Alternative Communication) systems, so that handicapped persons can use them. It is computed according to the number of keyboard strokes made by the user to write a message. In our case, the strokes correspond to the number of actions made by the annotator to correct the diarization. To compute the KSR, we assume that the annotator always chooses the best strategy to minimize the number of actions.
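The cluster purity of section 3.2.1 can be sketched as follows. The frame-label representation and the helper name are our assumptions for illustration, not the scoring toolkit's API.

```python
from collections import Counter

def cluster_purity(hyp_clusters):
    """Per-cluster purity and the percentage of 100% pure clusters.

    `hyp_clusters` maps a hypothesis cluster name to the list of
    reference speaker labels of its frames (one label per frame)."""
    purities = {}
    for name, frame_labels in hyp_clusters.items():
        counts = Counter(frame_labels)
        # Ratio of the dominating speaker's frames to all frames in the cluster.
        purities[name] = counts.most_common(1)[0][1] / len(frame_labels)
    pure_pct = 100 * sum(p == 1.0 for p in purities.values()) / len(purities)
    return purities, pure_pct
```

For instance, a cluster with 90 frames of one speaker and 10 of another has a purity of 0.9 and does not count as a pure cluster.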
We suppose here that the annotator can make two kinds of actions for a current segment: the reassignment to another cluster (reassignment) or the assignment to a new cluster (creation). The annotator can create a new cluster when the first segment of a given speaker is checked. The number of creations in the whole document, denoted by n_c, is constant for any reassignment even if the threshold \lambda in equation 1 differs. Similarly, the total number of segments reassigned by the user is denoted by n_r, and the number of segments is n_s. We define the KSR as the ratio of the sum of the number of created clusters n_c and the number of reassigned segments n_r to the number of segments n_s in the initial diarization (equation 3):

    KSR = \frac{n_c + n_r}{n_s} \times 100.    (3)

A KSR equal to 0% corresponds to a perfect speaker diarization, in which each segment is assigned to the true corresponding speaker. In this case, the annotator does not reassign any segment. Conversely, a KSR equal to 100% corresponds to the worst speaker diarization, in which each segment is assigned to the wrong speaker. In that case, the annotator needs to change the assignment of all the segments, if the corrections are not gradually propagated in the rest of the document.

4 Results

4.1 Speaker diarization

The speaker diarization is based on a hierarchical clustering where each speaker is modeled by a Gaussian with a full covariance matrix computed over acoustic features. The acoustic features are composed of 12 MFCCs with energy, and are not normalized (the background channel helps to segment and cluster the speakers) [2, 7, 17].

As mentioned previously, we use the ground truth segmentation as input of the clustering algorithm (corresponding to stage 1 in figure 1), and the overlapping speaker segments are removed from the ground truth.

Figure 4 shows the DER of the speaker diarization for different thresholds \lambda (cf. equation 1). The lowest DER is 9.9%, for a threshold of 4.0. Compared to the literature [8], this DER is rather low, which is mainly due to the ground truth segmentation: the segments contain the voice of a single speaker, overlapping segments are removed, and there are no missed speech and no false alarm speech segments.

Figure 4: Initial diarization: DER, % of pure clusters and average cluster purity

4.2 Active-learning system

In our experiments, the human annotator is simulated with the ground truth speaker annotations. The main objective is to decrease the number of actions performed by an annotator to obtain a perfect diarization. To reach this goal, we compare the KSR obtained with or without the human corrections taken into consideration (i.e. with or without an active-learning reassignment), using various thresholds \lambda for the speaker diarization.

The real-time segment reassignment stage (stage 3 in figure 1) uses the same parameters as the initial diarization: 12 MFCCs + energy, full covariance Gaussians and the BIC measure to label the unchecked segments.

Figure 5 gives the KSR of the system with real-time reassignment (including stages 2 & 3) and of the system without real-time reassignment (including stage 2 only). The KSR decreases until \lambda = 3.5 in both systems and increases when \lambda is higher. The KSR is 56.5% without reassignment and 49.7% with reassignment when \lambda is equal to 3.5. About half of the segments are manually corrected (49.7%), and 6.8% in absolute value are reassigned to the correct speaker automatically after a user correction.

Figure 5: KSR [with and without real-time reassignment, as a function of the threshold \lambda]

In the most favorable case, when \lambda is at 3.5, the DER is low, about 10%, and the average cluster purity is equal to 90%. At the same time, only 60% of the clusters are 100% pure (cf. figure 4). The difference between these indicators can be explained by the fact that, unlike the DER, the KSR does not take into account the duration of the segments. Most of the errors come from the small segments, and these are numerous (cf. figure 3).

The KSR remains almost static when \lambda is greater than 3.5 in the system with reassignment, whereas the choice of the parameter \lambda is more critical to minimize the number of actions in the system without reassignment. Finally, one can notice that the system with reassignment always obtains a lower KSR whatever the \lambda value, except for \lambda = 0, where the KSR is equal to 100% in both cases.

Figure 6: Time of reassignment after each user correction [histogram: quantity (%) against reassignment duration in seconds, log10 scale]
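The KSR of equation 3 can be simulated on a sequence of segment labels, as sketched below. We assume the hypothesis labels are already mapped onto the reference speaker names, and we count one creation at each speaker's first segment, following the definition of n_c above; the helper name is ours.

```python
def keystroke_saving_rate(hyp_labels, ref_labels):
    """KSR of equation (3), computed over segments in presentation order.

    `hyp_labels` are the system's speaker labels, `ref_labels` the true
    speakers. A segment costs one action when its true speaker appears
    for the first time (creation) or when its label is wrong (reassignment)."""
    seen = set()
    n_c = n_r = 0
    for hyp, ref in zip(hyp_labels, ref_labels):
        if ref not in seen:
            seen.add(ref)
            n_c += 1   # first segment of this speaker: the annotator creates a cluster
        elif hyp != ref:
            n_r += 1   # wrong label on a known speaker: the annotator reassigns it
    return 100 * (n_c + n_r) / len(ref_labels)
```

Note that, under this definition, even a perfect diarization pays one creation per speaker, since n_c is constant whatever the initial clustering.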
After each user correction, the unchecked segments are clustered again in the reassignment stage. The process is generally fast: the reassignment takes less than 0.03 second in 95% of cases, so this stage can be performed in real time without any impact on the user interface (figure 6).

5 Conclusion & prospects

In this paper, we attempt to find a way to help a human to segment and cluster the speakers in an audio or audio-visual document. We propose a method that takes the annotator corrections into consideration by modifying the allocation of the unchecked segments. The proposed computer-assisted method allows us to obtain a noticeable reduction in the number of required corrections. Not only is our method effective, but the corrections are also made quickly. Thanks to its fast processing, it could be applied in a real application without impacting the reactivity of the interface and without increasing the work intensity of the annotator.

Some future improvements should be made on the basis of this preliminary work. Firstly, we plan to minimize the number of user actions by applying a constrained clustering to reassign all unchecked segments and to create or delete clusters. Another improvement would be to integrate the automatic segmentation in the correction process.

6 Acknowledgments

This research was partially supported by the European Commission, as part of the Event Understanding through Multimodal Social Stream Interpretation (EUMSSI) project (contract number FP7-ICT-2013-10), in which the LIUM is involved.

References

[11] Jerry Goldman, Steve Renals, Steven Bird, Franciska De Jong, Marcello Federico, Carl Fleischhauer, Mark Kornbluh, Lori Lamel, Douglas W Oard, Claire Stewart, et al., 'Accessing the spoken word', International Journal on Digital Libraries, 5(4), 287-298, (2005).
[12] Nicolas Hervé, Pierre Letessier, Mathieu Derval, and Hakim Nabi, 'Amalia.js: An open-source metadata driven html5 multimedia player', in Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM '15, pp. 709-712, New York, NY, USA, (2015). ACM.
[13] Juliette Kahn, Parole de locuteur: performance et confiance en identification biométrique vocale, Ph.D. dissertation, Avignon, 2011.
[14] Anthony Larcher, Kong Aik Lee, and Sylvain Meignier, 'An extensible speaker identification sidekit in python', in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5095-5099. IEEE, (2016).
[15] Antoine Laurent, Sylvain Meignier, Teva Merlin, and Paul Deléglise, 'Computer-assisted transcription of speech based on confusion network reordering', in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 4884-4887. IEEE, (2011).
[16] Xavier Anguera Miro, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, 'Speaker diarization: A review of recent research', Audio, Speech, and Language Processing, IEEE Transactions on, 20(2), 356-370, (2012).
[17] Lindasalwa Muda, Mumtaj Begam, and I Elamvazuthi, 'Voice recognition algorithms using mel frequency cepstral coefficient (mfcc) and dynamic time warping (dtw) techniques', arXiv preprint arXiv:1003.4083, (2010).
[18] NIST. The rich transcription spring 2003 (RT-03S) evaluation plan. http://www.itl.nist.gov/iad/mig/tests/rt/2003-spring/docs/rt03-spring-eval-plan-v4.pdf, February 2003.
[19] Roeland Ordelman, Franciska De Jong, and Martha Larson, 'Enhanced multimedia content access and exploitation using semantic speech retrieval', in Semantic Computing, 2009. ICSC'09. IEEE International Conference on, pp. 521-528. IEEE, (2009).
[20] Julien Pinquier and Régine André-Obrecht, 'Audio indexing: primary components retrieval', Multimedia tools and applications, 30(3), 313-330, (2006).
[21] Sue E Tranter and Douglas A Reynolds, 'An overview of automatic speaker diarization systems', IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1557-1565, (2006).
References                                                                    [22] Félicien Vallet, Jim Uro, Jérémy Andriamakaoly, Hakim Nabi, Math-
                                                                                   ieu Derval, and Jean Carrive, ‘Speech trax: A bottom to the top ap-
 [1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and           proach for speaker tracking and indexing in an archiving context’, in
     O. Vinyals, ‘Speaker diarization: A review of recent research’, 20(2),        Proceedings of the Tenth International Conference on Language Re-
     356–370, (Feb 2012).                                                          sources and Evaluation (LREC). European Language Resources Asso-
 [2] C. Barras, X. Zhu, S. Meignier, and J.L. Gauvain, ‘Multi-stage speaker        ciation (ELRA), (2016).
     diarization of broadcast news’, IEEE Transactions on Audio, Speech       [23] Matthew EJ Wood and Eric Lewis, ‘Windmill-the use of a parsing al-
     and Language Processing, 14(5), 1505–1512, (2006).                            gorithm to produce predictions for disabled persons’, PROCEEDINGS-
 [3] Thierry Bazillon, Yannick Estève, and Daniel Luzzati, ‘Transcription          INSTITUTE OF ACOUSTICS, 18, 315–322, (1996).
     manuelle vs assistée de la parole préparé et spontanée’, Revue TAL,      [24] Martin Zelenák and Javier Hernando, ‘The detection of overlapping
     (2008).                                                                       speech with prosodic features for speaker diarization.’, in INTER-
 [4] Kofi Boakye, Beatriz Trueba-Hornero, Oriol Vinyals, and Gerald Fried-         SPEECH, pp. 1041–1044, (2011).
     land, ‘Overlapped speech detection for improved speaker diarization
     in multiparty meetings’, in Acoustics, Speech and Signal Processing,
     2008. ICASSP 2008. IEEE International Conference on, pp. 4353–
     4356. IEEE, (2008).
 [5] Mbarek Charhad, Daniel Moraru, Stéphane Ayache, and Georges
     Quénot, ‘Speaker identity indexing in audio-visual documents’, in
     Content-Based Multimedia Indexing (CBMI2005), (2005).
 [6] Ruchard Dufour, Vincent Jousse, Yannick Estève, Fréderic Béchet, and
     Georges Linarès, ‘Spontaneous speech characterization and detection
     in large audio database’, SPECOM, St. Petersburg, (2009).
 [7] Grégor Dupuy, Les collections volumineuses de documents audiovi-
     suels: segmentation et regroupement en locuteurs, Ph.D. dissertation,
     Université du Maine, 2015.
 [8] Grégor Dupuy, Sylvain Meignier, Paul Deléglise, and Yannick Es-
     teve, ‘Recent improvements on ilp-based clustering for broadcast news
     speaker diarization’, in Proc. Odyssey Workshop, (2014).
 [9] Jean-Luc Gauvain, Lori Lamel, and Gilles Adda, ‘Partitioning and tran-
     scription of broadcast news data.’, in ICSLP, volume 98, pp. 1335–
     1338, (1998).
[10] Edouard Geoffrois, ‘Evaluating interactive system adaptation’, in The
     International Conference on Language Resources and Evaluation,
     (2016).