An active learning method for speaker identity annotation in audio recordings

Pierre-Alexandre Broux 1,2, David Doukhan 1, Simon Petitrenaud 2, Sylvain Meignier 1 and Jean Carrive 2

1 Computer science laboratory of the University of Maine (LIUM - EA 4023), Le Mans, France
2 French National Audiovisual Institute (Ina), Paris, France

Abstract. Manual annotation of speech is an expensive and lengthy process. In this paper, we attempt to assist an annotator performing speaker diarization, in the context of annotating a large amount of archives. We propose a method that decreases the number of human interventions: it corrects a diarization by taking the human interventions into account. Experiments are carried out on French broadcast TV shows drawn from the ANR-REPERE evaluation campaign. Our method is mainly evaluated in terms of KSR (Keystroke Saving Rate), and we reduce the number of actions needed to correct a speaker diarization output by 6.8% in absolute value.

1 Introduction

The work presented in this paper was carried out to meet the needs of the French national audiovisual institute (INA, http://www.ina.fr). INA is a public institution in charge of the digitalization, preservation, distribution and dissemination of the French audiovisual heritage. Annotations related to speaker identity, together with speech transcription, serve several use-cases. Temporal localization of speaker interventions can be used to enhance navigation within a media [12, 22]. It may also be used to perform complex queries within media databases [5, 11, 19].

This article focuses on the realization of human-assisted speaker diarization systems. Speaker diarization methods consist in estimating "who spoke when" in an audio stream [2]. This media structuring process is an efficient pre-processing step, for instance to help segment a broadcast news show into anchors and reports before manual documentation processes. Speaker diarization algorithms are generally based on unsupervised machine learning methods [21], in charge of estimating the number of speakers and splitting the audio stream into labelled speech segments assigned to hypothesized speakers.

Speaker identity and temporal localization are known to be pertinent information for the access and exploitation of speech recordings [5, 20]. However, the accuracy of automatic state-of-the-art speaker recognition methods is still inadequate for integration into INA's archiving or media enhancement applications, and a human intervention is required to obtain an optimal description of a speech archive.

Manual annotation of speech is a very expensive process: nine hours are required to perform the manual annotation corresponding to one hour of spontaneous speech (speech transcription and speaker identity). Previous studies have shown that the speech annotation process may be sped up using the output of automatic speech recognition (ASR) systems together with speech turn annotations [3]. The resulting annotation task then consists in correcting the output of automatic systems, instead of doing the whole annotation manually.

The model proposed in this paper is an active-learning extension of this paradigm, applied to the speaker diarization task. Annotator corrections are used in real time to update the estimations of the speaker diarization system. The aim of this update strategy is to lower the amount of manual corrections to be done, which impacts the time spent interacting with the system. The quality of the annotations obtained through this process should be maximal, with respect to human abilities on speaker recognition tasks [13].

The paper is organized as follows: Section 2 presents the human-assisted speaker diarization system. Section 3 presents the corpus and the metrics, whereas Section 4 analyzes the results. Section 5 concludes with a discussion of possible directions for future work.

2 Human-assisted speaker diarization system

The proposed speaker diarization prototype is aimed at interacting in real time with a human user, in charge of correcting the predictions of the system. This system is aimed at producing high-quality diarization annotations at a minimal human cost. Such a system could be used to speed up the annotation process of any speech corpus requiring temporal speaker information.

2.1 System overview

In the following description, we assume that an easy-to-use interface is provided to the user, and that the speech segments are presented together with the speech transcription. We also assume that the feedback of the user is limited to three actions:

1. The validation, when the speech segment has a correct speaker label;
2. The speaker label modification, when the speech segment has an incorrect speaker label;
3. The speaker label creation, for speakers encountered for the first time in the recording.

Actions such as speech segment splits, or speech segment boundary modifications, are not taken into account in the scope of this paper.

Annotated speech segments corresponding to the whole recording are presented to the annotator. The segment presentation order follows the temporal occurrence of the segments. This choice has been made in order to ease the manual speaker recognition task, under the assumption that the media chronology provides the annotator with a better understanding of the speech material. The annotator has to correct, or validate, the predictions of the diarization system. Our working paradigm is that a correction requires more time from the annotator than a validation.

Figure 1 describes the proposed active-learning system. The system associates each annotator correction with a real-time re-estimation of the labels of the remaining speech segments to be presented. This method is aimed at improving the quality of the next diarization predictions, resulting in a lower amount of corrections to be done by the annotator, thus lowering the time required for the manual correction. The system is composed of three main steps, which will be detailed in the next sections. The two last steps are repeated until all the segments are checked. Let us give a brief description of these stages:

Initialization: an initial diarization is performed with a fully-automatic speaker diarization system. This step can be time-consuming and is performed offline.

User input: the annotator checks each segment, and validates or corrects the speaker label before inspecting the next segment.

Real-time reassignment: the annotator modifications are associated with a re-evaluation of the speaker labels corresponding to the next speech segments to be presented. The computations realized during this step should be fast enough to allow real-time interaction with a human user.

Figure 1: Active-learning system (ground truth segmentation; stage 1: initialization, speaker diarization by BIC clustering; stage 2: user input, validate or correct speaker label; stage 3: real-time reassignment, re-evaluation of speaker labels; output: correct speaker diarization)

2.2 Initialization: speaker diarization

The speaker diarization system is inspired by the system described in [2]. It was developed for transcription and diarization tasks, with the goal of minimizing both the word error rate and the speaker error rate. It rests upon a segmentation and a hierarchical agglomerative clustering, and uses MFCC features as audio descriptors [2, 7, 17].

The system is composed of a segmentation step followed by a clustering step. Speaker diarization needs to produce homogeneous speech segments. Errors such as having two distinct clusters (i.e., detected speakers) corresponding to the same real speaker can easily be corrected by merging both clusters. In this article, we focus the study on the clustering step, and the segmentation step is based on a perfect manual segmentation (ground truth).

The clustering algorithm is based upon a hierarchical agglomerative clustering. The initial set of clusters is composed of one segment per cluster. Each cluster is modeled by a Gaussian with a full covariance matrix. The BIC measure (cf. equation 1) is employed to select the candidate clusters to group, as well as to stop the merging process. The two closest clusters i and j are merged at each iteration, until BIC(i, j) > 0.

Let |\Sigma_i|, |\Sigma_j| and |\Sigma| be the determinants of the Gaussians associated with the clusters i, j and i+j, and let \lambda be a parameter to set up. The penalty factor P (eq. 2) depends on d, the dimension of the features, as well as on n_i and n_j, referring to the total length of cluster i and cluster j respectively. The BIC(i, j) measure between the clusters i and j is then defined as follows:

    BIC(i, j) = \frac{n_i + n_j}{2} \log|\Sigma| - \frac{n_i}{2} \log|\Sigma_i| - \frac{n_j}{2} \log|\Sigma_j| - \lambda P,    (1)

    with P = \frac{1}{2} \left( d + \frac{d(d+1)}{2} \right) \log(n_i + n_j).    (2)

This speaker diarization system is the first stage of most state-of-the-art systems for TV or radio recordings, such as those based on GMMs or i-vectors [1, 8]. GMMs and i-vectors are both statistical models which represent audio data. The generated clusters have a high purity (i.e. each cluster mostly contains a single speaker) and the system is fast.

2.3 User input and real-time reassignment

User input consists in validating, or correcting, the speaker labels estimated by the diarization system. The proposed active-learning strategy consists in associating each correction, defined as a mismatch between the speakers C_i and C_j, with the computation of new speaker models, trained on the validated speech segments. The resulting models are based on a single Gaussian, which is fast to compute, and assumed to be more accurate than the models inferred during the initialization. These simple speaker models are then used to re-estimate the BIC distance with the remaining speech segments involved in the last mismatch (segments attributed to C_i and C_j only).

Figure 2: Example of user input and reassignment

An illustration of these interactions is provided in figure 2. In this example, four speakers (A, B, C, D) have been inferred through the automatic initialization step. The user has manually validated the first four speech segments (S1...S4) before reporting a speaker label modification for segment S5, tagged as speaker B instead of speaker A. The resulting action of the active-learning system consists in creating speaker models for the mismatching speakers only (A and B). These models are used to re-estimate the labels of the remaining segments tagged with A or B (segments S7, S8 and S11), and may lead to a speaker label modification (segment S11). Remaining speech segments tagged with other labels (C and D) are not re-estimated. The modified diarization is updated before the annotator moves to the next segment S6. The process iterates until the last segment is reached.

3 Evaluation

3.1 Corpus

Experiments were performed on TV recordings drawn from the corpora of the ANR-REPERE challenge (http://www.defi-repere.fr/), organized by the LNE (French national laboratory of metrology and testing) and ELDA (Evaluations and Language resources Distribution Agency) in 2010-2014. This challenge is a project in the area of multimedia recognition of people in television documents. The aim is to find the identities of the people who speak, together with the quoted and written names, at each instant of a television show. The shows were recorded from two French digital terrestrial television channels (BFM and LCP).

The ANR-REPERE project started in 2010, and evaluations were set up in 2013 and 2014. In this paper, we merge the 2013 and the 2014 evaluation corpora to build the corpus called REPERE in the sections below. Table 1 gives some statistics about this corpus. The durations reported in table 1 show that only a part of the data is annotated and evaluated.

Table 1: 2013-2014 news and debate TV recordings from the REPERE corpus.

    Statistics        REPERE
    Show number       15
    Recording number  90
    Recording time    34h30
    Annotation time   13h11
    Speaker number    571

Current diarization systems are less efficient on the spontaneous speech mainly present in debates than on the prepared speech of news shows [6]. We have chosen this corpus because of the variety of its shows: the corpus is balanced between prepared and spontaneous speech, and composed of street interviews, debates and news shows.

It is common to accept a ±250 millisecond tolerance on segment boundaries for recordings with prepared speech, and far less for recordings with spontaneous speech. Since we use a reference segmentation for the segmentation step, we normally have no segmentation errors. Therefore, we do not use any tolerance on segment boundaries.

Most diarization systems are not able to detect overlapping speech zones [4, 16, 24]. In the experiments described below, we remove overlapping speech from the evaluation and consider it as a non-speech area. Figure 3 shows the segment durations after the overlapping speech deletion.

Figure 3: Segment durations of the REPERE corpus (histogram of segment durations in seconds, log10 scale)

3.2 Metrics

3.2.1 Diarization

The metric used to measure performance in the speaker diarization task is the Diarization Error Rate (DER) [18]. The DER was introduced by NIST as the fraction of speaking time which is not attributed to the correct speaker, using the best matching between reference and hypothesis speaker labels. The scoring tool is available in the sidekit/s4d toolkit [14].

In order to evaluate the impact of a reassignment, we use the percentage of pure clusters with respect to the total number of clusters. We also use the well-known purity, as defined in [9], which is the ratio between the number of frames of the dominating speaker in a cluster and the total number of frames in this cluster. This measure is used to evaluate the purity of hypothesis clusters according to the assignment provided by the reference clusters. To evaluate the actions applied by a human, we simply use counters, reported as percentages in this paper.

3.2.2 Keystroke Saving Rate

The DER and the purity measure the quality of a diarization. The evaluation of the user input is difficult, as the proposed metric needs to be as reproducible and objective as possible [10]. In our case, the human interactions are simulated.

The proposed method is inspired by a previous work on computer-assisted transcription [15], in which the authors evaluate the human interactions with the Keystroke Saving Rate (KSR) [23]. The KSR was developed for AAC (Augmentative and Alternative Communication) systems designed for handicapped persons. It is computed from the number of keyboard strokes made by the user to write a message. In our case, the strokes correspond to the number of actions made by the annotator to correct the diarization. To compute the KSR, we assume that the annotator always chooses the best strategy to minimize the number of actions.

We suppose here that the annotator can make two kinds of actions on the current segment: the reassignment to another cluster (reassignment) or the assignment to a new cluster (creation). The annotator creates a new cluster when the first segment of a given speaker is checked. The number of creations in the whole document, denoted by n_c, is constant for any reassignment, even if the threshold \lambda in equation 1 differs.
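As a minimal sketch of how these action counters yield the KSR (the ratio of creations plus reassignments to the number of segments, times 100), the following Python function counts the actions of a simulated annotator. The function and variable names are ours, not the paper's, and we assume per-segment label lists in which the hypothesis labels have already been mapped onto the reference label set:

```python
def keystroke_saving_rate(hyp_labels, ref_labels):
    """Count the annotator actions needed to correct a diarization.

    hyp_labels: per-segment speaker labels from the diarization system,
                assumed to be already mapped onto the reference label set.
    ref_labels: per-segment ground-truth speaker labels.
    Returns KSR = (n_c + n_r) / n_s * 100.
    """
    assert len(hyp_labels) == len(ref_labels) and ref_labels
    n_s = len(ref_labels)   # number of segments in the initial diarization
    seen = set()            # reference speakers already encountered
    n_c = 0                 # creations: one per speaker, at its first segment
    n_r = 0                 # reassignments to an already-existing cluster
    for hyp, ref in zip(hyp_labels, ref_labels):
        if ref not in seen:      # first segment of this speaker: creation
            seen.add(ref)
            n_c += 1
        elif hyp != ref:         # wrong label on a known speaker: reassignment
            n_r += 1
        # otherwise the segment only needs a validation (no action counted)
    return 100.0 * (n_c + n_r) / n_s
```

This is only one possible counting convention: it keeps n_c constant (one creation per distinct speaker, whatever the clustering threshold), and the paper's simulated annotator, which always picks the cheapest strategy, may handle creations differently.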
Similarly, the total number of segments reassigned by the user is denoted by n_r, and the total number of segments by n_s. We define the KSR as the ratio between the sum of the number of created clusters n_c and the number of reassigned segments n_r, and the number of segments in the initial diarization (equation 3):

    KSR = \frac{n_c + n_r}{n_s} \times 100.    (3)

A KSR equal to 0% corresponds to a perfect speaker diarization, in which each segment is assigned to the true corresponding speaker. In this case, the annotator does not reassign any segment. Conversely, a KSR equal to 100% corresponds to the worst speaker diarization, in which each segment is assigned to the wrong speaker. In that case, the annotator needs to change the assignment of all the segments, if the corrections are not gradually propagated in the rest of the document.

4 Results

4.1 Speaker diarization

The speaker diarization is based on a hierarchical clustering where each speaker is modeled by a Gaussian with a full covariance matrix computed over acoustic features. The acoustic features are composed of 12 MFCCs plus energy, and are not normalized (the background channel helps to segment and cluster the speakers) [2, 7, 17].

As mentioned previously, we use the ground truth segmentation as input of the clustering algorithm (corresponding to stage 1 in figure 1), and the overlapping speaker segments are removed from the ground truth.

Figure 4 shows the DER of the speaker diarization for different thresholds \lambda (cf. equation 1). The lowest DER, 9.9%, is obtained for a threshold of 4.0. Compared to the literature [8], this DER is rather low, which is mainly due to the ground truth segmentation: the segments contain the voice of a single speaker, overlapping segments are removed, and there are no missed speech and no false alarm speech segments.

Figure 4: Initial diarization: DER, percentage of pure clusters and average cluster purity

4.2 Active-learning system

In our experiments, the human annotator is simulated with the ground truth speaker annotations. The main objective is to decrease the number of actions performed by an annotator to obtain a perfect diarization. To reach this goal, we compare the KSR obtained with or without the human corrections taken into consideration (i.e. with or without an active-learning reassignment), using various thresholds \lambda for the speaker diarization. The real-time segment reassignment stage (stage 3 in figure 1) uses the same parameters as the initial diarization (12 MFCCs plus energy, full covariance Gaussian and the BIC measure) to label the unchecked segments.

Figure 5 gives the KSR of the system with real-time reassignment (including stages 2 & 3) and of the system without real-time reassignment (including stage 2 only). The KSR decreases until \lambda = 3.5 in both systems, and increases for larger values of \lambda. The KSR is 56.5% without reassignment and 49.7% with reassignment when \lambda is equal to 3.5. About half of the segments (49.7%) are manually corrected, and the 6.8% difference in absolute value corresponds to segments automatically reassigned to the correct speaker after a user correction.

Figure 5: KSR

In the most favorable case, when \lambda is 3.5, the DER is low, about 10%, and the average cluster purity is equal to 90%. At the same time, only 60% of the clusters are 100% pure (cf. figure 4). The difference between these indicators can be explained by the fact that, unlike the DER, the KSR does not take into account the duration of the segments. Most of the errors come from the small segments, and these are numerous (cf. figure 3).

The KSR remains almost static when \lambda is greater than 3.5 in the system with reassignment, whereas the choice of the \lambda parameter is more critical to minimize the number of actions in the system without reassignment. Finally, one can notice that the system with reassignment always obtains a lower KSR, whatever the \lambda value, except for \lambda = 0, where the KSR is equal to 100% in both cases.

After each user correction, the unchecked segments are clustered again in the reassignment stage. The process is generally fast: the reassignment takes less than 0.03 second in 95% of the cases, so this stage can be performed in real time without any impact on the user interface (figure 6).

Figure 6: Time of reassignment after each user correction (histogram of reassignment durations in seconds, log10 scale)

5 Conclusion & prospects

In this paper, we attempt to find a way to help a human to segment and cluster the speakers in an audio or audio-visual document. We propose a method that takes the annotator corrections into consideration by modifying the allocation of the unchecked segments. The proposed computer-assisted method allows us to obtain a noticeable reduction in the number of required corrections. Not only is our method effective, but the corrections are also made quickly. Thanks to its fast processing, it could be applied in a real application without impacting the reactivity of the interface and without increasing the work intensity of the annotator.

Some future improvements should be made on the basis of this preliminary work. Firstly, we plan to minimize the number of user actions by applying a constrained clustering to reassign all unchecked segments and to create or delete clusters. Another improvement would be to integrate the automatic segmentation into the correction process.

6 Acknowledgments

This research was partially supported by the European Commission, as part of the Event Understanding through Multimodal Social Stream Interpretation (EUMSSI) project (contract number FP7-ICT-2013-10), in which the LIUM is involved.

References

[1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, 'Speaker diarization: A review of recent research', IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 356-370, (Feb 2012).
[2] C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, 'Multi-stage speaker diarization of broadcast news', IEEE Transactions on Audio, Speech and Language Processing, 14(5), 1505-1512, (2006).
[3] Thierry Bazillon, Yannick Estève, and Daniel Luzzati, 'Transcription manuelle vs assistée de la parole préparée et spontanée', Revue TAL, (2008).
[4] Kofi Boakye, Beatriz Trueba-Hornero, Oriol Vinyals, and Gerald Friedland, 'Overlapped speech detection for improved speaker diarization in multiparty meetings', in Acoustics, Speech and Signal Processing (ICASSP), 2008 IEEE International Conference on, pp. 4353-4356. IEEE, (2008).
[5] Mbarek Charhad, Daniel Moraru, Stéphane Ayache, and Georges Quénot, 'Speaker identity indexing in audio-visual documents', in Content-Based Multimedia Indexing (CBMI 2005), (2005).
[6] Richard Dufour, Vincent Jousse, Yannick Estève, Frédéric Béchet, and Georges Linarès, 'Spontaneous speech characterization and detection in large audio database', SPECOM, St. Petersburg, (2009).
[7] Grégor Dupuy, Les collections volumineuses de documents audiovisuels: segmentation et regroupement en locuteurs, Ph.D. dissertation, Université du Maine, 2015.
[8] Grégor Dupuy, Sylvain Meignier, Paul Deléglise, and Yannick Estève, 'Recent improvements on ILP-based clustering for broadcast news speaker diarization', in Proc. Odyssey Workshop, (2014).
[9] Jean-Luc Gauvain, Lori Lamel, and Gilles Adda, 'Partitioning and transcription of broadcast news data', in ICSLP, volume 98, pp. 1335-1338, (1998).
[10] Edouard Geoffrois, 'Evaluating interactive system adaptation', in The International Conference on Language Resources and Evaluation, (2016).
[11] Jerry Goldman, Steve Renals, Steven Bird, Franciska De Jong, Marcello Federico, Carl Fleischhauer, Mark Kornbluh, Lori Lamel, Douglas W. Oard, Claire Stewart, et al., 'Accessing the spoken word', International Journal on Digital Libraries, 5(4), 287-298, (2005).
[12] Nicolas Hervé, Pierre Letessier, Mathieu Derval, and Hakim Nabi, 'Amalia.js: An open-source metadata driven HTML5 multimedia player', in Proceedings of the 23rd Annual ACM Conference on Multimedia, MM '15, pp. 709-712, New York, NY, USA, (2015). ACM.
[13] Juliette Kahn, Parole de locuteur: performance et confiance en identification biométrique vocale, Ph.D. dissertation, Avignon, 2011.
[14] Anthony Larcher, Kong Aik Lee, and Sylvain Meignier, 'An extensible speaker identification sidekit in Python', in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5095-5099. IEEE, (2016).
[15] Antoine Laurent, Sylvain Meignier, Teva Merlin, and Paul Deléglise, 'Computer-assisted transcription of speech based on confusion network reordering', in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 4884-4887. IEEE, (2011).
[16] Xavier Anguera Miró, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, 'Speaker diarization: A review of recent research', Audio, Speech, and Language Processing, IEEE Transactions on, 20(2), 356-370, (2012).
[17] Lindasalwa Muda, Mumtaj Begam, and I. Elamvazuthi, 'Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques', arXiv preprint arXiv:1003.4083, (2010).
[18] NIST. The rich transcription spring 2003 (RT-03S) evaluation plan. http://www.itl.nist.gov/iad/mig/tests/rt/2003-spring/docs/rt03-spring-eval-plan-v4.pdf, February 2003.
[19] Roeland Ordelman, Franciska De Jong, and Martha Larson, 'Enhanced multimedia content access and exploitation using semantic speech retrieval', in Semantic Computing, 2009. ICSC '09. IEEE International Conference on, pp. 521-528. IEEE, (2009).
[20] Julien Pinquier and Régine André-Obrecht, 'Audio indexing: primary components retrieval', Multimedia Tools and Applications, 30(3), 313-330, (2006).
[21] Sue E. Tranter and Douglas A. Reynolds, 'An overview of automatic speaker diarization systems', IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1557-1565, (2006).
[22] Félicien Vallet, Jim Uro, Jérémy Andriamakaoly, Hakim Nabi, Mathieu Derval, and Jean Carrive, 'Speech Trax: A bottom to the top approach for speaker tracking and indexing in an archiving context', in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA), (2016).
[23] Matthew E. J. Wood and Eric Lewis, 'Windmill: the use of a parsing algorithm to produce predictions for disabled persons', Proceedings of the Institute of Acoustics, 18, 315-322, (1996).
[24] Martin Zelenák and Javier Hernando, 'The detection of overlapping speech with prosodic features for speaker diarization', in INTERSPEECH, pp. 1041-1044, (2011).