1. Introduction

10.18653/v1/2024.naacl-long.239

Modelling filled particles and prolongation using end-to-end Automatic Speech Recognition systems: a quantitative and qualitative analysis.

Vincenzo Norman Vitale

Loredana Schettino

Francesco Cutugno

1 0 Free University of Bozen-Bolzano , Bozen , Italy 1 University of Naples Federico II , Naples , Italy

2024

1 20 23

State-of-the-art automatic speech recognition systems based on End-to-End models (E2E-ASRs) achieve remarkable performances. However, phenomena that characterize spoken language such as fillers (<eeh> <ehm>) or segmental prolongations (the<ee>) are still mostly considered as disrupting objects that should not be included to obtain optimal transcriptions, despite their acknowledged regularity and communicative value. A recent study showed that two types of pre-trained systems with the same Conformer-based encoding architecture but diferent decoders - a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder - tend to model some speech features that are functional for the identification of filled pauses and prolongation in speech. This work builds upon these findings by investigating which of the two systems is better at fillers and prolongations detection tasks and by conducting an error analysis to deepen our understanding of how these systems work.

eol>disfluences speech recognition probing interpretability explainability

1. Introduction

In recent works on Automatic Speech Recognition (ASR) systems based on the computing power of Deep Neural Networks (DNN), a great deal of efort is focused on incrementing the systems’ performances by employing increasingly complex, hence hardly interpretable, DNN models that require huge amounts of data for the training, like End-to-End Automatic Speech Recognition (E2EASR) models which represent the state-of-the-art. An E2E-ASR model directly converts a sequence of input acoustic feature vectors (or possibly raw audio samples) into a series of graphemes or words that represent the Figure 1: E2E ASRs are based on an encoder-decoder architranscription of the audio signal [1], as represented in tecture. The speech signal is fed to the encoder, producing an ifgure 1. In contrast, traditional ASR systems typically encoded representation that contains the information needed train the acoustic, pronunciation, and language mod- sbuybtwheorddescaondderbtuoilpdrothveidterathnescsreipqtuioennc.e of words/characters/els separately, requiring distinct modelling and training for each component. These systems usually aim to ob- that state-of-the-art systems based on End-to-End modtain speech transcriptions ‘cleaned’from phenomena that els (E2E-ASRs) can model linguistic and acoustic features characterise spoken language such as discourse mark- of spoken language, which can be investigated to explain ers, particles, pauses, or other phenomena commonly their internal dynamics. Several probing techniques have referred to as ‘disfluencies’. Studies on the interpretabil- been designed to inspect and better understand the inity of the dynamics underlying neural models showed ternal behavior of DNN layers at diferent depths. With these techniques, investigations on the internals of DeepCLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Speech2 [2, 3] revealed the influence of diatopic pronunci*DCecor0r4es—po0n6,d2in02g4a, uPtihsao,r.Italy ation variation in various English varieties and provided † These authors contributed equally. evidence that intermediate layers contain information $ vincenzonorman.vitale@unina.it (V. N. Vitale); crucial for their classification. Later, a study [ 4] on the lschettino@unibz.it (L. Schettino); cutugno@unina.it (F. Cutugno) layerwise capacity to encode information about acoustic 0000-0002-0365-8575 (V. N. Vitale); 0000-0002-3788-3754 features, phone identity, word identity, and word mean(L. Schettino); 0000-0001-9457-6243 (F. Cutugno) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License ing based on the context of occurrence highlighted that Attribution 4.0 International (CC BY 4.0). the last layer right before the decoding module retains in- ers of the Neapolitan variety of Italian, the speakers from formation about word meaning information, rather than [15] come from diferent Italian regions. local acoustic features and phone identity information More specifically, the considered speech data include: that are captured by the first layers and intermediate audio-visual recordings of guided tours at San Martino layers respectively. Then, other studies have further in- Charterhouse (in Naples) led by three female expert vestigated the capacity of state-of-the-art models to en- guides (CHROME corpus [14]), which consists of inforcode phonetic/phonemic information[5, 6], lexical tone mative semi-monologic, semi-spontaneous speech char[7] and gender [8]. Finally, [9] investigated the inter- acterized by a high degree of discourse planning and an nal dynamics of three pre-trained E2E-ASRs evidencing asymmetrical relationship between the speakers; audiothe emergence of syllable-related features by training visual recordings of 10 speakers narrating ‘Frog Stoan acoustic-syllable boundary detector. Following this ries’from a picture book [15], which elicited unplanned line of research, a recent study [10] investigated the abil- descriptive speech; four task-oriented dialogues from the ity of two types of pre-trained systems with the same CLIPS corpus [16], which provides mainly descriptive Conformer-based encoding architecture but diferent de- semi-spontaneous speech characterized by a low degree coders – a Connectionist Temporal Classification (CTC) of discourse planning and a high degree of collaboration decoder and a Transducer decoder – to model features between the interlocutors. that distinguish filled pauses and prolongations in speech and showed that, despite not being originally trained to 2.2. Annotation detect disfluencies, these systems tend to model some speech features that are functional for their identifica- Filled Pauses (FPs), defined as non-verbal fillers realized tion. Rather than disregarding the ability of E2E-ASRs as vocalization and/or nasalization, and Prolongations to model the acoustic information tied to such speech (PRLs), defined as marked lengthening of segmental matephenomena as a dispensable noise source, it could be rial [17, 18] were manually annotated along with pauses, exploited to achieve diferent ends. On the one hand, it lexical fillers, repetitions, deletions, insertions, and subcould be used to obtain more accurate transcriptions that stitutions following the annotation scheme described provide better, or rather more faithful, representations in [19]. This is a multilevel annotation system develof the speech signal, which would also support linguis- oped to account for both formal and functional features tic annotation processes. On the other hand, exploring of phenomena used to manage the own speech producthe systems’ modelling ability leads to deepening our tion. The identification of diferent types of phenomena understanding of their underlying dynamics. In the last was based on a ‘pragmatic approach’[20], which means 20 years, disfluency detection tasks have been conducted that it did not rely on absolute measures but on percepto improve speech recognition performances [11, 12] and tual judgments given the specific contexts of occurrence. diferent recent approaches to filler detection achieve The reliability of the annotation and the Inter-Annotator rather high performances, see [13]. However, these in- Agreement was evaluated by measuring Cohen’s . It vestigations mostly concern filler particles and, to our yielded 0.92 for dialogic data and 0.82 for monologic data, knowledge, no such system has been tested on Italian which stands for ‘high agreement’[21]. data so far. The proposed work aims to build upon these ifndings by investigating which of the two decoding sys- 2.3. Data Preparation tems is better at performing a detection task for fillers and prolongations. Moreover, a quantitative and qualitative error analysis is conducted to deepen our understanding of the way these systems work.

The considered dataset has been prepared based on a set

of praat TextGrid annotation files indicating the speaker and the type of disfluency according to the speech signal.

More specifically, considering only the PRLs and the FPs, the resulting dataset has a dimension of 1900 segments. 2. Materials and Method For each segment, the contextual information preceding and following the disfluency phenomenon has been con2.1. Data sidered, giving each segment a length of 4 seconds. Then, In this study, we employed approximately 210 minutes based on the combination of the so-composed dataset of expert annotated speech respectively divided into ∼ with each of the considered pre-trained models’ encoders 80 minutes of informative speech [14], 90 minutes of de- (details reported in Section 3.1), for each combination of scriptive speech [15] and approximately 40 minutes of segment and on each intermediate encoding layer the dialogic speech [16], that is dyads where two speakers following elements were extracted: recorded on diferent channels interact. While the data from [14] and [16] consists of speech produced by speak• A sequence of intermediate layer emissions/embedding representing the input segment in the layer’s (a) Average Dynamic time warping distance measured between sequences of labels with standard error (shade).

(b) Average Weighted F1 measure measured between sequences of labels with standard error (shade).

3. Results vectorial space. Each emission in the sequence represents a portion of 40 milliseconds of the input signal due to the considered model’s charac- 3.1. Disfluency Identification Through teristics. Model Probing • A sequence of labels associated with each sequence of emissions, indicating whether an intermediate emission belongs to a particular class of disfluencies (1 for FP and 2 for PRL) or not (label 0 if the segment does not belong to a disfluency).

Building upon recent studies that make use of probes to

better understand the internal behavior of pre-trained E2E-ASR models’[9, 4, 3], we apply a similar approach to investigate if and to which extent a pre-trained model () can codify disfluencies-related features in the encoding

The resulting dataset consists of pairs of sequences module, even if they are not trained to do so. The emof emissions (i.e., distilled features) and correspond- ployed approach is aimed at building specific classifiers ing labels identified by the model and the layer from whose inputs are represented by intermediate emissions which they were extracted. Note that each sequence of the considered model’s encoder layers (), combined of intermediate layer emissions has a length ℎ = with the appropriate sequence of labels based on dataset 4/40, as it represents the tempo- annotation. Internally, each classifier consists of a Long ral succession of segments before, during, and after dis- Short Term Memory (LSTM) module followed by a Feed lfuency phenomena. We use the term emission [10, 9] to Forward Neural Network (FFNN). Given that our probindicate intermediate layer neurons fire, instead of the lem can be related to sequence classification, the LSTMs more commonly used term embedding [8], as the latter seem to be the most naturally suited model [22]; usually, is widely used to indicate the output of an entire module an LSTM consists of one computational unit that iterarather than a layer. tively processes all input time series vectors. This unit (a) CTC-based classifier with hidden size 640 trained on distilled features from layer 18 (index 17 in F1,DTW plots). (b) RNN-T-based classifier with hidden size 640 trained on distilled features from layer 16 (index 15 in F1,DTW plots). comprises three gates processing one vector at a time and considered. The RNN-T is an auto-regressive speech trancombining it with information extracted from previous scription technique that overcomes CTC’s limitations, vectors. One of the most crucial parameters for an LSTM being non-auto-regressive and subject to limited label seis the hidden layer, therefore we investigate the impact of quence length. The Transducer decoding technique can three diferent layer sizes (hidden-layer size, ), namely produce label-transcription sequences longer than the 160, 320 and 640. So, an LSTM-based classifier processes input vector sequence and models inter-dependency in a sequence of {,} emission vectors (each of length long-term transcription elements. A Transducer typically ℎ) and produces a new sequence of vectors with size . comprises two sub-modules: one that forecasts the next The two sequences are aligned over time. At each time transcription label based on the previous transcriptions step , the FFNN produces a label indicating whether the (prediction network) and the other that combines the considered input represents a specific disfluency segment encoder and prediction-network outputs to produce a (label 1 for filled pause or 2 for prolongation) or not (with new transcription label (joiner network). These features label 0) based on the LSTM hidden-layer output. In sum- improve transcription speed and performance compared mary, we train and evaluate many diferent LSTM-based to CTC while requiring more training and computational disfluencies classifiers/detectors ( ,,) for all possible resources [26]. Note that both pre-trained models rely , , and combinations to search for the evidence of on the same encoder architecture, but the Conformerdisfluencies-related properties in the models’ decisions. CTC model has 18 encoding layers, while the Conformer

The goal is to explore which of the considered pre- Transducer encoder has 17 layers. trained E2E ASR models, based on diferent decoding In this study, ∼ 100 classifiers (2 models * ∼ 17 laysystems, better encodes characteristics associated with ers * 3 classifier sizes) were trained to investigate which disfluent speech segments to perform a fillers and prolon- of the considered pre-trained models, difering only by gations detection task. To this end, two publicly available the decoding approach, encodes enough information to [23] Conformer-based models [24] with 120 million pa- perform a disfluency detection task. rameters each, built with the NVIDIA Nemo toolkit and To evaluate the alignment between the output of the difering only in the decoding strategy, were selected. On classifier and the reference label sequence we employ the the one hand, a Conformer-based model with a Connec- Dynamic Time Warping Distance (DTW distance) [27], tionist Temporal Classification (CTC) [ 25] decoder has reported in figure 2a. The DTW results highlight that laybeen considered, as the CTC is one of the most popular ers closer to the decoding module seem to contain most decoding techniques. Such a decoding technique is a non- of the information needed to perform a correct detecauto-regressive speech transcription technique that col- tion of the considered disfluencies, obtaining an average lapses consecutive, all-equal, transcription labels (char- DTW distance of approximately 1.39 in all the cases, with acter, word piece, etc.) to one label unless a special label a considerably low standard error. Then, to evaluate the separates these. The result is a sequence of labels shorter capability of each classifier to provide a correct as well or equal to the input vector sequence length. Being non- as aligned labels sequence, we employed the weighted F1 auto-regressive, it is also considered computationally ef- measure, reported in figure 2b. Also in this case, F1 results fective as it requires less time and resources for training confirm that layers closer to the decoding module seem and inference phases. On the other hand, a Conformer- to be those containing most of the information needed to based model with the Recurrent Neural Network Trans- correctly identify the disfluency segment. The combinaducer (RNN-T), commonly known as Transducer has been tion of F1 and DTW provides an integrated perspective (a) (c) (b) (d) on the system’s ability to classify and align segments FP than for PRL phenomena. Focusing on the PRL incorrectly. Finally, in Figure 3 (a and b), we report the con- stances, a negative correlation is observed between the F1 fusion matrix of the best classifiers obtained from each weighted scores and PRLs’ duration (CTC non-recognized considered model. On the one side, the CTC seems to r = - 0.91, figure 4c; RNN Transducer non-recognized r = be better at discriminating non-disfluent segments (ND), - 0.87, figure 4d). while showing the worst performance in disfluency iden- The error analysis was supported by an auditory intification. On the other side, the RNN-T-based classifier spection of the unrecognized and misclassified samples shows considerable performance at identifying FPs and filtered based on the average DTW distance, namely, is the worst in discriminating ND segments, while PRL 1.39 for the Transducer-based and 1.40 for the CTCperformance is comparable to the CTC classifier. Both based classifier. Issues in PRL recognition mostly conmatrices highlight that the most dificult disfluency phe- cerned shorter instances, those characterized by peculiar nomena to classify are prolongations, which is the focus ‘non-prototypical’phonation features (such as unsteady, of our preliminary exploratory error analysis. creaky phonation) and the alignment of PRL-predicted occurrences. Also, several PRL phenomena were misclas3.2. Qualitative Analysis sified as FP when occurring with monosyllabic words, such as ‘o<oo>’, ‘un po<oo>’, ‘che<ee>’, ‘e<ee>’. In fact, the phonetic realization of these instances is closer to the ones that characterize FP for their vowel quality and as being, to a certain extent, independent elements from the phonetic environment The qualitative analysis is based on the best classifier for each of the considered models used to generate the distilled features. In particular, for the CTC version, the best classifier resulted in the one with 640 hidden neurons trained on 18-th layer features. Among the transducerbased versions, the one with 640 hidden neurons trained on 17-th layer features emerged as the best version.

The visual inspection of the distribution of the considered phenomena highlights that for both the CTC (4a) and the RNN Transducer classifiers (4b), FP phenomena concentrate on higher F1 weighted values, whereas wider distributions are observed for PRL phenomena, which shows that both classifiers work better when dealing with

4. Discussion and Conclusions

In this work, we build upon a previous study that investigated to what extent modern ASR E2Es encode features related to disfluency phenomena, even if they are not directly trained to do so. We showed that pre-trained models with the same audio encoder but with two diferent state-of-the-art decoding strategies (CTC and Transducer) capture disfluency-related features, especially in These findings could be used to improve transcription the latest encoding layer, and both model features that applications by enriching them with disfluency annocan be used for the identification and positioning of dis- tation (including filler particles and prolongation phelfuent speech segments [ 10]. Although there seems to be nomena), which are still rather costly processes for studa tendency to forget this information with subsequent ies concerning hesitation phenomena and (own) speech layers, as the trends for DTW (figure 2a) and F1-measure management in typical as well as atypical speech (e.g., (figure 2b) would suggest, the last layers, which are those pathological or language learners’ speech. Indeed, an closest to the objective function represented by the de- immediate development of the described work consists coding module, seem the most prone to retain character- of increasing the capabilities of the pre-trained E2E-ASRs istics useful to locate and identify disfluency phenomena. by adding a simple disfluency identification module to Interestingly, despite the diferences between the two complement the existing decoder, thus enriching the redecoding modules which are respectively non-recurrent sulting transcriptions. (CTC) and recurrent (RNN-T), the performances for the Our work is built upon unidirectional LSTMs rather chosen task are comparable. However, the confusion ma- than on bidirectional LSTMs (BiLSTMs), which provide trices highlight that the CTC-based classifier performs better performance because the latter have slightly longer better in the disfluency feature discrimination task, while inference times and require a larger amount of data, rethe Transducer-based classifier more precisely identi- sources, time to be trained and, most importantly, present ifes filled pauses, which could be related to the scope a more complex behaviour [29]. However, the introduc(recurrent/non-recurrent) of the objective function. The tion of diferent architecture modules like bidirectional results align with the literature that shows a strong sen- LSTM could improve the detection of prolongation disflusitivity to features concerning words and phone of the encies. This will be part of future developments focused layers closest to the encoder[4], while the layers clos- on performance and increased neural network complexest to the input are more sensitive to features related ity. to accent and local acoustic characteristics [3, 4]. It is worth noticing that, in a recent work [9], sensitivity to syllabic boundaries was found in layers 3-5, with a pat- References tern similar to the one shown in Figure 2 but without the peak in the last layers. The reason can be found in [1] S. Wang, G. Li, Overview of end-to-end speech the fact that syllables and their boundaries do not have a recognition, in: Journal of Physics: Conference Segraphic distinction in the transcriptions, conversely, in ries, volume 1187, IOP Publishing, 2019, p. 052068. the case of disfluencies, there is a form of transcription [2] T. Viglino, P. Motlicek, M. Cernak, End-to-end that identifies them within a language model. accented speech recognition., in: Interspeech, 2019,

The exploratory analysis of the errors highlighted pp. 2140–2144. that prolongations are more dificult to detect than filled [3] A. Prasad, P. Jyothi, How accents confound: Probpauses, which could depend on their being an integral ing for accent information in end-to-end speech (though lengthened) part of ‘fluent’words while filled recognition systems, in: Proceedings of the 58th pauses are mostly realized as independent elements. Also, Annual Meeting of the Association for Computainstances of prolongation are mostly non-recognized or tional Linguistics, 2020, pp. 3739–3753. misclassified as filled pauses when characterized by pecu- [4] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analyliar ‘non-prototypical’phonation features, such as creaky sis of a self-supervised speech representation model, phonations, or filler-like features, as in the case of mono- in: 2021 IEEE Automatic Speech Recognition and syllabic word-final prolongations. Also, previous studies Understanding Workshop (ASRU), IEEE, 2021, pp. on the segmental quality of prolongations in Italian [28] 914–921. showed that prolongations, especially when concerning [5] P. C. English, J. Kelleher, J. Carson-Berndsen, consonantal sounds, can be realised with schwa sounds Domain-informed probing of wav2vec 2.0 embedsimilar to those that characterize most filled pauses. This dings for phonetic features, in: Proceedings of the ifller-like quality could also be considered among the 19th SIGMORPHON Workshop on Computational underlying reasons for the negative correlation between Research in Phonetics, Phonology, and Morphology, the evaluation metrics of prolongations misclassification 2022, pp. 83–91. and their duration. Another possible motivation could [6] K. Martin, J. Gauthier, C. Breiss, R. Levy, Probreside in a bias in the dataset combined with the classifier ing self-supervised speech models for phonetic architecture (LSTM), which easily recognises prolonga- and phonemic information: A case study in aspirations responding to a specific length pattern. This means tion, in: INTERSPEECH 2023, 2023, pp. 251–255. that the scarcity of longer prolongations hinders their doi:10.21437/Interspeech.2023-2359. modelling leading to their misclassification. [7] G. Shen, M. Watkins, A. Alishahi, A. Bisazza,