<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2024.naacl-long.239</article-id>
      <title-group>
        <article-title>Modelling filled particles and prolongation using end-to-end Automatic Speech Recognition systems: a quantitative and qualitative analysis.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vincenzo Norman Vitale</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Loredana Schettino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Cutugno</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>Bozen</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Naples Federico II</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>20</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>State-of-the-art automatic speech recognition systems based on End-to-End models (E2E-ASRs) achieve remarkable performance. However, phenomena that characterize spoken language, such as fillers (&lt;eeh&gt;, &lt;ehm&gt;) or segmental prolongations (the&lt;ee&gt;), are still mostly treated as disrupting objects that should be excluded to obtain optimal transcriptions, despite their acknowledged regularity and communicative value. A recent study showed that two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders - a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder - tend to model some speech features that are functional for the identification of filled pauses and prolongations in speech. This work builds upon those findings by investigating which of the two systems is better at detecting fillers and prolongations, and by conducting an error analysis to deepen our understanding of how these systems work.</p>
      </abstract>
      <kwd-group>
        <kwd>disfluencies</kwd>
        <kwd>speech recognition</kwd>
        <kwd>probing</kwd>
        <kwd>interpretability</kwd>
        <kwd>explainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent work on Automatic Speech Recognition (ASR) systems based on the computing power of Deep Neural Networks (DNNs), a great deal of effort is devoted to increasing the systems' performance by employing increasingly complex, hence hardly interpretable, DNN models that require huge amounts of training data, such as End-to-End Automatic Speech Recognition (E2E-ASR) models, which represent the state of the art. An E2E-ASR model directly converts a sequence of input acoustic feature vectors (or possibly raw audio samples) into a series of graphemes or words that represent the transcription of the audio signal [1], as represented in figure 1. In contrast, traditional ASR systems typically train the acoustic, pronunciation, and language models separately, requiring distinct modelling and training for each component. These systems usually aim to obtain speech transcriptions 'cleaned' from phenomena that characterise spoken language, such as discourse markers, particles, pauses, or other phenomena commonly referred to as 'disfluencies'. Studies on the interpretability of the dynamics underlying neural models showed that state-of-the-art systems based on End-to-End models (E2E-ASRs) can model linguistic and acoustic features of spoken language, which can be investigated to explain their internal dynamics. Several probing techniques have been designed to inspect and better understand the internal behavior of DNN layers at different depths. With these techniques, investigations on the internals of DeepSpeech2 [2, 3] revealed the influence of diatopic pronunciation variation in various English varieties and provided evidence that intermediate layers contain information crucial for their classification. Later, a study [4] on the layer-wise capacity to encode information about acoustic features, phone identity, word identity, and word meaning based on the context of occurrence highlighted that the last layer right before the decoding module retains information about word meaning, rather than the local acoustic features and phone identity information that are captured by the first and intermediate layers respectively. Other studies have further investigated the capacity of state-of-the-art models to encode phonetic/phonemic information [5, 6], lexical tone [7] and gender [8]. Finally, [9] investigated the internal dynamics of three pre-trained E2E-ASRs, evidencing the emergence of syllable-related features by training an acoustic-syllable boundary detector. Following this line of research, a recent study [10] investigated the ability of two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders - a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder - to model features that distinguish filled pauses and prolongations in speech, and showed that, despite not being originally trained to detect disfluencies, these systems tend to model some speech features that are functional for their identification.</p>
      <p>[Figure 1: E2E-ASRs are based on an encoder-decoder architecture. The speech signal is fed to the encoder, producing an encoded representation that contains the information needed by the decoder to build the transcription and provide the sequence of words/characters/subwords.]</p>
      <p>Rather than disregarding the ability of E2E-ASRs to model the acoustic information tied to such speech phenomena as a dispensable noise source, it could be exploited to different ends. On the one hand, it could be used to obtain more accurate transcriptions that provide better, or rather more faithful, representations of the speech signal, which would also support linguistic annotation processes. On the other hand, exploring the systems' modelling ability deepens our understanding of their underlying dynamics. In the last 20 years, disfluency detection tasks have been conducted to improve speech recognition performance [11, 12], and different recent approaches to filler detection achieve rather high performance, see [13]. However, these investigations mostly concern filler particles and, to our knowledge, no such system has been tested on Italian data so far. The proposed work aims to build upon these findings by investigating which of the two decoding systems is better at performing a detection task for fillers and prolongations. Moreover, a quantitative and qualitative error analysis is conducted to deepen our understanding of the way these systems work.</p>
    </sec>
      <sec id="sec-1-1">
        <title>The considered dataset has been prepared based on a set</title>
        <p>of praat TextGrid annotation files indicating the speaker
and the type of disfluency according to the speech signal.</p>
        <p>More specifically, considering only the PRLs and the FPs,
the resulting dataset has a dimension of 1900 segments.
2. Materials and Method For each segment, the contextual information preceding
and following the disfluency phenomenon has been
con2.1. Data sidered, giving each segment a length of 4 seconds. Then,
In this study, we employed approximately 210 minutes based on the combination of the so-composed dataset
of expert annotated speech respectively divided into ∼ with each of the considered pre-trained models’ encoders
80 minutes of informative speech [14], 90 minutes of de- (details reported in Section 3.1), for each combination of
scriptive speech [15] and approximately 40 minutes of segment and on each intermediate encoding layer the
dialogic speech [16], that is dyads where two speakers following elements were extracted:
recorded on diferent channels interact. While the data
from [14] and [16] consists of speech produced by
speak• A sequence of intermediate layer
emissions/embedding representing the input segment in the layer’s
(a) Average Dynamic time warping distance measured between sequences of labels with standard error (shade).</p>
        <p>(b) Average Weighted F1 measure measured between sequences of labels with standard error (shade).</p>
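        <p>To make the frame-level setup concrete, the following toy sketch (ours, not the authors' code; the function name and the event-tuple format are illustrative assumptions) builds the label sequence for one 4-second segment at the 40 ms emission rate described above.</p>
        <preformat>
# Minimal sketch of the frame-level labelling described above: a
# 4-second segment yields 100 encoder emissions (one per 40 ms), and
# each emission receives label 0 (none), 1 (FP) or 2 (PRL).

FRAME_MS = 40                   # one emission covers 40 ms of signal
SEG_MS = 4000                   # each segment is 4 s long
N_FRAMES = SEG_MS // FRAME_MS   # = 100 emissions per segment

LABELS = {"FP": 1, "PRL": 2}    # 0 = non-disfluent

def frame_labels(events, seg_start_ms):
    """events: (type, start_ms, end_ms) tuples read from the Praat
    TextGrid annotation; seg_start_ms: segment onset within the file."""
    labels = [0] * N_FRAMES
    for kind, start, end in events:
        lab = LABELS.get(kind)
        if lab is None:
            continue
        # clip the event to the segment window and mark its frames
        first = max(0, (start - seg_start_ms) // FRAME_MS)
        last = min(N_FRAMES, (end - seg_start_ms + FRAME_MS - 1) // FRAME_MS)
        for t in range(first, last):
            labels[t] = lab
    return labels

# e.g. an FP from 1.00 s to 1.24 s inside a segment starting at 0 ms
print(frame_labels([("FP", 1000, 1240)], 0)[23:32])
# [0, 0, 1, 1, 1, 1, 1, 1, 0]
        </preformat>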
      </sec>
    </sec>
      <sec id="sec-1-2">
        <title>Building upon recent studies that make use of probes to</title>
        <p>better understand the internal behavior of pre-trained
E2E-ASR models’[9, 4, 3], we apply a similar approach to
investigate if and to which extent a pre-trained model ()
can codify disfluencies-related features in the encoding</p>
        <p>The resulting dataset consists of pairs of sequences module, even if they are not trained to do so. The
emof emissions (i.e., distilled features) and correspond- ployed approach is aimed at building specific classifiers
ing labels identified by the model and the layer from whose inputs are represented by intermediate emissions
which they were extracted. Note that each sequence of the considered model’s encoder layers (), combined
of intermediate layer emissions has a length ℎ = with the appropriate sequence of labels based on dataset
4/40, as it represents the tempo- annotation. Internally, each classifier consists of a Long
ral succession of segments before, during, and after dis- Short Term Memory (LSTM) module followed by a Feed
lfuency phenomena. We use the term emission [10, 9] to Forward Neural Network (FFNN). Given that our
probindicate intermediate layer neurons fire, instead of the lem can be related to sequence classification, the LSTMs
more commonly used term embedding [8], as the latter seem to be the most naturally suited model [22]; usually,
is widely used to indicate the output of an entire module an LSTM consists of one computational unit that
iterarather than a layer. tively processes all input time series vectors. This unit
(a) CTC-based classifier with hidden size 640 trained on
distilled features from layer 18 (index 17 in F1,DTW
plots).
(b) RNN-T-based classifier with hidden size 640 trained on
distilled features from layer 16 (index 15 in F1,DTW
plots).
comprises three gates processing one vector at a time and considered. The RNN-T is an auto-regressive speech
trancombining it with information extracted from previous scription technique that overcomes CTC’s limitations,
vectors. One of the most crucial parameters for an LSTM being non-auto-regressive and subject to limited label
seis the hidden layer, therefore we investigate the impact of quence length. The Transducer decoding technique can
three diferent layer sizes (hidden-layer size, ), namely produce label-transcription sequences longer than the
160, 320 and 640. So, an LSTM-based classifier processes input vector sequence and models inter-dependency in
a sequence of {,} emission vectors (each of length long-term transcription elements. A Transducer typically
ℎ) and produces a new sequence of vectors with size . comprises two sub-modules: one that forecasts the next
The two sequences are aligned over time. At each time transcription label based on the previous transcriptions
step , the FFNN produces a label indicating whether the (prediction network) and the other that combines the
considered input represents a specific disfluency segment encoder and prediction-network outputs to produce a
(label 1 for filled pause or 2 for prolongation) or not (with new transcription label (joiner network). These features
label 0) based on the LSTM hidden-layer output. In sum- improve transcription speed and performance compared
mary, we train and evaluate many diferent LSTM-based to CTC while requiring more training and computational
disfluencies classifiers/detectors ( ,,) for all possible resources [26]. Note that both pre-trained models rely
, , and  combinations to search for the evidence of on the same encoder architecture, but the
Conformerdisfluencies-related properties in the models’ decisions. CTC model has 18 encoding layers, while the
Conformer</p>
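        <p>As a rough illustration of the classifier just described, the following PyTorch sketch pairs a unidirectional LSTM with a feed-forward layer that emits one label per 40 ms frame. It is our reconstruction under stated assumptions, not the released code; in particular, the 512-dimensional emission size is a placeholder, since the actual dimensionality depends on the probed Conformer encoder.</p>
        <preformat>
import torch
import torch.nn as nn

class DisfluencyProbe(nn.Module):
    """Probing classifier sketch: a unidirectional LSTM over the
    sequence of layer emissions, followed by a feed-forward module
    labelling every time step as 0 (none), 1 (FP) or 2 (PRL)."""

    def __init__(self, emission_dim, hidden_size=640, n_classes=3):
        super().__init__()
        # hidden_size is the probed hyper-parameter: 160, 320 or 640
        self.lstm = nn.LSTM(emission_dim, hidden_size, batch_first=True)
        self.ffnn = nn.Linear(hidden_size, n_classes)

    def forward(self, emissions):
        # emissions: (batch, time, emission_dim), one vector per 40 ms
        hidden, _ = self.lstm(emissions)  # (batch, time, hidden_size)
        return self.ffnn(hidden)          # per-frame class logits

# one probe per (model m, encoder layer l, hidden size hs) combination
probe = DisfluencyProbe(emission_dim=512, hidden_size=640)
logits = probe(torch.randn(8, 100, 512))   # 8 segments of 100 frames
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3),
                             torch.zeros(8 * 100, dtype=torch.long))
        </preformat>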
        <p>The goal is to explore which of the considered pre-trained E2E-ASR models, based on different decoding systems, better encodes characteristics associated with disfluent speech segments, in order to perform a fillers and prolongations detection task. To this end, two publicly available [23] Conformer-based models [24] with 120 million parameters each, built with the NVIDIA NeMo toolkit and differing only in the decoding strategy, were selected. On the one hand, a Conformer-based model with a Connectionist Temporal Classification (CTC) [25] decoder has been considered, as CTC is one of the most popular decoding techniques. It is a non-auto-regressive speech transcription technique that collapses consecutive, identical transcription labels (character, word piece, etc.) into one label, unless a special blank label separates them; the result is a sequence of labels shorter than or equal to the input vector sequence. Being non-auto-regressive, it is also considered computationally effective, as it requires less time and fewer resources for training and inference.</p>
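        <p>The collapsing rule can be sketched in a few lines of Python (an illustrative toy, not NVIDIA NeMo's decoder): runs of identical labels merge into one, and the special blank label both separates runs and is dropped from the output.</p>
        <preformat>
def ctc_collapse(labels, blank="_"):
    """Merge runs of identical labels, unless a blank separates them,
    then drop the blanks (the CTC collapsing rule described above)."""
    out = []
    prev = None
    for lab in labels:
        if lab != prev and lab != blank:  # a new, non-blank run starts
            out.append(lab)
        prev = lab
    return "".join(out)

print(ctc_collapse("ccaaattt"))    # "cat"
print(ctc_collapse("cc_aa_a_t"))   # "caat" (blank keeps the two a's apart)
        </preformat>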
        <p>On the other hand, a Conformer-based model with the Recurrent Neural Network Transducer (RNN-T), commonly known as Transducer, has been considered. The RNN-T is an auto-regressive speech transcription technique that overcomes CTC's limitations (CTC is non-auto-regressive and its label sequences cannot exceed the input length): the Transducer decoding technique can produce label sequences longer than the input vector sequence and models inter-dependencies between long-term transcription elements. A Transducer typically comprises two sub-modules: one that forecasts the next transcription label based on the previous transcriptions (prediction network) and one that combines the encoder and prediction-network outputs to produce a new transcription label (joiner network). These features improve transcription speed and performance compared to CTC, while requiring more training and computational resources [26]. Note that both pre-trained models rely on the same encoder architecture, but the Conformer-CTC model has 18 encoding layers, while the Conformer-Transducer encoder has 17 layers.</p>
        <p>In this study, approximately 100 classifiers (2 models × ∼17 layers × 3 classifier sizes) were trained to investigate which of the considered pre-trained models, differing only in the decoding approach, encodes enough information to perform a disfluency detection task. To evaluate the alignment between the output of a classifier and the reference label sequence, we employ the Dynamic Time Warping distance (DTW distance) [27], reported in figure 2a. The DTW results highlight that the layers closer to the decoding module seem to contain most of the information needed to perform a correct detection of the considered disfluencies, obtaining an average DTW distance of approximately 1.39 in all cases, with a considerably low standard error. Then, to evaluate the capability of each classifier to provide a correct as well as aligned label sequence, we employed the weighted F1 measure, reported in figure 2b. Also in this case, the F1 results confirm that the layers closer to the decoding module seem to be those containing most of the information needed to correctly identify the disfluency segment.</p>
        <p>[Figure 2: (a) Average dynamic time warping distance between sequences of labels, with standard error (shaded). (b) Average weighted F1 measure between sequences of labels, with standard error (shaded).]</p>
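        <p>The two metrics can be illustrated with a small sketch. The DTW below uses a plain unit-cost formulation, since the exact cost function of [27] is not spelled out here, and the weighted F1 comes from scikit-learn; both are our assumptions about the setup, not the authors' evaluation code.</p>
        <preformat>
import numpy as np
from sklearn.metrics import f1_score

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two label sequences,
    with unit cost for mismatched labels (a toy stand-in for [27])."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

ref = [0] * 40 + [1] * 20 + [0] * 40   # reference frame labels
hyp = [0] * 43 + [1] * 20 + [0] * 37   # slightly shifted prediction

print(dtw_distance(ref, hyp))                   # alignment-sensitive
print(f1_score(ref, hyp, average="weighted"))   # label-accuracy oriented
        </preformat>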
        <p>The combination of F1 and DTW provides an integrated perspective on the system's ability to classify and align segments correctly. Finally, in Figure 3 (a and b), we report the confusion matrices of the best classifiers obtained from each considered model. On the one side, the CTC-based classifier seems to be better at discriminating non-disfluent segments (ND), while showing the worst performance in disfluency identification. On the other side, the RNN-T-based classifier shows considerable performance at identifying FPs and is the worst at discriminating ND segments, while its PRL performance is comparable to the CTC classifier's. Both matrices highlight that the most difficult disfluency phenomena to classify are prolongations, which are the focus of our preliminary exploratory error analysis.</p>
        <p>[Figure 3: Confusion matrices of the best classifiers. (a) CTC-based classifier with hidden size 640 trained on distilled features from layer 18 (index 17 in the F1/DTW plots). (b) RNN-T-based classifier with hidden size 640 trained on distilled features from layer 16 (index 15 in the F1/DTW plots).]</p>
      </sec>
      <sec id="sec-1-2-2">
        <title>3.2. Qualitative Analysis</title>
        <p>The qualitative analysis is based on the best classifier for each of the considered models used to generate the distilled features. In particular, for the CTC version, the best classifier was the one with 640 hidden neurons trained on 18th-layer features. Among the Transducer-based versions, the one with 640 hidden neurons trained on 17th-layer features emerged as the best version.</p>
        <p>The visual inspection of the distribution of the considered phenomena highlights that for both the CTC (figure 4a) and the RNN Transducer classifiers (figure 4b), FP phenomena concentrate on higher weighted F1 values, whereas wider distributions are observed for PRL phenomena, which shows that both classifiers work better when dealing with FP than with PRL phenomena. Focusing on the PRL instances, a negative correlation is observed between the weighted F1 scores and the PRLs' duration (CTC non-recognized r = -0.91, figure 4c; RNN Transducer non-recognized r = -0.87, figure 4d).</p>
        <p>[Figure 4: (a, b) Distributions of weighted F1 by phenomenon for the CTC and Transducer classifiers; (c, d) correlation between weighted F1 and PRL duration for the CTC and Transducer classifiers.]</p>
        <p>The error analysis was supported by an auditory inspection of the unrecognized and misclassified samples, filtered based on the average DTW distance, namely 1.39 for the Transducer-based and 1.40 for the CTC-based classifier. Issues in PRL recognition mostly concerned shorter instances, those characterized by peculiar 'non-prototypical' phonation features (such as unsteady, creaky phonation), and the alignment of PRL-predicted occurrences. Also, several PRL phenomena were misclassified as FP when occurring with monosyllabic words, such as 'o&lt;oo&gt;', 'un po&lt;oo&gt;', 'che&lt;ee&gt;', 'e&lt;ee&gt;'. In fact, the phonetic realization of these instances is closer to the one that characterizes FPs, in terms of their vowel quality and of being, to a certain extent, elements independent from the phonetic environment.</p>
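        <p>The reported correlation could be computed along the following lines (a hypothetical sketch: the per-instance container and field names are ours), pairing each PRL instance's weighted F1 score with its annotated duration.</p>
        <preformat>
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

def f1_duration_correlation(samples):
    """samples: (ref_labels, hyp_labels, prl_duration_s) tuples, one per
    PRL instance; returns Pearson's r between weighted F1 and duration."""
    scores = [f1_score(ref, hyp, average="weighted")
              for ref, hyp, _ in samples]
    durations = [dur for _, _, dur in samples]
    r, p_value = pearsonr(scores, durations)
    return r, p_value
        </preformat>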
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Discussion and Conclusions</title>
      <p>In this work, we build upon a previous study that investigated to what extent modern E2E-ASRs encode features related to disfluency phenomena, even if they are not directly trained to do so. We showed that pre-trained models with the same audio encoder but with two different state-of-the-art decoding strategies (CTC and Transducer) capture disfluency-related features, especially in the latest encoding layer, and that both model features that can be used for the identification and positioning of disfluent speech segments [10]. Although there seems to be a tendency to forget this information in subsequent layers, as the trends for DTW (figure 2a) and the F1 measure (figure 2b) would suggest, the last layers, which are those closest to the objective function represented by the decoding module, seem the most prone to retain characteristics useful to locate and identify disfluency phenomena. Interestingly, despite the differences between the two decoding modules, which are respectively non-recurrent (CTC) and recurrent (RNN-T), the performances on the chosen task are comparable. However, the confusion matrices highlight that the CTC-based classifier performs better in the disfluency feature discrimination task, while the Transducer-based classifier more precisely identifies filled pauses, which could be related to the scope (recurrent/non-recurrent) of the objective function. The results align with the literature, which shows a strong sensitivity to features concerning words and phones in the layers closest to the decoder [4], while the layers closest to the input are more sensitive to features related to accent and local acoustic characteristics [3, 4]. It is worth noticing that, in a recent work [9], sensitivity to syllabic boundaries was found in layers 3-5, with a pattern similar to the one shown in Figure 2 but without the peak in the last layers. The reason can be found in the fact that syllables and their boundaries have no graphic distinction in the transcriptions; conversely, in the case of disfluencies, there is a form of transcription that identifies them within a language model.</p>
      <p>The exploratory analysis of the errors highlighted that prolongations are more difficult to detect than filled pauses, which could depend on their being an integral (though lengthened) part of 'fluent' words, while filled pauses are mostly realized as independent elements. Also, instances of prolongation are mostly non-recognized or misclassified as filled pauses when characterized by peculiar 'non-prototypical' phonation features, such as creaky phonation, or by filler-like features, as in the case of monosyllabic word-final prolongations. Moreover, previous studies on the segmental quality of prolongations in Italian [28] showed that prolongations, especially when concerning consonantal sounds, can be realised with schwa sounds similar to those that characterize most filled pauses. This filler-like quality could also be considered among the underlying reasons for the negative correlation between the evaluation metrics of prolongation misclassification and their duration. Another possible motivation could reside in a bias in the dataset combined with the classifier architecture (LSTM), which easily recognises prolongations responding to a specific length pattern. This means that the scarcity of longer prolongations hinders their modelling, leading to their misclassification.</p>
      <p>These findings could be used to improve transcription applications by enriching them with disfluency annotation (including filler particles and prolongation phenomena), which is still a rather costly process for studies concerning hesitation phenomena and (own) speech management in typical as well as atypical speech (e.g., pathological or language learners' speech). Indeed, an immediate development of the described work consists of increasing the capabilities of the pre-trained E2E-ASRs by adding a simple disfluency identification module to complement the existing decoder, thus enriching the resulting transcriptions.</p>
      <p>Our work is built upon unidirectional LSTMs rather than bidirectional LSTMs (BiLSTMs), which would provide better performance, because the latter have slightly longer inference times, require a larger amount of data, resources and time to be trained and, most importantly, present a more complex behaviour [29]. However, the introduction of different architecture modules such as bidirectional LSTMs could improve the detection of prolongation disfluencies. This will be part of future developments focused on performance and increased neural network complexity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1"><mixed-citation>[1] S. Wang, G. Li, Overview of end-to-end speech recognition, in: Journal of Physics: Conference Series, volume 1187, IOP Publishing, 2019, p. 052068.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] T. Viglino, P. Motlicek, M. Cernak, End-to-end accented speech recognition, in: Interspeech, 2019, pp. 2140-2144.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] A. Prasad, P. Jyothi, How accents confound: Probing for accent information in end-to-end speech recognition systems, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3739-3753.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914-921.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] P. C. English, J. Kelleher, J. Carson-Berndsen, Domain-informed probing of wav2vec 2.0 embeddings for phonetic features, in: Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2022, pp. 83-91.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] K. Martin, J. Gauthier, C. Breiss, R. Levy, Probing self-supervised speech models for phonetic and phonemic information: A case study in aspiration, in: INTERSPEECH 2023, 2023, pp. 251-255. doi:10.21437/Interspeech.2023-2359.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] G. Shen, M. Watkins, A. Alishahi, A. Bisazza,</mixed-citation></ref>
    </ref-list>
  </back>
</article>