                                Modelling filled particles and prolongation using
                                end-to-end Automatic Speech Recognition systems: a
                                quantitative and qualitative analysis.
                                Vincenzo Norman Vitale1,† , Loredana Schettino2,† and Francesco Cutugno1
                                1
                                    University of Naples Federico II, Naples, Italy
                                2
                                    Free University of Bozen-Bolzano, Bozen, Italy


                                                   Abstract
State-of-the-art automatic speech recognition systems based on End-to-End models (E2E-ASRs) achieve remarkable performance. However, phenomena that characterize spoken language, such as fillers or segmental prolongations, are still mostly treated as disruptive objects to be excluded from optimal transcriptions, despite their acknowledged regularity and communicative value. A recent study showed that two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders – a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder – tend to model speech features that are functional for the identification of filled pauses and prolongations in speech. This work builds upon those findings by investigating which of the two systems performs better at filler and prolongation detection, and by conducting an error analysis to deepen our understanding of how these systems work.

                                                   Keywords
disfluencies, speech recognition, probing, interpretability, explainability



                                1. Introduction
In recent work on Automatic Speech Recognition (ASR) systems based on Deep Neural Networks (DNNs), a great deal of effort has focused on increasing system performance by employing increasingly complex, hence hardly interpretable, models that require huge amounts of training data, such as End-to-End Automatic Speech Recognition (E2E-ASR) models, which represent the state of the art. An
                                E2E-ASR model directly converts a sequence of input
                                acoustic feature vectors (or possibly raw audio samples)
into a series of graphemes or words that represent the transcription of the audio signal [1], as represented in figure 1. In contrast, traditional ASR systems typically train the acoustic, pronunciation, and language models separately, requiring distinct modelling and training for each component. These systems usually aim to obtain speech transcriptions 'cleaned' from phenomena that characterise spoken language, such as discourse markers, particles, pauses, or other phenomena commonly referred to as 'disfluencies'. Studies on the interpretability of the dynamics underlying neural models showed that state-of-the-art systems based on End-to-End models (E2E-ASRs) can model linguistic and acoustic features of spoken language, which can be investigated to explain their internal dynamics. Several probing techniques have been designed to inspect and better understand the internal behavior of DNN layers at different depths. With these techniques, investigations on the internals of DeepSpeech2 [2, 3] revealed the influence of diatopic pronunciation variation in various English varieties and provided evidence that intermediate layers contain information crucial for their classification. Later, a study [4] on the layerwise capacity to encode information about acoustic features, phone identity, word identity, and word meaning based on the context of occurrence highlighted that

Figure 1: E2E-ASRs are based on an encoder–decoder architecture. The speech signal is fed to the encoder, producing an encoded representation that contains the information needed by the decoder to provide the sequence of words/characters/subwords and build the transcription.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
$ vincenzonorman.vitale@unina.it (V. N. Vitale); lschettino@unibz.it (L. Schettino); cutugno@unina.it (F. Cutugno)
ORCID: 0000-0002-0365-8575 (V. N. Vitale); 0000-0002-3788-3754 (L. Schettino); 0000-0001-9457-6243 (F. Cutugno)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




the last layer right before the decoding module retains information about word meaning, rather than the local acoustic features and phone identity information that are captured by the first and intermediate layers respectively. Other studies have further investigated the capacity of state-of-the-art models to encode phonetic/phonemic information [5, 6], lexical tone [7] and gender [8]. Finally, [9] investigated the internal dynamics of three pre-trained E2E-ASRs, evidencing the emergence of syllable-related features by training an acoustic syllable-boundary detector. Following this line of research, a recent study [10] investigated the ability of two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders – a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder – to model features that distinguish filled pauses and prolongations in speech, and showed that, despite not being originally trained to detect disfluencies, these systems tend to model some speech features that are functional for their identification. Rather than disregarding the ability of E2E-ASRs to model the acoustic information tied to such speech phenomena as a dispensable noise source, it could be exploited to achieve different ends. On the one hand, it could be used to obtain more accurate transcriptions that provide better, or rather more faithful, representations of the speech signal, which would also support linguistic annotation processes. On the other hand, exploring the systems' modelling ability deepens our understanding of their underlying dynamics. In the last 20 years, disfluency detection tasks have been conducted to improve speech recognition performance [11, 12], and different recent approaches to filler detection achieve rather high performance, see [13]. However, these investigations mostly concern filler particles and, to our knowledge, no such system has been tested on Italian data so far. The proposed work builds upon these findings by investigating which of the two decoding systems is better at performing a detection task for fillers and prolongations. Moreover, a quantitative and qualitative error analysis is conducted to deepen our understanding of the way these systems work.

2. Materials and Method

2.1. Data

In this study, we employed approximately 210 minutes of expert-annotated speech, divided into ∼80 minutes of informative speech [14], 90 minutes of descriptive speech [15], and approximately 40 minutes of dialogic speech [16], that is, dyads where two speakers recorded on different channels interact. While the data from [14] and [16] consist of speech produced by speakers of the Neapolitan variety of Italian, the speakers from [15] come from different Italian regions.

More specifically, the considered speech data include: audio-visual recordings of guided tours at San Martino Charterhouse (in Naples) led by three female expert guides (CHROME corpus [14]), which consist of informative semi-monologic, semi-spontaneous speech characterized by a high degree of discourse planning and an asymmetrical relationship between the speakers; audio-visual recordings of 10 speakers narrating 'Frog Stories' from a picture book [15], which elicited unplanned descriptive speech; and four task-oriented dialogues from the CLIPS corpus [16], which provide mainly descriptive semi-spontaneous speech characterized by a low degree of discourse planning and a high degree of collaboration between the interlocutors.

2.2. Annotation

Filled Pauses (FPs), defined as non-verbal fillers realized as vocalizations and/or nasalizations, and Prolongations (PRLs), defined as marked lengthenings of segmental material [17, 18], were manually annotated along with pauses, lexical fillers, repetitions, deletions, insertions, and substitutions, following the annotation scheme described in [19]. This is a multilevel annotation system developed to account for both formal and functional features of the phenomena used to manage one's own speech production. The identification of the different types of phenomena was based on a 'pragmatic approach' [20], which means that it did not rely on absolute measures but on perceptual judgments given the specific contexts of occurrence. The reliability of the annotation and the Inter-Annotator Agreement were evaluated by measuring Cohen's 𝜅, which yielded 0.92 for dialogic data and 0.82 for monologic data, which stands for 'high agreement' [21].

2.3. Data Preparation

The dataset was prepared from a set of Praat TextGrid annotation files indicating the speaker and the type of disfluency in the speech signal. Considering only the PRLs and the FPs, the resulting dataset comprises 1900 segments. For each segment, the contextual information preceding and following the disfluency phenomenon was included, giving each segment a length of 4 seconds. Then, combining the so-composed dataset with each of the considered pre-trained models' encoders (details reported in Section 3.1), the following elements were extracted for each segment and each intermediate encoding layer:

• A sequence of intermediate-layer emissions/embeddings representing the input segment in the layer's vectorial space. Each emission in the sequence represents a portion of 40 milliseconds of the input signal, due to the considered model's characteristics.
• A sequence of labels associated with each sequence of emissions, indicating whether an intermediate emission belongs to a particular class of disfluencies (1 for FP and 2 for PRL) or not (label 0 if the segment does not belong to a disfluency).

The resulting dataset consists of pairs of sequences of emissions (i.e., distilled features) and corresponding labels, identified by the model and the layer from which they were extracted. Note that each sequence of intermediate-layer emissions has length ℎ = 4 s / 40 ms = 100, as it represents the temporal succession of segments before, during, and after the disfluency phenomena. We use the term emission [10, 9] to indicate the firing of intermediate-layer neurons, instead of the more commonly used term embedding [8], as the latter is widely used to indicate the output of an entire module rather than of a single layer.

(a) Average Dynamic Time Warping distance measured between sequences of labels, with standard error (shaded).
(b) Average weighted F1 measured between sequences of labels, with standard error (shaded).
Figure 2: Dynamic Time Warping distance (a) and weighted F1 (b) for all the trained classifiers. The x-axis indicates the index (starting from 0) of the intermediate layer from which the distilled features were extracted to train the corresponding classifier.

3. Results

3.1. Disfluency Identification Through Model Probing

Building upon recent studies that use probes to better understand the internal behavior of pre-trained E2E-ASR models [9, 4, 3], we apply a similar approach to investigate whether, and to what extent, a pre-trained model (𝑚) codifies disfluency-related features in its encoding module, even though it was not trained to do so. The employed approach builds specific classifiers whose inputs are the intermediate emissions of the considered model's encoder layers (𝑙), combined with the appropriate sequence of labels based on the dataset annotation. Internally, each classifier consists of a Long Short-Term Memory (LSTM) module followed by a Feed-Forward Neural Network (FFNN). Given that our problem can be framed as sequence classification, LSTMs seem the most naturally suited model [22]. Usually, an LSTM consists of one computational unit that iteratively processes all input time-series vectors. This unit comprises three gates, processing one vector at a time and combining it with information extracted from previous vectors.
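For illustration, an LSTM-plus-FFNN probe of this kind can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the emission dimension (512) is a placeholder for the Conformer encoder's actual output size, and training code is omitted.

```python
import torch
import torch.nn as nn

class DisfluencyProbe(nn.Module):
    """LSTM followed by a feed-forward layer, emitting one label per
    40 ms emission frame: 0 = no disfluency, 1 = FP, 2 = PRL."""
    def __init__(self, emission_dim=512, hidden_size=640, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(emission_dim, hidden_size, batch_first=True)
        self.ffnn = nn.Linear(hidden_size, num_classes)

    def forward(self, emissions):             # (batch, h, emission_dim)
        hidden, _ = self.lstm(emissions)      # (batch, h, hidden_size)
        return self.ffnn(hidden)              # (batch, h, num_classes)

probe = DisfluencyProbe(hidden_size=640)      # one of the sizes 160/320/640
segment = torch.randn(1, 100, 512)            # h = 100 frames of one 4-s segment
logits = probe(segment)
frame_labels = logits.argmax(dim=-1)          # predicted per-frame label sequence
```

Each of the ∼100 probes described in this section would instantiate such a classifier for one specific (model, layer, hidden-size) combination.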
(a) CTC-based classifier with hidden size 640, trained on distilled features from layer 18 (index 17 in the F1/DTW plots).
(b) RNN-T-based classifier with hidden size 640, trained on distilled features from layer 16 (index 15 in the F1/DTW plots).
Figure 3: Confusion matrices of the best classifiers obtained for each of the considered decoding approaches.
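The per-frame label sequences on which such probes are trained (Section 2.3) pair one label with each 40 ms emission. A minimal sketch follows; the interval format is an assumption standing in for the parsed Praat TextGrid annotations.

```python
FRAME_MS = 40            # one encoder emission covers 40 ms of signal
SEGMENT_MS = 4000        # each segment spans 4 s -> h = 4000 / 40 = 100 frames
LABELS = {"FP": 1, "PRL": 2}   # 0 = no disfluency

def frame_labels(intervals):
    """Map annotated (start_ms, end_ms, type) intervals, with times relative
    to the segment start, onto a per-frame label sequence of length h."""
    labels = [0] * (SEGMENT_MS // FRAME_MS)
    for start, end, kind in intervals:
        first = start // FRAME_MS
        last = min(len(labels), -(-end // FRAME_MS))   # ceiling division
        for i in range(first, last):
            labels[i] = LABELS[kind]
    return labels

# A segment with a filled pause at 1.2-1.6 s and a prolongation at 2.0-2.3 s:
seq = frame_labels([(1200, 1600, "FP"), (2000, 2300, "PRL")])
```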
One of the most crucial parameters of an LSTM is the hidden-layer size, therefore we investigate the impact of three different sizes (hidden-layer size 𝑛), namely 160, 320 and 640. An LSTM-based classifier thus processes a sequence (of length ℎ) of {𝑒𝑙,𝑚 } emission vectors and produces a new sequence of vectors of size 𝑛. The two sequences are aligned over time. At each time step 𝑡, the FFNN produces, based on the LSTM hidden-layer output, a label indicating whether the considered input represents a specific disfluency segment (label 1 for filled pause, 2 for prolongation) or not (label 0). In summary, we train and evaluate many different LSTM-based disfluency classifiers/detectors (𝐿𝑛,𝑚,𝑙 ) for all possible 𝑛, 𝑚, and 𝑙 combinations, searching for evidence of disfluency-related properties in the models' decisions.

The goal is to explore which of the considered pre-trained E2E-ASR models, based on different decoding systems, better encodes characteristics associated with disfluent speech segments, as measured through a filler and prolongation detection task. To this end, two publicly available [23] Conformer-based models [24] with 120 million parameters each, built with the NVIDIA NeMo toolkit and differing only in the decoding strategy, were selected. On the one hand, a Conformer-based model with a Connectionist Temporal Classification (CTC) [25] decoder was considered, as CTC is one of the most popular decoding techniques. CTC is a non-auto-regressive speech transcription technique that collapses consecutive, identical transcription labels (characters, word pieces, etc.) into one label, unless a special blank label separates them. The result is a sequence of labels of length shorter than or equal to that of the input vector sequence. Being non-auto-regressive, it is also considered computationally effective, as it requires less time and fewer resources for training and inference. On the other hand, a Conformer-based model with the Recurrent Neural Network Transducer (RNN-T), commonly known as Transducer, was considered. The RNN-T is an auto-regressive speech transcription technique that overcomes the limitations of CTC, which is non-auto-regressive and subject to a limited label-sequence length. The Transducer decoding technique can produce label sequences longer than the input vector sequence and models inter-dependencies between long-term transcription elements. A Transducer typically comprises two sub-modules: one that forecasts the next transcription label based on the previous transcriptions (prediction network), and another that combines the encoder and prediction-network outputs to produce a new transcription label (joiner network). These features improve transcription speed and performance compared to CTC, while requiring more training and computational resources [26]. Note that both pre-trained models rely on the same encoder architecture, but the Conformer-CTC model has 18 encoding layers, while the Conformer-Transducer encoder has 17.

In this study, ∼100 classifiers (2 models × ∼17 layers × 3 classifier sizes) were trained to investigate which of the considered pre-trained models, differing only in the decoding approach, encodes enough information to perform a disfluency detection task.

To evaluate the alignment between the output of the classifier and the reference label sequence, we employ the Dynamic Time Warping distance (DTW distance) [27], reported in figure 2a. The DTW results highlight that the layers closer to the decoding module seem to contain most of the information needed to correctly detect the considered disfluencies, obtaining an average DTW distance of approximately 1.39 in all cases, with a considerably low standard error. Then, to evaluate the capability of each classifier to provide a correct as well as aligned label sequence, we employed the weighted F1 measure, reported in figure 2b. Also in this case, the F1 results confirm that the layers closer to the decoding module seem to be those containing most of the information needed to correctly identify the disfluency segment. The combination of F1 and DTW provides an integrated perspective on the system's ability to classify and align segments correctly.
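The DTW distance used here to score alignment can be sketched in a few lines of dynamic programming. The unit substitution cost is an assumption, as the paper does not specify its cost function; the example also shows how DTW forgives a slightly shifted boundary that a frame-wise metric would penalise, which is why it is complemented by the weighted F1.

```python
def dtw_distance(a, b, cost=lambda x, y: 0 if x == y else 1):
    """Classic dynamic-programming DTW between two label sequences,
    with unit cost for mismatched labels and no warping window."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = cost(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

ref  = [0, 0, 1, 1, 1, 0, 2, 2]   # gold frame labels (1 = FP, 2 = PRL)
pred = [0, 0, 0, 1, 1, 0, 2, 2]   # prediction with the FP onset one frame late
d = dtw_distance(ref, pred)       # warping absorbs the shifted boundary: d == 0
```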
Figure 4: The plots in (a) for CTC and (b) for RNN-T report the F1 measure related to the frequency of FP (yellow) and PRL (purple). Scatterplots for CTC (c) and RNN-T (d) compare the duration of the PRL segments with the respective F1 measure.
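As a side note on the decoding contrast of Section 3.1, the CTC collapse rule (merge consecutive identical labels unless a blank separates them, then drop the blanks) can be illustrated with a small sketch; the underscore blank symbol is an illustrative choice, not the models' actual vocabulary.

```python
BLANK = "_"  # special separator label; "_" is an illustrative choice

def ctc_collapse(labels):
    """Collapse a frame-level CTC label sequence into a transcription:
    repeats merge into one label unless a blank separates them,
    and blanks are removed afterwards."""
    out = []
    prev = None
    for lab in labels:
        if lab != prev and lab != BLANK:  # a blank resets `prev`, so a
            out.append(lab)               # repeated label after it survives
        prev = lab
    return "".join(out)

print(ctc_collapse(list("hh_eel_lloo")))  # collapses to "hello"
```

Note how the output can never be longer than the input, which is exactly the CTC limitation the Transducer removes.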
Finally, in Figure 3 (a and b), we report the confusion matrices of the best classifiers obtained from each considered model. On the one side, the CTC-based classifier seems better at discriminating non-disfluent segments (ND), while showing the worst performance in disfluency identification. On the other side, the RNN-T-based classifier shows considerable performance at identifying FPs and is the worst at discriminating ND segments, while its PRL performance is comparable to that of the CTC classifier. Both matrices highlight that the most difficult disfluency phenomena to classify are prolongations, which are the focus of our preliminary exploratory error analysis.

3.2. Qualitative Analysis

The qualitative analysis is based on the best classifier for each of the models used to generate the distilled features. In particular, for the CTC version, the best classifier was the one with 640 hidden neurons trained on 18th-layer features. Among the Transducer-based versions, the one with 640 hidden neurons trained on 17th-layer features emerged as the best.

The visual inspection of the distribution of the considered phenomena highlights that, for both the CTC (4a) and the RNN Transducer classifiers (4b), FP phenomena concentrate on higher weighted-F1 values, whereas wider distributions are observed for PRL phenomena, which shows that both classifiers work better when dealing with FP than with PRL phenomena. Focusing on the PRL instances, a negative correlation is observed between the weighted F1 scores and the PRLs' duration (CTC non-recognized r = −0.91, figure 4c; RNN Transducer non-recognized r = −0.87, figure 4d).

The error analysis was supported by an auditory inspection of the unrecognized and misclassified samples, filtered based on the average DTW distance, namely 1.39 for the Transducer-based and 1.40 for the CTC-based classifier. Issues in PRL recognition mostly concerned shorter instances, instances characterized by peculiar 'non-prototypical' phonation features (such as unsteady, creaky phonation), and the alignment of PRL-predicted occurrences. Also, several PRL phenomena were misclassified as FP when occurring with monosyllabic words, such as 'o', 'un po', 'che', 'e'. In fact, the phonetic realization of these instances is closer to the one that characterizes FPs, both for their vowel quality and for being, to a certain extent, elements independent of the phonetic environment.

4. Discussion and Conclusions

In this work, we build upon a previous study that investigated to what extent modern E2E-ASRs encode features related to disfluency phenomena, even though they are not directly trained to do so. We showed that pre-trained models with the same audio encoder but two different state-of-the-art decoding strategies (CTC and Transducer) capture disfluency-related features, especially in the latest encoding layers, and that both model features that can be used for the identification and positioning of disfluent speech segments [10]. Although there seems to be a tendency to forget this information in subsequent layers, as the trends for DTW (figure 2a) and F1 (figure 2b) would suggest, the last layers, which are those closest to the objective function represented by the decoding module, seem the most prone to retain characteristics useful to locate and identify disfluency phenomena. Interestingly, despite the differences between the two decoding modules, which are respectively non-recurrent (CTC) and recurrent (RNN-T), the performances on the chosen task are comparable. However, the confusion matrices highlight that the CTC-based classifier performs better in the disfluency-feature discrimination task, while the Transducer-based classifier more precisely identifies filled pauses, which could be related to the scope (recurrent/non-recurrent) of the objective function. The results align with the literature, which shows a strong sensitivity to word- and phone-related features in the layers closest to the decoding module [4], while the layers closest to the input are more sensitive to features related to accent and local acoustic characteristics [3, 4]. It is worth noticing that, in a recent work [9], sensitivity to syllabic boundaries was found in layers 3–5, with a pattern similar to the one shown in Figure 2 but without the peak in the last layers. The reason can be found in the fact that syllables and their boundaries have no graphic counterpart in the transcriptions; conversely, in the case of disfluencies, there is a form of transcription that identifies them within a language model.

The exploratory analysis of the errors highlighted that prolongations are more difficult to detect than filled pauses, which could depend on their being an integral (though lengthened) part of 'fluent' words, while filled pauses are mostly realized as independent elements. Also, instances of prolongation are mostly non-recognized, or misclassified as filled pauses, when characterized by peculiar 'non-prototypical' phonation features, such as creaky phonation, or by filler-like features, as in the case of monosyllabic word-final prolongations. Moreover, previous studies on the segmental quality of prolongations in Italian [28] showed that prolongations, especially those concerning consonantal sounds, can be realised with schwa sounds similar to those that characterize most filled pauses. This filler-like quality could also be considered among the underlying reasons for the negative correlation between the evaluation metrics of prolongation misclassification and their duration. Another possible motivation could reside in a bias in the dataset combined with the classifier architecture (LSTM), which easily recognises prolongations responding to a specific length pattern. This means that the scarcity of longer prolongations hinders their modelling, leading to their misclassification.

These findings could be used to improve transcription applications by enriching them with disfluency annotation (including filler particles and prolongation phenomena), which is still a rather costly process for studies concerning hesitation phenomena and (own) speech management in typical as well as atypical speech (e.g., pathological or language learners' speech). Indeed, an immediate development of the described work consists of increasing the capabilities of the pre-trained E2E-ASRs by adding a simple disfluency identification module to complement the existing decoder, thus enriching the resulting transcriptions.

Our work is built upon unidirectional LSTMs rather than bidirectional LSTMs (BiLSTMs), which could provide better performance: the latter, however, have slightly longer inference times, require larger amounts of data, resources, and time to be trained and, most importantly, present a more complex behaviour [29]. Nevertheless, the introduction of different architectural modules, such as bidirectional LSTMs, could improve the detection of prolongation disfluencies. This will be part of future developments focused on performance and increased neural-network complexity.

References

[1] S. Wang, G. Li, Overview of end-to-end speech recognition, in: Journal of Physics: Conference Series, volume 1187, IOP Publishing, 2019, p. 052068.
[2] T. Viglino, P. Motlicek, M. Cernak, End-to-end accented speech recognition, in: Interspeech, 2019, pp. 2140–2144.
[3] A. Prasad, P. Jyothi, How accents confound: Probing for accent information in end-to-end speech recognition systems, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3739–3753.
[4] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921.
[5] P. C. English, J. Kelleher, J. Carson-Berndsen, Domain-informed probing of wav2vec 2.0 embeddings for phonetic features, in: Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2022, pp. 83–91.
[6] K. Martin, J. Gauthier, C. Breiss, R. Levy, Probing self-supervised speech models for phonetic and phonemic information: A case study in aspiration, in: INTERSPEECH 2023, 2023, pp. 251–255. doi:10.21437/Interspeech.2023-2359.
[7] G. Shen, M. Watkins, A. Alishahi, A. Bisazza, G. Chrupała, Encoding of lexical tone in self-supervised models of spoken language, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long
20–23 July 2009, Liverpool, UK, 2009, pp. 20–23.
[17] R. Eklund, Disfluency in Swedish Human–Human and Human–Machine travel booking dialogues, Ph.D. thesis, Linköping University Electronic Press, 2004.
[18] S. Betz, Hesitations in Spoken Dialogue Systems,
     Papers), Association for Computational Linguis-               Ph.D. thesis, Universität Bielefeld, 2020.
     tics, Mexico City, Mexico, 2024, pp. 4250–4261.          [19] L. Schettino, The Role of Disfluencies in Italian Dis-
     URL: https://aclanthology.org/2024.naacl-long.239.            course. Modelling and Speech Synthesis Applica-
     doi:10.18653/v1/2024.naacl-long.239.                          tions., Ph.D. thesis, Università degli Studi di Salerno,
 [8] A. Krishnan, B. M. Abdullah, D. Klakow, On the en-            2022.
     coding of gender in transformer-based asr represen-      [20] R. J. Lickley, Fluency and disfluency, in: M. A. Red-
     tations, in: Interspeech 2024, 2024, pp. 3090–3094.           ford (Ed.), The handbook of speech production, Wi-
     doi:10.21437/Interspeech.2024-2209.                           ley Online Library, 2015, pp. 445–474. doi:https:
 [9] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Ex-            //doi.org/10.1002/9781118584156.ch20.
     ploring emergent syllables in end-to-end automatic       [21] J. R. Landis, G. G. Koch, The measurement of ob-
     speech recognizers through model explainability               server agreement for categorical data, Biometrics
     technique, Neural Computing and Applications                  (1977) 159–174.
     (2024) 1–27.                                             [22] S. Hochreiter, J. Schmidhuber, Long short-term
[10] V. N. Vitale, L. Schettino, F. Cutugno, Rich                  memory, Neural computation 9 (1997) 1735–1780.
     speech signal: exploring and exploiting end-to-          [23] NVIDIA, Nvidia catalog for pre-trained conformer
     end automatic speech recognizers’ ability to model            models, 2023. URL: https://catalog.ngc.nvidia.
     hesitation phenomena, in: Interspeech 2024,                   com/orgs/nvidia/teams/nemo/models/stt_en_
     2024, pp. 222–226. doi:10.21437/Interspeech.                  conformer_{transducer|ctc}_large.
     2024-2029.                                               [24] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang,
[11] M. Gabrea, D. OShaughnessy, Detection of filled               J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu,
     pauses in spontaneous conversational speech, in:              et al., Conformer: Convolution-augmented trans-
     6th International Conference on Spoken Language               former for speech recognition, arXiv preprint
     Processing (ICSLP 2000), ISCA, 2000, pp. vol. 3, 678–         arXiv:2005.08100 (2020).
     681–0. URL: https://www.isca-archive.org/icslp_          [25] A. Graves, S. Fernández, F. Gomez, J. Schmidhu-
     2000/gabrea00_icslp.html. doi:10.21437/ICSLP.                 ber, Connectionist temporal classification: labelling
     2000-626.                                                     unsegmented sequence data with recurrent neural
[12] E. Shriberg, Spontaneous speech: how people really            networks, in: Proceedings of the 23rd international
     talk and why engineers should care., in: INTER-               conference on Machine learning, 2006, pp. 369–376.
     SPEECH, Citeseer, 2005, pp. 1781–1784.                   [26] A. Graves, Sequence transduction with recurrent
[13] V. Kany, J. Trouvain, Semiautomatic support of                neural networks, arXiv preprint arXiv:1211.3711
     speech fluency assessment by detecting filler parti-          (2012).
     cles and determining speech tempo, in: Workshop          [27] M. Müller, Dynamic time warping, Information
     on prosodic features of language learners’ fluency,           retrieval for music and motion (2007) 69–84.
     2024.                                                    [28] L. Schettino, R. Eklund, Prolongation in italian, in:
[14] A. Origlia, R. Savy, I. Poggi, F. Cutugno, I. Alfano,         Proceedings of Disfluency in Spontaneous Speech
     F. D’Errico, L. Vincze, V. Cataldo, An audiovisual            Workshop 2023 (DiSS 2023), 28–30 August 2023,
     corpus of guided tours in cultural sites: Data collec-        Bielefeld, Germany, 2023, pp. 81–85.
     tion protocols in the CHROME project, in: Proceed-       [29] S. Siami-Namini, N. Tavakoli, A. S. Namin, The
     ings of the 2018 AVI-CH Workshop on Advanced Vi-              performance of lstm and bilstm in forecasting time
     sual Interfaces for Cultural Heritage, volume 2091,           series, in: 2019 IEEE International conference on
     2018, pp. 1–4.                                                big data (Big Data), IEEE, 2019, pp. 3285–3292.
[15] G. Sarro, The many ways to search for an Italian
     frog. The Manner encoding in an Italian corpus
     collected with Modokit., Master’s thesis, Università
     degli Studi dell’Aquila., 2023.
[16] R. Savy, F. Cutugno, Diatopic, diamesic and diapha-
     sic variations in spoken Italian, in: M. Mahlberg,
     V. González-Díaz, C. Smith (Eds.), Proceedings of
     CL2009, The 5th Corpus Linguistics Conference,