=Paper=
{{Paper
|id=Vol-3878/107_main_long
|storemode=property
|title=Modelling Filled Particles and Prolongation Using End-to-end Automatic Speech Recognition Systems: A Quantitative and Qualitative Analysis
|pdfUrl=https://ceur-ws.org/Vol-3878/107_main_long.pdf
|volume=Vol-3878
|authors=Vincenzo Norman Vitale,Loredana Schettino,Francesco Cutugno
|dblpUrl=https://dblp.org/rec/conf/clic-it/VitaleSC24
}}
==Modelling Filled Particles and Prolongation Using End-to-end Automatic Speech Recognition Systems: A Quantitative and Qualitative Analysis==
Vincenzo Norman Vitale¹,†, Loredana Schettino²,† and Francesco Cutugno¹

¹ University of Naples Federico II, Naples, Italy
² Free University of Bozen-Bolzano, Bozen, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
† These authors contributed equally.
vincenzonorman.vitale@unina.it (V. N. Vitale); lschettino@unibz.it (L. Schettino); cutugno@unina.it (F. Cutugno)
ORCID: 0000-0002-0365-8575 (V. N. Vitale); 0000-0002-3788-3754 (L. Schettino); 0000-0001-9457-6243 (F. Cutugno)
Abstract
State-of-the-art automatic speech recognition systems based on End-to-End models (E2E-ASRs) achieve remarkable performance. However, phenomena that characterize spoken language, such as fillers or segmental prolongations, are still mostly treated as disrupting objects that should be excluded to obtain optimal transcriptions, despite their acknowledged regularity and communicative value. A recent study showed that two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders – a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder – tend to model some speech features that are functional for the identification of filled pauses and prolongations in speech. This work builds upon these findings by investigating which of the two systems performs better at detecting fillers and prolongations, and by conducting an error analysis to deepen our understanding of how these systems work.
Keywords
disfluencies, speech recognition, probing, interpretability, explainability
1. Introduction
In recent works on Automatic Speech Recognition (ASR) systems based on the computing power of Deep Neural Networks (DNN), a great deal of effort is focused on increasing the systems' performance by employing increasingly complex, hence hardly interpretable, DNN models that require huge amounts of training data, such as End-to-End Automatic Speech Recognition (E2E-ASR) models, which represent the state of the art. An E2E-ASR model directly converts a sequence of input acoustic feature vectors (or possibly raw audio samples) into a series of graphemes or words that represent the transcription of the audio signal [1], as represented in Figure 1. In contrast, traditional ASR systems typically train the acoustic, pronunciation, and language models separately, requiring distinct modelling and training for each component.

Figure 1: E2E-ASRs are based on an encoder-decoder architecture. The speech signal is fed to the encoder, producing an encoded representation that contains the information needed by the decoder to provide the sequence of words/characters/subwords and build the transcription.

These systems usually aim to obtain speech transcriptions 'cleaned' from phenomena that characterise spoken language, such as discourse markers, particles, pauses, or other phenomena commonly referred to as 'disfluencies'. Studies on the interpretability of the dynamics underlying neural models showed that state-of-the-art systems based on End-to-End models (E2E-ASRs) can model linguistic and acoustic features of spoken language, which can be investigated to explain their internal dynamics. Several probing techniques have been designed to inspect and better understand the internal behavior of DNN layers at different depths. With these techniques, investigations on the internals of DeepSpeech2 [2, 3] revealed the influence of diatopic pronunciation variation in various English varieties and provided evidence that intermediate layers contain information crucial for their classification. Later, a study [4] on the layer-wise capacity to encode information about acoustic features, phone identity, word identity, and word meaning based on the context of occurrence highlighted that
the last layer right before the decoding module retains information about word meaning, rather than the local acoustic features and phone identity information that are captured by the first and intermediate layers respectively. Then, other studies further investigated the capacity of state-of-the-art models to encode phonetic/phonemic information [5, 6], lexical tone [7] and gender [8]. Finally, [9] investigated the internal dynamics of three pre-trained E2E-ASRs, evidencing the emergence of syllable-related features by training an acoustic-syllable boundary detector. Following this line of research, a recent study [10] investigated the ability of two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders – a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder – to model features that distinguish filled pauses and prolongations in speech, and showed that, despite not being originally trained to detect disfluencies, these systems tend to model some speech features that are functional for their identification. Rather than disregarding the ability of E2E-ASRs to model the acoustic information tied to such speech phenomena as a dispensable noise source, this ability could be exploited to different ends. On the one hand, it could be used to obtain more accurate transcriptions that provide better, or rather more faithful, representations of the speech signal, which would also support linguistic annotation processes. On the other hand, exploring the systems' modelling ability deepens our understanding of their underlying dynamics. In the last 20 years, disfluency detection tasks have been conducted to improve speech recognition performance [11, 12], and various recent approaches to filler detection achieve rather high performance, see [13]. However, these investigations mostly concern filler particles and, to our knowledge, no such system has been tested on Italian data so far. The proposed work builds upon these findings by investigating which of the two decoding systems is better at performing a detection task for fillers and prolongations. Moreover, a quantitative and qualitative error analysis is conducted to deepen our understanding of the way these systems work.

2. Materials and Method

2.1. Data

In this study, we employed approximately 210 minutes of expert-annotated speech, divided into ∼80 minutes of informative speech [14], 90 minutes of descriptive speech [15] and approximately 40 minutes of dialogic speech [16], that is, dyads where two speakers recorded on different channels interact. While the data from [14] and [16] consist of speech produced by speakers of the Neapolitan variety of Italian, the speakers from [15] come from different Italian regions.

More specifically, the considered speech data include: audio-visual recordings of guided tours at San Martino Charterhouse (in Naples) led by three female expert guides (CHROME corpus [14]), which consist of informative semi-monologic, semi-spontaneous speech characterized by a high degree of discourse planning and an asymmetrical relationship between the speakers; audio-visual recordings of 10 speakers narrating 'Frog Stories' from a picture book [15], which elicited unplanned descriptive speech; and four task-oriented dialogues from the CLIPS corpus [16], which provide mainly descriptive semi-spontaneous speech characterized by a low degree of discourse planning and a high degree of collaboration between the interlocutors.

2.2. Annotation

Filled Pauses (FPs), defined as non-verbal fillers realized as vocalization and/or nasalization, and Prolongations (PRLs), defined as marked lengthenings of segmental material [17, 18], were manually annotated along with pauses, lexical fillers, repetitions, deletions, insertions, and substitutions, following the annotation scheme described in [19]. This is a multilevel annotation system developed to account for both the formal and the functional features of the phenomena used to manage one's own speech production. The identification of the different types of phenomena was based on a 'pragmatic approach' [20], meaning that it did not rely on absolute measures but on perceptual judgments given the specific contexts of occurrence. The reliability of the annotation and the Inter-Annotator Agreement were evaluated by measuring Cohen's κ, which yielded 0.92 for dialogic data and 0.82 for monologic data, standing for 'high agreement' [21].
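For concreteness, agreement figures of this kind can be reproduced with standard tooling. The sketch below uses scikit-learn's cohen_kappa_score on made-up annotator labels (0 = fluent, 1 = FP, 2 = PRL), not the corpus data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same ten intervals
annotator_a = [0, 0, 1, 2, 0, 1, 0, 2, 0, 1]
annotator_b = [0, 0, 1, 2, 0, 1, 0, 1, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.8 are read as 'high agreement' [21]
```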
2.3. Data Preparation

The considered dataset was prepared from a set of Praat TextGrid annotation files indicating the speaker and the type of disfluency, time-aligned with the speech signal. More specifically, considering only the PRLs and the FPs, the resulting dataset comprises 1900 segments. For each segment, the contextual information preceding and following the disfluency phenomenon was included, giving each segment a length of 4 seconds. Then, based on the combination of the so-composed dataset with each of the considered pre-trained models' encoders (details reported in Section 3.1), for each combination of segment and intermediate encoding layer the following elements were extracted:

• A sequence of intermediate-layer emissions/embeddings representing the input segment in the layer's vectorial space. Each emission in the sequence represents a portion of 40 milliseconds of the input signal, due to the considered models' characteristics.

• A sequence of labels associated with each sequence of emissions, indicating whether an intermediate emission belongs to a particular class of disfluencies (1 for FP and 2 for PRL) or not (label 0 if the frame does not belong to a disfluency).

The resulting dataset consists of pairs of sequences of emissions (i.e., distilled features) and corresponding labels, identified by the model and the layer from which they were extracted. Note that each sequence of intermediate-layer emissions has length h = 4 s / 40 ms = 100, as it represents the temporal succession of frames before, during, and after the disfluency phenomena. We use the term emission [10, 9] to indicate the firing of intermediate-layer neurons, instead of the more commonly used term embedding [8], as the latter is widely used to indicate the output of an entire module rather than of a single layer.
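This preparation step can be pictured with a small sketch (our own illustration, not the authors' code); the frame step of 40 ms and the segment length of 4 s follow the paper, while the interval format derived from the TextGrid files is a hypothetical simplification:

```python
import numpy as np

FRAME_MS = 40                       # one encoder emission covers 40 ms of signal
SEGMENT_MS = 4000                   # each segment spans 4 s around the disfluency
H = SEGMENT_MS // FRAME_MS          # h = 100 emissions (and labels) per segment

LABEL = {"FP": 1, "PRL": 2}         # 0 = frame outside any disfluency

def frame_labels(segment_start_ms, intervals):
    """Per-frame label sequence for one 4-second segment.

    `intervals` is a list of (start_ms, end_ms, kind) tuples taken from the
    TextGrid annotation, with kind in {"FP", "PRL"}.
    """
    labels = np.zeros(H, dtype=np.int64)
    for start, end, kind in intervals:
        # Mark every 40 ms frame that overlaps the annotated interval
        first = max(0, (start - segment_start_ms) // FRAME_MS)
        last = min(H, -(-(end - segment_start_ms) // FRAME_MS))  # ceil division
        labels[first:last] = LABEL[kind]
    return labels

# A filled pause annotated at 1.52-1.90 s within a segment starting at 0 ms
print(frame_labels(0, [(1520, 1900, "FP")]))  # frames 38-47 carry label 1
```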
3. Results

3.1. Disfluency Identification Through Model Probing

Building upon recent studies that make use of probes to better understand the internal behavior of pre-trained E2E-ASR models [9, 4, 3], we apply a similar approach to investigate whether, and to what extent, a pre-trained model (m) codifies disfluency-related features in its encoding module, even though it was not trained to do so. The employed approach builds specific classifiers whose inputs are the intermediate emissions of the considered model's encoder layers (l), combined with the appropriate sequence of labels derived from the dataset annotation. Internally, each classifier consists of a Long Short-Term Memory (LSTM) module followed by a Feed-Forward Neural Network (FFNN). Given that our problem can be framed as sequence classification, LSTMs seem to be the most naturally suited model [22]; usually, an LSTM consists of one computational unit that iteratively processes all input time-series vectors. This unit comprises three gates that process one vector at a time and combine it with information extracted from the previous vectors. One of the most crucial parameters of an LSTM is the hidden-layer size (n), therefore we investigate the impact of three different sizes, namely 160, 320 and 640. So, an LSTM-based classifier processes a sequence {e_{l,m}} of h emission vectors and produces a new sequence of vectors of size n. The two sequences are aligned over time. At each time step t, the FFNN produces a label indicating whether the considered input represents a specific disfluency segment (label 1 for filled pause, 2 for prolongation) or not (label 0), based on the LSTM hidden-layer output. In summary, we train and evaluate many different LSTM-based disfluency classifiers/detectors (L_{n,m,l}) for all possible n, m, and l combinations, to search for evidence of disfluency-related properties in the models' decisions.
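The probe just described maps naturally onto a few lines of PyTorch. The following is a minimal sketch under stated assumptions (the emission dimensionality of 512 and the training details are ours; the LSTM-plus-FFNN structure, the hidden sizes, and the three-way frame labels follow the paper):

```python
import torch
import torch.nn as nn

class DisfluencyProbe(nn.Module):
    """Unidirectional LSTM + FFNN probe with per-frame 3-way output
    (0 = non-disfluent, 1 = filled pause, 2 = prolongation)."""

    def __init__(self, emission_dim=512, hidden_size=640, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(emission_dim, hidden_size, batch_first=True)
        self.ffnn = nn.Linear(hidden_size, num_classes)

    def forward(self, emissions):
        # emissions: (batch, h=100, emission_dim), distilled from encoder layer l
        hidden, _ = self.lstm(emissions)     # (batch, 100, hidden_size)
        return self.ffnn(hidden)             # per-frame logits: (batch, 100, 3)

# One probe per (hidden size n, model m, layer l) combination; training would
# use per-frame cross-entropy against the label sequences of Section 2.3
probe = DisfluencyProbe(hidden_size=640)
logits = probe(torch.randn(8, 100, 512))     # dummy batch of distilled features
targets = torch.zeros(8, 100, dtype=torch.long)
loss = nn.CrossEntropyLoss()(logits.transpose(1, 2), targets)
```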
The goal is to explore which of the considered pre-trained E2E-ASR models, based on different decoding systems, better encodes characteristics associated with disfluent speech segments, so as to perform a filler and prolongation detection task. To this end, two publicly available [23] Conformer-based models [24] with 120 million parameters each, built with the NVIDIA NeMo toolkit and differing only in the decoding strategy, were selected. On the one hand, a Conformer-based model with a Connectionist Temporal Classification (CTC) [25] decoder was considered, as CTC is one of the most popular decoding techniques. CTC is a non-auto-regressive speech transcription technique that collapses consecutive identical transcription labels (character, word piece, etc.) into a single label, unless a special blank label separates them. The result is a sequence of labels shorter than, or equal in length to, the input vector sequence. Being non-auto-regressive, CTC is also considered computationally effective, as it requires less time and fewer resources for training and inference.
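As a toy illustration of the collapsing rule (our own sketch, not the NeMo implementation), the standard CTC reduction first merges runs of identical labels and then removes the blank symbol:

```python
def ctc_collapse(labels, blank="-"):
    """Standard CTC reduction: merge repeated labels, then drop blanks."""
    merged = []
    prev = None
    for lab in labels:
        if lab != prev:          # repeats merge unless a blank separates them
            merged.append(lab)
        prev = lab
    return [lab for lab in merged if lab != blank]

# The blank between the two 'l' frames preserves the double letter:
print(ctc_collapse(list("heeel-llo")))   # ['h', 'e', 'l', 'l', 'o']
```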
On the other hand, a Conformer-based model with a Recurrent Neural Network Transducer (RNN-T), commonly known as Transducer, was considered. The RNN-T is an auto-regressive speech transcription technique that overcomes CTC's limitations of being non-auto-regressive and of being restricted to label sequences no longer than the input. The Transducer decoding technique can produce label-transcription sequences longer than the input vector sequence and models inter-dependencies among long-term transcription elements. A Transducer typically comprises two sub-modules: one that forecasts the next transcription label based on the previous transcriptions (the prediction network) and one that combines the encoder and prediction-network outputs to produce a new transcription label (the joiner network). These features improve transcription speed and performance compared to CTC, while requiring more training and computational resources [26]. Note that both pre-trained models rely on the same encoder architecture, but the Conformer-CTC model has 18 encoding layers, while the Conformer-Transducer encoder has 17 layers.

In this study, ∼100 classifiers (2 models × ∼17 layers × 3 classifier sizes) were trained to investigate which of the considered pre-trained models, differing only in the decoding approach, encodes enough information to perform a disfluency detection task.

To evaluate the alignment between the output of the classifier and the reference label sequence, we employ the Dynamic Time Warping distance (DTW distance) [27], reported in Figure 2a. The DTW results highlight that the layers closer to the decoding module seem to contain most of the information needed for a correct detection of the considered disfluencies, reaching an average DTW distance of approximately 1.39 in all cases, with a considerably low standard error. Then, to evaluate the capability of each classifier to provide a correct as well as correctly aligned label sequence, we employed the weighted F1 measure, reported in Figure 2b. Also in this case, the F1 results confirm that the layers closer to the decoding module seem to be those containing most of the information needed to correctly identify the disfluency segments. The combination of F1 and DTW provides an integrated perspective on the system's ability to classify and align segments correctly.

Figure 2: (a) Average Dynamic Time Warping distance between sequences of labels, with standard error shading; (b) average weighted F1 between sequences of labels, with standard error shading, for all the trained classifiers. The x-axis indicates the index (starting from 0) of the intermediate layer from which the distilled features were extracted to train the corresponding classifier.
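Both measures operate on pairs of per-frame label sequences. The sketch below shows how they could be computed for one segment, with a textbook DTW implementation and scikit-learn's weighted F1 (the label arrays are illustrative, not the paper's data):

```python
import numpy as np
from sklearn.metrics import f1_score

def dtw_distance(a, b):
    """Textbook dynamic-time-warping distance between two label sequences,
    with |a_i - b_j| as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Reference vs. predicted frame labels for one segment (0 = none, 1 = FP, 2 = PRL):
# the prediction finds the filled pause but misplaces its boundaries slightly
y_true = np.array([0] * 30 + [1] * 20 + [0] * 50)
y_pred = np.array([0] * 28 + [1] * 24 + [0] * 48)

print("DTW distance:", dtw_distance(y_true, y_pred))
print("weighted F1: ", f1_score(y_true, y_pred, average="weighted"))
```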
Finally, in Figure 3 (a and b), we report the confusion matrices of the best classifiers obtained for each considered model. On the one side, the CTC-based classifier seems to be better at discriminating non-disfluent segments (ND), while showing the worst performance in disfluency identification. On the other side, the RNN-T-based classifier shows considerable performance in identifying FPs and is the worst at discriminating ND segments, while its PRL performance is comparable to that of the CTC-based classifier. Both matrices highlight that the most difficult disfluency phenomena to classify are prolongations, which are the focus of our preliminary exploratory error analysis.

Figure 3: Confusion matrices for the best classifiers obtained for each of the considered decoding approaches: (a) the CTC-based classifier with hidden size 640 trained on distilled features from layer 18 (index 17 in the F1/DTW plots); (b) the RNN-T-based classifier with hidden size 640 trained on distilled features from layer 16 (index 15 in the F1/DTW plots).

3.2. Qualitative Analysis

The qualitative analysis is based on the best classifier for each of the considered models used to generate the distilled features. In particular, for the CTC version, the best classifier was the one with 640 hidden neurons trained on 18th-layer features. Among the Transducer-based versions, the one with 640 hidden neurons trained on 17th-layer features emerged as the best.

The visual inspection of the distribution of the considered phenomena highlights that, for both the CTC (Figure 4a) and the RNN Transducer classifiers (Figure 4b), FP phenomena concentrate on higher weighted F1 values, whereas wider distributions are observed for PRL phenomena, which shows that both classifiers work better when dealing with FP than with PRL phenomena. Focusing on the PRL instances, a negative correlation is observed between the weighted F1 scores and the PRLs' duration (CTC non-recognized r = -0.91, Figure 4c; RNN Transducer non-recognized r = -0.87, Figure 4d).

Figure 4: The plots in (a) for CTC and (b) for RNN-T report the F1 measure related to the frequency of FP (yellow) and PRL (purple). Scatterplots for CTC (c) and RNN-T (d) compare the duration of the PRL segments with the respective F1 measure.
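The reported r values are correlation coefficients between per-instance PRL duration and the corresponding F1; assuming Pearson's r (the paper does not name the estimator), such a value could be computed as follows, on illustrative numbers rather than the paper's data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical PRL instances: duration (seconds) vs. per-instance weighted F1
durations = np.array([0.15, 0.22, 0.30, 0.41, 0.55, 0.70, 0.92])
f1_scores = np.array([0.85, 0.80, 0.71, 0.60, 0.48, 0.35, 0.20])

r, p = pearsonr(durations, f1_scores)
print(f"r = {r:.2f} (p = {p:.3f})")   # strongly negative, as in Figures 4c-4d
```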
The error analysis was supported by an auditory inspection of the unrecognized and misclassified samples, filtered based on the average DTW distance, namely 1.39 for the Transducer-based and 1.40 for the CTC-based classifier. Issues in PRL recognition mostly concerned shorter instances, those characterized by peculiar 'non-prototypical' phonation features (such as unsteady, creaky phonation), and the alignment of PRL-predicted occurrences. Also, several PRL phenomena were misclassified as FP when occurring with monosyllabic words, such as 'o', 'un po', 'che', 'e'. In fact, the phonetic realization of these instances is closer to the one that characterizes FPs, given their vowel quality and their being, to a certain extent, elements independent of the phonetic environment.

4. Discussion and Conclusions

In this work, we build upon a previous study that investigated to what extent modern E2E-ASRs encode features related to disfluency phenomena, even though they are not directly trained to do so. We showed that pre-trained models with the same audio encoder but two different state-of-the-art decoding strategies (CTC and Transducer) capture disfluency-related features, especially in the latest encoding layers, and that both model features that can be used for the identification and positioning of disfluent speech segments [10]. Although there seems to be a tendency to forget this information across subsequent layers, as the trends for DTW (Figure 2a) and F1 (Figure 2b) would suggest, the last layers, which are those closest to the objective function represented by the decoding module, seem the most prone to retain characteristics useful to locate and identify disfluency phenomena. Interestingly, despite the differences between the two decoding modules, which are respectively non-recurrent (CTC) and recurrent (RNN-T), the performances on the chosen task are comparable. However, the confusion matrices highlight that the CTC-based classifier performs better in the disfluency feature discrimination task, while the Transducer-based classifier more precisely identifies filled pauses, which could be related to the scope (recurrent/non-recurrent) of the objective function. The results align with the literature, which shows a strong sensitivity to word- and phone-related features in the layers closest to the decoder [4], while the layers closest to the input are more sensitive to features related to accent and local acoustic characteristics [3, 4]. It is worth noticing that, in a recent work [9], sensitivity to syllabic boundaries was found in layers 3–5, with a pattern similar to the one shown in Figure 2 but without the peak in the last layers. The reason may lie in the fact that syllables and their boundaries have no graphic counterpart in the transcriptions, whereas, in the case of disfluencies, there is a form of transcription that identifies them within a language model.

The exploratory analysis of the errors highlighted that prolongations are more difficult to detect than filled pauses, which could depend on their being an integral (though lengthened) part of 'fluent' words, while filled pauses are mostly realized as independent elements. Also, instances of prolongation are mostly non-recognized, or misclassified as filled pauses, when characterized by peculiar 'non-prototypical' phonation features, such as creaky phonation, or by filler-like features, as in the case of monosyllabic word-final prolongations. Moreover, previous studies on the segmental quality of prolongations in Italian [28] showed that prolongations, especially those concerning consonantal sounds, can be realised with schwa-like sounds similar to those that characterize most filled pauses. This filler-like quality could also be considered among the underlying reasons for the negative correlation between the evaluation metrics for prolongations and their duration. Another possible explanation could reside in a bias in the dataset combined with the classifier architecture (LSTM), which easily recognises prolongations responding to a specific length pattern: the scarcity of longer prolongations hinders their modelling, leading to their misclassification.

These findings could be used to improve transcription applications by enriching them with disfluency annotation (including filler particles and prolongation phenomena), which is still a rather costly process for studies concerning hesitation phenomena and (own) speech management in typical as well as atypical speech (e.g., pathological or language learners' speech). Indeed, an immediate development of the described work consists in increasing the capabilities of the pre-trained E2E-ASRs by adding a simple disfluency identification module to complement the existing decoder, thus enriching the resulting transcriptions.

Our work is built upon unidirectional LSTMs rather than bidirectional LSTMs (BiLSTMs), which would provide better performance: the latter have slightly longer inference times, require larger amounts of data, resources and time to be trained and, most importantly, present a more complex behaviour [29]. However, the introduction of different architectural modules such as bidirectional LSTMs could improve the detection of prolongation disfluencies. This will be part of future developments focused on performance and increased neural network complexity.
References

[1] S. Wang, G. Li, Overview of end-to-end speech recognition, in: Journal of Physics: Conference Series, volume 1187, IOP Publishing, 2019, p. 052068.
[2] T. Viglino, P. Motlicek, M. Cernak, End-to-end accented speech recognition, in: Interspeech, 2019, pp. 2140–2144.
[3] A. Prasad, P. Jyothi, How accents confound: Probing for accent information in end-to-end speech recognition systems, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3739–3753.
[4] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921.
[5] P. C. English, J. Kelleher, J. Carson-Berndsen, Domain-informed probing of wav2vec 2.0 embeddings for phonetic features, in: Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2022, pp. 83–91.
[6] K. Martin, J. Gauthier, C. Breiss, R. Levy, Probing self-supervised speech models for phonetic and phonemic information: A case study in aspiration, in: INTERSPEECH 2023, 2023, pp. 251–255. doi:10.21437/Interspeech.2023-2359.
[7] G. Shen, M. Watkins, A. Alishahi, A. Bisazza, G. Chrupała, Encoding of lexical tone in self-supervised models of spoken language, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 4250–4261. URL: https://aclanthology.org/2024.naacl-long.239. doi:10.18653/v1/2024.naacl-long.239.
[8] A. Krishnan, B. M. Abdullah, D. Klakow, On the encoding of gender in transformer-based ASR representations, in: Interspeech 2024, 2024, pp. 3090–3094. doi:10.21437/Interspeech.2024-2209.
[9] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique, Neural Computing and Applications (2024) 1–27.
[10] V. N. Vitale, L. Schettino, F. Cutugno, Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers' ability to model hesitation phenomena, in: Interspeech 2024, 2024, pp. 222–226. doi:10.21437/Interspeech.2024-2029.
[11] M. Gabrea, D. O'Shaughnessy, Detection of filled pauses in spontaneous conversational speech, in: 6th International Conference on Spoken Language Processing (ICSLP 2000), ISCA, 2000, vol. 3, pp. 678–681. URL: https://www.isca-archive.org/icslp_2000/gabrea00_icslp.html. doi:10.21437/ICSLP.2000-626.
[12] E. Shriberg, Spontaneous speech: how people really talk and why engineers should care, in: INTERSPEECH, 2005, pp. 1781–1784.
[13] V. Kany, J. Trouvain, Semiautomatic support of speech fluency assessment by detecting filler particles and determining speech tempo, in: Workshop on prosodic features of language learners' fluency, 2024.
[14] A. Origlia, R. Savy, I. Poggi, F. Cutugno, I. Alfano, F. D'Errico, L. Vincze, V. Cataldo, An audiovisual corpus of guided tours in cultural sites: Data collection protocols in the CHROME project, in: Proceedings of the 2018 AVI-CH Workshop on Advanced Visual Interfaces for Cultural Heritage, volume 2091, 2018, pp. 1–4.
[15] G. Sarro, The many ways to search for an Italian frog. The Manner encoding in an Italian corpus collected with Modokit, Master's thesis, Università degli Studi dell'Aquila, 2023.
[16] R. Savy, F. Cutugno, Diatopic, diamesic and diaphasic variations in spoken Italian, in: M. Mahlberg, V. González-Díaz, C. Smith (Eds.), Proceedings of CL2009, The 5th Corpus Linguistics Conference, 20–23 July 2009, Liverpool, UK, 2009, pp. 20–23.
[17] R. Eklund, Disfluency in Swedish Human–Human and Human–Machine travel booking dialogues, Ph.D. thesis, Linköping University Electronic Press, 2004.
[18] S. Betz, Hesitations in Spoken Dialogue Systems, Ph.D. thesis, Universität Bielefeld, 2020.
[19] L. Schettino, The Role of Disfluencies in Italian Discourse. Modelling and Speech Synthesis Applications, Ph.D. thesis, Università degli Studi di Salerno, 2022.
[20] R. J. Lickley, Fluency and disfluency, in: M. A. Redford (Ed.), The Handbook of Speech Production, Wiley Online Library, 2015, pp. 445–474. doi:10.1002/9781118584156.ch20.
[21] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977) 159–174.
[22] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[23] NVIDIA, NVIDIA catalog for pre-trained Conformer models, 2023. URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_{transducer|ctc}_large.
[24] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition, arXiv preprint arXiv:2005.08100 (2020).
[25] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[26] A. Graves, Sequence transduction with recurrent neural networks, arXiv preprint arXiv:1211.3711 (2012).
[27] M. Müller, Dynamic time warping, Information Retrieval for Music and Motion (2007) 69–84.
[28] L. Schettino, R. Eklund, Prolongation in Italian, in: Proceedings of the Disfluency in Spontaneous Speech Workshop 2023 (DiSS 2023), 28–30 August 2023, Bielefeld, Germany, 2023, pp. 81–85.
[29] S. Siami-Namini, N. Tavakoli, A. S. Namin, The performance of LSTM and BiLSTM in forecasting time series, in: 2019 IEEE International Conference on Big Data (Big Data), IEEE, 2019, pp. 3285–3292.