Modelling filled particles and prolongation using end-to-end Automatic Speech Recognition systems: a quantitative and qualitative analysis

Vincenzo Norman Vitale¹,†, Loredana Schettino²,† and Francesco Cutugno¹

¹ University of Naples Federico II, Naples, Italy
² Free University of Bozen-Bolzano, Bozen, Italy

Abstract
State-of-the-art automatic speech recognition systems based on End-to-End models (E2E-ASRs) achieve remarkable performances. However, phenomena that characterize spoken language, such as fillers ( ) or segmental prolongations (the), are still mostly considered as disrupting objects that should not be included to obtain optimal transcriptions, despite their acknowledged regularity and communicative value. A recent study showed that two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders – a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder – tend to model some speech features that are functional for the identification of filled pauses and prolongations in speech. This work builds upon these findings by investigating which of the two systems performs better on filler and prolongation detection tasks, and by conducting an error analysis to deepen our understanding of how these systems work.

Keywords
disfluencies, speech recognition, probing, interpretability, explainability

1. Introduction

In recent work on Automatic Speech Recognition (ASR) systems based on the computing power of Deep Neural Networks (DNNs), a great deal of effort is devoted to increasing the systems' performance by employing increasingly complex, hence hardly interpretable, DNN models that require huge amounts of training data, such as End-to-End Automatic Speech Recognition (E2E-ASR) models, which represent the state of the art.
An E2E-ASR model directly converts a sequence of input acoustic feature vectors (or possibly raw audio samples) into a series of graphemes or words that represent the transcription of the audio signal [1], as represented in Figure 1. In contrast, traditional ASR systems typically train the acoustic, pronunciation, and language models separately, requiring distinct modelling and training for each component. These systems usually aim to obtain speech transcriptions 'cleaned' from phenomena that characterise spoken language, such as discourse markers, particles, pauses, or other phenomena commonly referred to as 'disfluencies'.

Figure 1: E2E-ASRs are based on an encoder-decoder architecture. The speech signal is fed to the encoder, producing an encoded representation that contains the information needed by the decoder to provide the sequence of words/characters/subwords and build the transcription.

Studies on the interpretability of the dynamics underlying neural models showed that state-of-the-art systems based on End-to-End models (E2E-ASRs) can model linguistic and acoustic features of spoken language, which can be investigated to explain their internal dynamics. Several probing techniques have been designed to inspect and better understand the internal behavior of DNN layers at different depths. With these techniques, investigations on the internals of DeepSpeech2 [2, 3] revealed the influence of diatopic pronunciation variation in various English varieties and provided evidence that intermediate layers contain information crucial for their classification. Later, a study [4] on the layerwise capacity to encode information about acoustic features, phone identity, word identity, and word meaning based on the context of occurrence highlighted that the last layer, right before the decoding module, retains information about word meaning rather than the local acoustic features and phone identity information that are captured by the first and intermediate layers, respectively. Then, other studies have further investigated the capacity of state-of-the-art models to encode phonetic/phonemic information [5, 6], lexical tone [7] and gender [8]. Finally, [9] investigated the internal dynamics of three pre-trained E2E-ASRs, evidencing the emergence of syllable-related features by training an acoustic-syllable boundary detector.

Following this line of research, a recent study [10] investigated the ability of two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders – a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder – to model features that distinguish filled pauses and prolongations in speech, and showed that, despite not being originally trained to detect disfluencies, these systems tend to model some speech features that are functional for their identification. Rather than disregarding the ability of E2E-ASRs to model the acoustic information tied to such speech phenomena as a dispensable noise source, it could be exploited to achieve different ends. On the one hand, it could be used to obtain more accurate transcriptions that provide better, or rather more faithful, representations of the speech signal, which would also support linguistic annotation processes. On the other hand, exploring the systems' modelling ability deepens our understanding of their underlying dynamics.

In the last 20 years, disfluency detection tasks have been conducted to improve speech recognition performance [11, 12], and different recent approaches to filler detection achieve rather high performance, see [13]. However, these investigations mostly concern filler particles and, to our knowledge, no such system has been tested on Italian data so far. The proposed work aims to build upon these findings by investigating which of the two decoding systems is better at performing a detection task for fillers and prolongations. Moreover, a quantitative and qualitative error analysis is conducted to deepen our understanding of the way these systems work.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author. † These authors contributed equally.
vincenzonorman.vitale@unina.it (V. N. Vitale); lschettino@unibz.it (L. Schettino); cutugno@unina.it (F. Cutugno)
ORCID: 0000-0002-0365-8575 (V. N. Vitale); 0000-0002-3788-3754 (L. Schettino); 0000-0001-9457-6243 (F. Cutugno)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Materials and Method

2.1. Data

In this study, we employed approximately 210 minutes of expert-annotated speech, respectively divided into ∼80 minutes of informative speech [14], 90 minutes of descriptive speech [15], and approximately 40 minutes of dialogic speech [16], that is, dyads where two speakers recorded on different channels interact. While the data from [14] and [16] consist of speech produced by speakers of the Neapolitan variety of Italian, the speakers from [15] come from different Italian regions.

More specifically, the considered speech data include: audio-visual recordings of guided tours at San Martino Charterhouse (in Naples) led by three female expert guides (CHROME corpus [14]), which consist of informative semi-monologic, semi-spontaneous speech characterized by a high degree of discourse planning and an asymmetrical relationship between the speakers; audio-visual recordings of 10 speakers narrating 'Frog Stories' from a picture book [15], which elicited unplanned descriptive speech; and four task-oriented dialogues from the CLIPS corpus [16], which provide mainly descriptive semi-spontaneous speech characterized by a low degree of discourse planning and a high degree of collaboration between the interlocutors.

2.2. Annotation

Filled Pauses (FPs), defined as non-verbal fillers realized as vocalization and/or nasalization, and Prolongations (PRLs), defined as marked lengthenings of segmental material [17, 18], were manually annotated along with pauses, lexical fillers, repetitions, deletions, insertions, and substitutions, following the annotation scheme described in [19]. This is a multilevel annotation system developed to account for both formal and functional features of phenomena used to manage one's own speech production. The identification of the different types of phenomena was based on a 'pragmatic approach' [20], which means that it did not rely on absolute measures but on perceptual judgments given the specific contexts of occurrence. The reliability of the annotation and the Inter-Annotator Agreement were evaluated by measuring Cohen's κ, which yielded 0.92 for dialogic data and 0.82 for monologic data, which stands for 'high agreement' [21].

2.3. Data Preparation

The considered dataset has been prepared based on a set of Praat TextGrid annotation files indicating the speaker and the type of disfluency according to the speech signal. More specifically, considering only the PRLs and the FPs, the resulting dataset has a dimension of 1900 segments. For each segment, the contextual information preceding and following the disfluency phenomenon has been considered, giving each segment a length of 4 seconds.

Then, based on the combination of the so-composed dataset with each of the considered pre-trained models' encoders (details reported in Section 3.1), for each combination of segment and each intermediate encoding layer the following elements were extracted:

• A sequence of intermediate layer emissions/embeddings representing the input segment in the layer's vectorial space. Each emission in the sequence represents a portion of 40 milliseconds of the input signal, due to the considered model's characteristics.
• A sequence of labels associated with each sequence of emissions, indicating whether an intermediate emission belongs to a particular class of disfluencies (1 for FP and 2 for PRL) or not (label 0 if the segment does not belong to a disfluency).

Figure 2: Dynamic Time Warping distance (a) and Weighted F1 (b) for all the trained classifiers. The x-axis indicates the index (starting from 0) of the intermediate layer from which the distilled features have been extracted to train the corresponding classifier. (a) Average Dynamic Time Warping distance measured between sequences of labels, with standard error (shade). (b) Average Weighted F1 measured between sequences of labels, with standard error (shade).

3. Results

3.1. Disfluency Identification Through Model Probing

Building upon recent studies that make use of probes to better understand the internal behavior of pre-trained E2E-ASR models [9, 4, 3], we apply a similar approach to investigate if and to which extent a pre-trained model (𝑚) can codify disfluency-related features in the encoding module, even if it was not trained to do so.
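The frame-labelling step described in Section 2.3 can be sketched as follows. This is an illustrative snippet, not the authors' code: the `frame_labels` helper and the interval format are hypothetical, but the quantities match the paper (4-second windows, 40-millisecond frames, labels 0/1/2).

```python
# Illustrative sketch of the frame-labelling step of Section 2.3 (not the
# authors' code; `frame_labels` and the interval format are hypothetical).
# Each 4-second segment maps to 4 s / 40 ms = 100 frame labels:
# 0 = no disfluency, 1 = filled pause (FP), 2 = prolongation (PRL).

FRAME_MS = 40      # one encoder emission covers 40 ms of input signal
WINDOW_MS = 4000   # each segment spans 4 seconds of context

LABELS = {"FP": 1, "PRL": 2}

def frame_labels(intervals, window_start_ms=0):
    """intervals: list of (start_ms, end_ms, kind), with kind in {"FP", "PRL"}."""
    n_frames = WINDOW_MS // FRAME_MS            # 100 frames per segment
    labels = [0] * n_frames
    for start, end, kind in intervals:
        first = max(0, (start - window_start_ms) // FRAME_MS)
        last = min(n_frames, -(-(end - window_start_ms) // FRAME_MS))  # ceil
        for i in range(first, last):
            labels[i] = LABELS[kind]
    return labels

# An FP from 1.00-1.20 s and a PRL from 2.50-2.58 s inside one window:
seq = frame_labels([(1000, 1200, "FP"), (2500, 2580, "PRL")])
print(len(seq), seq[25:30], seq[62:65])   # → 100 [1, 1, 1, 1, 1] [2, 2, 2]
```

Any frame that overlaps an annotated interval, even partially, inherits its label, which is one simple way to obtain the emission-aligned label sequences the probes are trained on.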
The employed approach aims at building specific classifiers whose inputs are represented by the intermediate emissions of the considered model's encoder layers (𝑙), combined with the appropriate sequence of labels based on the dataset annotation. The resulting dataset consists of pairs of sequences of emissions (i.e., distilled features) and corresponding labels, identified by the model and the layer from which they were extracted. Note that each sequence of intermediate layer emissions has a length ℎ = 4 seconds / 40 milliseconds, as it represents the temporal succession of segments before, during, and after the disfluency phenomena. We use the term emission [10, 9] to indicate the firing of intermediate layer neurons, instead of the more commonly used term embedding [8], as the latter is widely used to indicate the output of an entire module rather than of a layer.

Internally, each classifier consists of a Long Short-Term Memory (LSTM) module followed by a Feed Forward Neural Network (FFNN). Given that our problem can be related to sequence classification, LSTMs seem to be the most naturally suited model [22]; usually, an LSTM consists of one computational unit that iteratively processes all input time-series vectors. This unit comprises three gates processing one vector at a time and combining it with information extracted from the previous vectors. One of the most crucial parameters for an LSTM is the hidden layer, therefore we investigate the impact of three different layer sizes (hidden-layer size, 𝑛), namely 160, 320 and 640. So, an LSTM-based classifier processes a sequence of {𝑒_{𝑙,𝑚}} emission vectors (each sequence of length ℎ) and produces a new sequence of vectors of size 𝑛. The two sequences are aligned over time. At each time step 𝑡, the FFNN produces a label indicating whether the considered input represents a specific disfluency segment (label 1 for filled pause or 2 for prolongation) or not (label 0), based on the LSTM hidden-layer output. In summary, we train and evaluate many different LSTM-based disfluency classifiers/detectors (𝐿_{𝑛,𝑚,𝑙}) for all possible 𝑛, 𝑚, and 𝑙 combinations to search for evidence of disfluency-related properties in the models' decisions.

The goal is to explore which of the considered pre-trained E2E-ASR models, based on different decoding systems, better encodes characteristics associated with disfluent speech segments to perform a filler and prolongation detection task. To this end, two publicly available [23] Conformer-based models [24] with 120 million parameters each, built with the NVIDIA NeMo toolkit and differing only in the decoding strategy, were selected. On the one hand, a Conformer-based model with a Connectionist Temporal Classification (CTC) [25] decoder has been considered, as CTC is one of the most popular decoding techniques. CTC is a non-auto-regressive speech transcription technique that collapses consecutive, all-equal transcription labels (characters, word pieces, etc.) into one label, unless a special label separates them. The result is a sequence of labels shorter than or equal to the input vector sequence. Being non-auto-regressive, it is also considered computationally effective, as it requires less time and fewer resources for the training and inference phases. On the other hand, a Conformer-based model with the Recurrent Neural Network Transducer (RNN-T), commonly known as Transducer, has been considered. The RNN-T is an auto-regressive speech transcription technique that overcomes CTC's limitations of being non-auto-regressive and of bounded label sequence length. The Transducer decoding technique can produce label-transcription sequences longer than the input vector sequence and models the inter-dependency of long-term transcription elements. A Transducer typically comprises two sub-modules: one that forecasts the next transcription label based on the previous transcriptions (prediction network), and one that combines the encoder and prediction-network outputs to produce a new transcription label (joiner network). These features improve transcription speed and performance compared to CTC, while requiring more training and computational resources [26]. Note that both pre-trained models rely on the same encoder architecture, but the Conformer-CTC model has 18 encoding layers, while the Conformer-Transducer encoder has 17 layers.

In this study, ∼100 classifiers (2 models × ∼17 layers × 3 classifier sizes) were trained to investigate which of the considered pre-trained models, differing only by the decoding approach, encodes enough information to perform a disfluency detection task. To evaluate the alignment between the output of the classifier and the reference label sequence, we employ the Dynamic Time Warping distance (DTW distance) [27], reported in Figure 2a. The DTW results highlight that the layers closer to the decoding module seem to contain most of the information needed to perform a correct detection of the considered disfluencies, obtaining an average DTW distance of approximately 1.39 in all cases, with a considerably low standard error. Then, to evaluate the capability of each classifier to provide a correct as well as aligned label sequence, we employed the weighted F1 measure, reported in Figure 2b. Also in this case, the F1 results confirm that the layers closer to the decoding module seem to be those containing most of the information needed to correctly identify the disfluency segment. The combination of F1 and DTW provides an integrated perspective on the system's ability to classify and align segments correctly.

Figure 3: Confusion matrices for the best classifiers obtained for each of the considered decoding approaches. (a) CTC-based classifier with hidden size 640 trained on distilled features from layer 18 (index 17 in the F1/DTW plots). (b) RNN-T-based classifier with hidden size 640 trained on distilled features from layer 16 (index 15 in the F1/DTW plots).

Finally, in Figure 3 (a and b), we report the confusion matrices of the best classifiers obtained from each considered model. On the one side, the CTC seems to be better at discriminating non-disfluent segments (ND), while showing the worst performance in disfluency identification. On the other side, the RNN-T-based classifier shows considerable performance at identifying FPs and is the worst at discriminating ND segments, while its PRL performance is comparable to the CTC classifier's. Both matrices highlight that the most difficult disfluency phenomena to classify are prolongations, which are the focus of our preliminary exploratory error analysis.

3.2. Qualitative Analysis

The qualitative analysis is based on the best classifier for each of the considered models used to generate the distilled features. In particular, for the CTC version, the best classifier resulted to be the one with 640 hidden neurons trained on 18th-layer features. Among the Transducer-based versions, the one with 640 hidden neurons trained on 17th-layer features emerged as the best version.

Figure 4: The plots in (a) for CTC and (b) for RNN-T report the F1 measure related to the frequency of FP (yellow) and PRL (purple). Scatterplots for CTC (c) and RNN-T (d) compare the duration of the PRL segments with the respective F1 measure.

The visual inspection of the distribution of the considered phenomena highlights that, for both the CTC (4a) and the RNN Transducer classifiers (4b), FP phenomena concentrate on higher weighted F1 values, whereas wider distributions are observed for PRL phenomena, which shows that both classifiers work better when dealing with FP than with PRL phenomena. Focusing on the PRL instances, a negative correlation is observed between the weighted F1 scores and the PRLs' duration (CTC non-recognized r = −0.91, Figure 4c; RNN Transducer non-recognized r = −0.87, Figure 4d).

The error analysis was supported by an auditory inspection of the unrecognized and misclassified samples, filtered based on the average DTW distance, namely 1.39 for the Transducer-based and 1.40 for the CTC-based classifier. Issues in PRL recognition mostly concerned shorter instances, those characterized by peculiar 'non-prototypical' phonation features (such as unsteady, creaky phonation), and the alignment of PRL-predicted occurrences. Also, several PRL phenomena were misclassified as FP when occurring with monosyllabic words, such as 'o', 'un po'', 'che', 'e'. In fact, the phonetic realization of these instances is closer to the ones that characterize FPs, both for their vowel quality and for being, to a certain extent, elements independent from the phonetic environment.
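The two evaluation measures used in Section 3.1, the DTW distance and the weighted F1 between predicted and reference label sequences, can be illustrated on toy label sequences. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and a unit substitution cost is assumed for the DTW.

```python
# Illustrative sketch (not the authors' code) of the two measures of
# Section 3.1: a DTW distance between predicted and reference label
# sequences, and a weighted F1 over frame labels (0 = none, 1 = FP, 2 = PRL).

def dtw_distance(a, b, cost=lambda x, y: 0 if x == y else 1):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = cost(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def weighted_f1(ref, hyp):
    """F1 per class, averaged with weights proportional to class support."""
    total = len(ref)
    score = 0.0
    for c in sorted(set(ref)):
        tp = sum(1 for r, h in zip(ref, hyp) if r == c and h == c)
        fp = sum(1 for r, h in zip(ref, hyp) if r != c and h == c)
        fn = sum(1 for r, h in zip(ref, hyp) if r == c and h != c)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        score += (ref.count(c) / total) * f1
    return score

ref = [0, 0, 1, 1, 1, 0, 2, 2, 0, 0]   # reference frame labels
hyp = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]   # prediction shifted by one frame
print(dtw_distance(ref, hyp), round(weighted_f1(ref, hyp), 2))   # → 1 0.8
```

The example shows why the two measures are complementary: a prediction misaligned by one frame is barely penalized by the warping-based DTW distance but loses weighted F1, which scores the frame-by-frame classification.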
4. Discussion and Conclusions

In this work, we build upon a previous study that investigated to what extent modern E2E-ASRs encode features related to disfluency phenomena, even if they are not directly trained to do so. We showed that pre-trained models with the same audio encoder but two different state-of-the-art decoding strategies (CTC and Transducer) capture disfluency-related features, especially in the latest encoding layers, and that both model features that can be used for the identification and positioning of disfluent speech segments [10]. Although there seems to be a tendency to forget this information in subsequent layers, as the trends for DTW (Figure 2a) and the F1 measure (Figure 2b) would suggest, the last layers, which are those closest to the objective function represented by the decoding module, seem the most prone to retain characteristics useful to locate and identify disfluency phenomena. Interestingly, despite the differences between the two decoding modules, which are respectively non-recurrent (CTC) and recurrent (RNN-T), the performances on the chosen task are comparable. However, the confusion matrices highlight that the CTC-based classifier performs better in the disfluency feature discrimination task, while the Transducer-based classifier more precisely identifies filled pauses, which could be related to the scope (recurrent/non-recurrent) of the objective function. The results align with the literature, which shows a strong sensitivity to features concerning words and phones in the layers closest to the decoder [4], while the layers closest to the input are more sensitive to features related to accent and local acoustic characteristics [3, 4]. It is worth noticing that, in a recent work [9], sensitivity to syllabic boundaries was found in layers 3-5, with a pattern similar to the one shown in Figure 2 but without the peak in the last layers. The reason can be found in the fact that syllables and their boundaries have no graphic distinction in the transcriptions; conversely, in the case of disfluencies, there is a form of transcription that identifies them within a language model.

The exploratory analysis of the errors highlighted that prolongations are more difficult to detect than filled pauses, which could depend on their being an integral (though lengthened) part of 'fluent' words, while filled pauses are mostly realized as independent elements. Also, instances of prolongation are mostly non-recognized, or misclassified as filled pauses, when characterized by peculiar 'non-prototypical' phonation features, such as creaky phonation, or by filler-like features, as in the case of monosyllabic word-final prolongations. Moreover, previous studies on the segmental quality of prolongations in Italian [28] showed that prolongations, especially when concerning consonantal sounds, can be realised with schwa sounds similar to those that characterize most filled pauses. This filler-like quality could also be considered among the underlying reasons for the negative correlation between the evaluation metrics of prolongation misclassification and their duration. Another possible motivation could reside in a bias in the dataset combined with the classifier architecture (LSTM), which easily recognises prolongations responding to a specific length pattern. This means that the scarcity of longer prolongations hinders their modelling, leading to their misclassification.

These findings could be used to improve transcription applications by enriching them with disfluency annotation (including filler particles and prolongation phenomena), which is still a rather costly process for studies concerning hesitation phenomena and (own) speech management in typical as well as atypical speech (e.g., pathological or language learners' speech). Indeed, an immediate development of the described work consists of increasing the capabilities of the pre-trained E2E-ASRs by adding a simple disfluency identification module to complement the existing decoder, thus enriching the resulting transcriptions.

Our work is built upon unidirectional LSTMs rather than bidirectional LSTMs (BiLSTMs), which would provide better performance, because the latter have slightly longer inference times and require a larger amount of data, resources, and time to be trained and, most importantly, present a more complex behaviour [29]. However, the introduction of different architectural modules like bidirectional LSTMs could improve the detection of prolongation disfluencies. This will be part of future developments focused on performance and increased neural network complexity.

References

[1] S. Wang, G. Li, Overview of end-to-end speech recognition, in: Journal of Physics: Conference Series, volume 1187, IOP Publishing, 2019, p. 052068.
[2] T. Viglino, P. Motlicek, M. Cernak, End-to-end accented speech recognition, in: Interspeech, 2019, pp. 2140–2144.
[3] A. Prasad, P. Jyothi, How accents confound: Probing for accent information in end-to-end speech recognition systems, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3739–3753.
[4] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921.
[5] P. C. English, J. Kelleher, J. Carson-Berndsen, Domain-informed probing of wav2vec 2.0 embeddings for phonetic features, in: Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2022, pp. 83–91.
[6] K. Martin, J. Gauthier, C. Breiss, R. Levy, Probing self-supervised speech models for phonetic and phonemic information: A case study in aspiration, in: INTERSPEECH 2023, 2023, pp. 251–255. doi:10.21437/Interspeech.2023-2359.
[7] G. Shen, M. Watkins, A. Alishahi, A. Bisazza, G. Chrupała, Encoding of lexical tone in self-supervised models of spoken language, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 4250–4261. URL: https://aclanthology.org/2024.naacl-long.239. doi:10.18653/v1/2024.naacl-long.239.
[8] A. Krishnan, B. M. Abdullah, D. Klakow, On the encoding of gender in transformer-based ASR representations, in: Interspeech 2024, 2024, pp. 3090–3094. doi:10.21437/Interspeech.2024-2209.
[9] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique, Neural Computing and Applications (2024) 1–27.
[10] V. N. Vitale, L. Schettino, F. Cutugno, Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers' ability to model hesitation phenomena, in: Interspeech 2024, 2024, pp. 222–226. doi:10.21437/Interspeech.2024-2029.
[11] M. Gabrea, D. O'Shaughnessy, Detection of filled pauses in spontaneous conversational speech, in: 6th International Conference on Spoken Language Processing (ICSLP 2000), ISCA, 2000, vol. 3, pp. 678–681. URL: https://www.isca-archive.org/icslp_2000/gabrea00_icslp.html. doi:10.21437/ICSLP.2000-626.
[12] E. Shriberg, Spontaneous speech: how people really talk and why engineers should care, in: INTERSPEECH, Citeseer, 2005, pp. 1781–1784.
[13] V. Kany, J. Trouvain, Semiautomatic support of speech fluency assessment by detecting filler particles and determining speech tempo, in: Workshop on prosodic features of language learners' fluency, 2024.
[14] A. Origlia, R. Savy, I. Poggi, F. Cutugno, I. Alfano, F. D'Errico, L. Vincze, V. Cataldo, An audiovisual corpus of guided tours in cultural sites: Data collection protocols in the CHROME project, in: Proceedings of the 2018 AVI-CH Workshop on Advanced Visual Interfaces for Cultural Heritage, volume 2091, 2018, pp. 1–4.
[15] G. Sarro, The many ways to search for an Italian frog. The Manner encoding in an Italian corpus collected with Modokit, Master's thesis, Università degli Studi dell'Aquila, 2023.
[16] R. Savy, F. Cutugno, Diatopic, diamesic and diaphasic variations in spoken Italian, in: M. Mahlberg, V. González-Díaz, C. Smith (Eds.), Proceedings of CL2009, The 5th Corpus Linguistics Conference, 20–23 July 2009, Liverpool, UK, 2009, pp. 20–23.
[17] R. Eklund, Disfluency in Swedish Human–Human and Human–Machine travel booking dialogues, Ph.D. thesis, Linköping University Electronic Press, 2004.
[18] S. Betz, Hesitations in Spoken Dialogue Systems, Ph.D. thesis, Universität Bielefeld, 2020.
[19] L. Schettino, The Role of Disfluencies in Italian Discourse. Modelling and Speech Synthesis Applications, Ph.D. thesis, Università degli Studi di Salerno, 2022.
[20] R. J. Lickley, Fluency and disfluency, in: M. A. Redford (Ed.), The handbook of speech production, Wiley Online Library, 2015, pp. 445–474. doi:10.1002/9781118584156.ch20.
[21] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977) 159–174.
[22] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[23] NVIDIA, NVIDIA catalog for pre-trained Conformer models, 2023. URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_{transducer|ctc}_large.
[24] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition, arXiv preprint arXiv:2005.08100 (2020).
[25] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[26] A. Graves, Sequence transduction with recurrent neural networks, arXiv preprint arXiv:1211.3711 (2012).
[27] M. Müller, Dynamic time warping, Information Retrieval for Music and Motion (2007) 69–84.
[28] L. Schettino, R. Eklund, Prolongation in Italian, in: Proceedings of Disfluency in Spontaneous Speech Workshop 2023 (DiSS 2023), 28–30 August 2023, Bielefeld, Germany, 2023, pp. 81–85.
[29] S. Siami-Namini, N. Tavakoli, A. S. Namin, The performance of LSTM and BiLSTM in forecasting time series, in: 2019 IEEE International Conference on Big Data (Big Data), IEEE, 2019, pp. 3285–3292.