<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Modelling filled particles and prolongation using end-to-end Automatic Speech Recognition systems: a quantitative and qualitative analysis</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Vincenzo</forename><forename type="middle">Norman</forename><surname>Vitale</surname></persName>
							<email>vitale@unina.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Naples Federico II</orgName>
								<address>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Loredana</forename><surname>Schettino</surname></persName>
							<email>lschettino@unibz.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Free University of Bozen-Bolzano</orgName>
								<address>
									<settlement>Bozen</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Francesco</forename><surname>Cutugno</surname></persName>
							<email>cutugno@unina.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Naples Federico II</orgName>
								<address>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Modelling filled particles and prolongation using end-to-end Automatic Speech Recognition systems: a quantitative and qualitative analysis</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">2304C853A80AE9D356696A4B4FA2D2A3</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>disfluences, speech recognition, probing, interpretability, explainability (F. Cutugno) 0000-0002-0365-8575 (V. N. Vitale)</term>
					<term>0000-0002-3788-3754 (L. Schettino)</term>
					<term>0000-0001-9457-6243 (F. Cutugno)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>State-of-the-art automatic speech recognition systems based on End-to-End models (E2E-ASRs) achieve remarkable performances. However, phenomena that characterize spoken language such as fillers (&lt;eeh&gt; &lt;ehm&gt;) or segmental prolongations (the&lt;ee&gt;) are still mostly considered as disrupting objects that should not be included to obtain optimal transcriptions, despite their acknowledged regularity and communicative value. A recent study showed that two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders -a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder -tend to model some speech features that are functional for the identification of filled pauses and prolongation in speech. This work builds upon these findings by investigating which of the two systems is better at fillers and prolongations detection tasks and by conducting an error analysis to deepen our understanding of how these systems work.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent works on Automatic Speech Recognition (ASR) systems based on the computing power of Deep Neural Networks (DNN), a great deal of effort is focused on incrementing the systems' performances by employing increasingly complex, hence hardly interpretable, DNN models that require huge amounts of data for the training, like End-to-End Automatic Speech Recognition (E2E-ASR) models which represent the state-of-the-art. An E2E-ASR model directly converts a sequence of input acoustic feature vectors (or possibly raw audio samples) into a series of graphemes or words that represent the transcription of the audio signal <ref type="bibr" target="#b0">[1]</ref>, as represented in figure <ref type="figure" target="#fig_0">1</ref>. In contrast, traditional ASR systems typically train the acoustic, pronunciation, and language models separately, requiring distinct modelling and training for each component. These systems usually aim to obtain speech transcriptions 'cleaned'from phenomena that characterise spoken language such as discourse markers, particles, pauses, or other phenomena commonly referred to as 'disfluencies'. Studies on the interpretability of the dynamics underlying neural models showed that state-of-the-art systems based on End-to-End models (E2E-ASRs) can model linguistic and acoustic features of spoken language, which can be investigated to explain their internal dynamics. Several probing techniques have been designed to inspect and better understand the internal behavior of DNN layers at different depths. With these techniques, investigations on the internals of Deep-Speech2 <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref> revealed the influence of diatopic pronunciation variation in various English varieties and provided evidence that intermediate layers contain information crucial for their classification. 
Later, a study <ref type="bibr" target="#b3">[4]</ref> on the layerwise capacity to encode information about acoustic features, phone identity, word identity, and word meaning based on the context of occurrence highlighted that the last layer right before the decoding module retains information about word meaning information, rather than local acoustic features and phone identity information that are captured by the first layers and intermediate layers respectively. Then, other studies have further investigated the capacity of state-of-the-art models to encode phonetic/phonemic information <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>, lexical tone <ref type="bibr" target="#b6">[7]</ref> and gender <ref type="bibr" target="#b7">[8]</ref>. Finally, <ref type="bibr" target="#b8">[9]</ref> investigated the internal dynamics of three pre-trained E2E-ASRs evidencing the emergence of syllable-related features by training an acoustic-syllable boundary detector. Following this line of research, a recent study <ref type="bibr" target="#b9">[10]</ref> investigated the ability of two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders -a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder -to model features that distinguish filled pauses and prolongations in speech and showed that, despite not being originally trained to detect disfluencies, these systems tend to model some speech features that are functional for their identification. Rather than disregarding the ability of E2E-ASRs to model the acoustic information tied to such speech phenomena as a dispensable noise source, it could be exploited to achieve different ends. On the one hand, it could be used to obtain more accurate transcriptions that provide better, or rather more faithful, representations of the speech signal, which would also support linguistic annotation processes. 
On the other hand, exploring the systems' modelling ability leads to deepening our understanding of their underlying dynamics. In the last 20 years, disfluency detection tasks have been conducted to improve speech recognition performances <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> and different recent approaches to filler detection achieve rather high performances, see <ref type="bibr" target="#b12">[13]</ref>. However, these investigations mostly concern filler particles and, to our knowledge, no such system has been tested on Italian data so far. The proposed work aims to build upon these findings by investigating which of the two decoding systems is better at performing a detection task for fillers and prolongations. Moreover, a quantitative and qualitative error analysis is conducted to deepen our understanding of the way these systems work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Materials and Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Data</head><p>In this study, we employed approximately 210 minutes of expert annotated speech respectively divided into ∼ 80 minutes of informative speech <ref type="bibr" target="#b13">[14]</ref>, 90 minutes of descriptive speech <ref type="bibr" target="#b14">[15]</ref> and approximately 40 minutes of dialogic speech <ref type="bibr" target="#b15">[16]</ref>, that is dyads where two speakers recorded on different channels interact. While the data from <ref type="bibr" target="#b13">[14]</ref> and <ref type="bibr" target="#b15">[16]</ref> consists of speech produced by speak-ers of the Neapolitan variety of Italian, the speakers from <ref type="bibr" target="#b14">[15]</ref> come from different Italian regions.</p><p>More specifically, the considered speech data include: audio-visual recordings of guided tours at San Martino Charterhouse (in Naples) led by three female expert guides (CHROME corpus <ref type="bibr" target="#b13">[14]</ref>), which consists of informative semi-monologic, semi-spontaneous speech characterized by a high degree of discourse planning and an asymmetrical relationship between the speakers; audiovisual recordings of 10 speakers narrating 'Frog Stories'from a picture book <ref type="bibr" target="#b14">[15]</ref>, which elicited unplanned descriptive speech; four task-oriented dialogues from the CLIPS corpus <ref type="bibr" target="#b15">[16]</ref>, which provides mainly descriptive semi-spontaneous speech characterized by a low degree of discourse planning and a high degree of collaboration between the interlocutors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Annotation</head><p>Filled Pauses (FPs), defined as non-verbal fillers realized as vocalization and/or nasalization, and Prolongations (PRLs), defined as marked lengthening of segmental material <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref> were manually annotated along with pauses, lexical fillers, repetitions, deletions, insertions, and substitutions following the annotation scheme described in <ref type="bibr" target="#b18">[19]</ref>. This is a multilevel annotation system developed to account for both formal and functional features of phenomena used to manage the own speech production. The identification of different types of phenomena was based on a 'pragmatic approach' <ref type="bibr" target="#b19">[20]</ref>, which means that it did not rely on absolute measures but on perceptual judgments given the specific contexts of occurrence. The reliability of the annotation and the Inter-Annotator Agreement was evaluated by measuring Cohen's 𝜅. It yielded 0.92 for dialogic data and 0.82 for monologic data, which stands for 'high agreement' <ref type="bibr" target="#b20">[21]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Data Preparation</head><p>The considered dataset has been prepared based on a set of praat TextGrid annotation files indicating the speaker and the type of disfluency according to the speech signal. More specifically, considering only the PRLs and the FPs, the resulting dataset has a dimension of 1900 segments. For each segment, the contextual information preceding and following the disfluency phenomenon has been considered, giving each segment a length of 4 seconds. Then, based on the combination of the so-composed dataset with each of the considered pre-trained models' encoders (details reported in Section 3.1), for each combination of segment and on each intermediate encoding layer the following elements were extracted:  The resulting dataset consists of pairs of sequences of emissions (i.e., distilled features) and corresponding labels identified by the model and the layer from which they were extracted. Note that each sequence of intermediate layer emissions has a length ℎ = 4𝑠𝑒𝑐𝑜𝑛𝑑𝑠/40𝑚𝑖𝑙𝑙𝑖𝑠𝑒𝑐𝑜𝑛𝑑𝑠, as it represents the temporal succession of segments before, during, and after disfluency phenomena. We use the term emission <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b8">9]</ref> to indicate intermediate layer neurons fire, instead of the more commonly used term embedding <ref type="bibr" target="#b7">[8]</ref>, as the latter is widely used to indicate the output of an entire module rather than a layer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Disfluency Identification Through Model Probing</head><p>Building upon recent studies that make use of probes to better understand the internal behavior of pre-trained E2E-ASR models' <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b2">3]</ref>, we apply a similar approach to investigate if and to which extent a pre-trained model (𝑚) can codify disfluencies-related features in the encoding module, even if they are not trained to do so. The employed approach is aimed at building specific classifiers whose inputs are represented by intermediate emissions of the considered model's encoder layers (𝑙), combined with the appropriate sequence of labels based on dataset annotation. Internally, each classifier consists of a Long Short Term Memory (LSTM) module followed by a Feed Forward Neural Network (FFNN). Given that our problem can be related to sequence classification, the LSTMs seem to be the most naturally suited model <ref type="bibr" target="#b21">[22]</ref>; usually, an LSTM consists of one computational unit that iteratively processes all input time series vectors. This unit  The goal is to explore which of the considered pretrained E2E ASR models, based on different decoding systems, better encodes characteristics associated with disfluent speech segments to perform a fillers and prolongations detection task. To this end, two publicly available <ref type="bibr" target="#b22">[23]</ref> Conformer-based models <ref type="bibr" target="#b23">[24]</ref> with 120 million parameters each, built with the NVIDIA Nemo toolkit and differing only in the decoding strategy, were selected. On the one hand, a Conformer-based model with a Connectionist Temporal Classification (CTC) <ref type="bibr" target="#b24">[25]</ref> decoder has been considered, as the CTC is one of the most popular decoding techniques. 
Such a decoding technique is a nonauto-regressive speech transcription technique that collapses consecutive, all-equal, transcription labels (character, word piece, etc.) to one label unless a special label separates these. The result is a sequence of labels shorter or equal to the input vector sequence length. Being nonauto-regressive, it is also considered computationally effective as it requires less time and resources for training and inference phases. On the other hand, a Conformerbased model with the Recurrent Neural Network Transducer (RNN-T), commonly known as Transducer has been considered. The RNN-T is an auto-regressive speech transcription technique that overcomes CTC's limitations, being non-auto-regressive and subject to limited label sequence length. The Transducer decoding technique can produce label-transcription sequences longer than the input vector sequence and models inter-dependency in long-term transcription elements. A Transducer typically comprises two sub-modules: one that forecasts the next transcription label based on the previous transcriptions (prediction network) and the other that combines the encoder and prediction-network outputs to produce a new transcription label (joiner network). These features improve transcription speed and performance compared to CTC while requiring more training and computational resources <ref type="bibr" target="#b25">[26]</ref>. 
Note that both pre-trained models rely on the same encoder architecture, but the Conformer-CTC model has 18 encoding layers, while the Conformer-Transducer encoder has 17 layers.</p><p>In this study, ∼ 100 classifiers (2 models * ∼17 layers * 3 classifier sizes) were trained to investigate which of the considered pre-trained models, differing only by the decoding approach, encodes enough information to perform a disfluency detection task.</p><p>To evaluate the alignment between the output of the classifier and the reference label sequence we employ the Dynamic Time Warping Distance (DTW distance) <ref type="bibr" target="#b26">[27]</ref>, reported in figure <ref type="figure" target="#fig_2">2a</ref>. The DTW results highlight that layers closer to the decoding module seem to contain most of the information needed to perform a correct detection of the considered disfluencies, obtaining an average DTW distance of approximately 1.39 in all the cases, with a considerably low standard error. Then, to evaluate the capability of each classifier to provide a correct as well as aligned labels sequence, we employed the weighted F1 measure, reported in figure <ref type="figure" target="#fig_2">2b</ref>. Also in this case, F1 results confirm that layers closer to the decoding module seem to be those containing most of the information needed to correctly identify the disfluency segment. The combination of F1 and DTW provides an integrated perspective on the system's ability to classify and align segments correctly. Finally, in Figure <ref type="figure" target="#fig_4">3</ref> (a and b), we report the confusion matrix of the best classifiers obtained from each considered model. On the one side, the CTC seems to be better at discriminating non-disfluent segments (ND), while showing the worst performance in disfluency identification. 
On the other side, the RNN-T-based classifier shows considerable performance at identifying FPs and is the worst in discriminating ND segments, while PRL performance is comparable to the CTC classifier. Both matrices highlight that the most difficult disfluency phenomena to classify are prolongations, which is the focus of our preliminary exploratory error analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Qualitative Analysis</head><p>The qualitative analysis is based on the best classifier for each of the considered models used to generate the distilled features. In particular, for the CTC version, the best classifier resulted in the one with 640 hidden neurons trained on 18-th layer features. Among the transducerbased versions, the one with 640 hidden neurons trained on 17-th layer features emerged as the best version. The visual inspection of the distribution of the considered phenomena highlights that for both the CTC (4a) and the RNN Transducer classifiers (4b), FP phenomena concentrate on higher F1 weighted values, whereas wider distributions are observed for PRL phenomena, which shows that both classifiers work better when dealing with FP than for PRL phenomena. Focusing on the PRL instances, a negative correlation is observed between the F1 weighted scores and PRLs' duration (CTC non-recognized r = -0.91, figure <ref type="figure" target="#fig_5">4c</ref>; RNN Transducer non-recognized r = -0.87, figure <ref type="figure" target="#fig_5">4d</ref>).</p><p>The error analysis was supported by an auditory inspection of the unrecognized and misclassified samples filtered based on the average DTW distance, namely, 1.39 for the Transducer-based and 1.40 for the CTCbased classifier. Issues in PRL recognition mostly concerned shorter instances, those characterized by peculiar 'non-prototypical'phonation features (such as unsteady, creaky phonation) and the alignment of PRL-predicted occurrences. Also, several PRL phenomena were misclassified as FP when occurring with monosyllabic words, such as 'o&lt;oo&gt;', 'un po&lt;oo&gt;', 'che&lt;ee&gt;', 'e&lt;ee&gt;'. In fact, the phonetic realization of these instances is closer to the ones that characterize FP for their vowel quality and as being, to a certain extent, independent elements from the phonetic environment</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion and Conclusions</head><p>In this work, we build upon a previous study that investigated to what extent modern ASR E2Es encode features related to disfluency phenomena, even if they are not directly trained to do so. We showed that pre-trained models with the same audio encoder but with two different state-of-the-art decoding strategies (CTC and Trans-ducer) capture disfluency-related features, especially in the latest encoding layer, and both model features that can be used for the identification and positioning of disfluent speech segments <ref type="bibr" target="#b9">[10]</ref>. Although there seems to be a tendency to forget this information with subsequent layers, as the trends for DTW (figure <ref type="figure" target="#fig_2">2a</ref>) and F1-measure (figure <ref type="figure" target="#fig_2">2b</ref>) would suggest, the last layers, which are those closest to the objective function represented by the decoding module, seem the most prone to retain characteristics useful to locate and identify disfluency phenomena. Interestingly, despite the differences between the two decoding modules which are respectively non-recurrent (CTC) and recurrent (RNN-T), the performances for the chosen task are comparable. However, the confusion matrices highlight that the CTC-based classifier performs better in the disfluency feature discrimination task, while the Transducer-based classifier more precisely identifies filled pauses, which could be related to the scope (recurrent/non-recurrent) of the objective function. The results align with the literature that shows a strong sensitivity to features concerning words and phone of the layers closest to the encoder <ref type="bibr" target="#b3">[4]</ref>, while the layers closest to the input are more sensitive to features related to accent and local acoustic characteristics <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>. 
It is worth noticing that, in a recent work <ref type="bibr" target="#b8">[9]</ref>, sensitivity to syllabic boundaries was found in layers 3-5, with a pattern similar to the one shown in Figure <ref type="figure" target="#fig_2">2</ref> but without the peak in the last layers. The reason can be found in the fact that syllables and their boundaries do not have a graphic distinction in the transcriptions, conversely, in the case of disfluencies, there is a form of transcription that identifies them within a language model. The exploratory analysis of the errors highlighted that prolongations are more difficult to detect than filled pauses, which could depend on their being an integral (though lengthened) part of 'fluent'words while filled pauses are mostly realized as independent elements. Also, instances of prolongation are mostly non-recognized or misclassified as filled pauses when characterized by peculiar 'non-prototypical'phonation features, such as creaky phonations, or filler-like features, as in the case of monosyllabic word-final prolongations. Also, previous studies on the segmental quality of prolongations in Italian <ref type="bibr" target="#b27">[28]</ref> showed that prolongations, especially when concerning consonantal sounds, can be realised with schwa sounds similar to those that characterize most filled pauses. This filler-like quality could also be considered among the underlying reasons for the negative correlation between the evaluation metrics of prolongations misclassification and their duration. Another possible motivation could reside in a bias in the dataset combined with the classifier architecture (LSTM), which easily recognises prolongations responding to a specific length pattern. This means that the scarcity of longer prolongations hinders their modelling leading to their misclassification. 
These findings could be used to improve transcription applications by enriching them with disfluency annotation (including filler particles and prolongation phenomena), which are still rather costly processes for studies concerning hesitation phenomena and (own) speech management in typical as well as atypical speech (e.g., pathological or language learners' speech. Indeed, an immediate development of the described work consists of increasing the capabilities of the pre-trained E2E-ASRs by adding a simple disfluency identification module to complement the existing decoder, thus enriching the resulting transcriptions.</p><p>Our work is built upon unidirectional LSTMs rather than on bidirectional LSTMs (BiLSTMs), which provide better performance because the latter have slightly longer inference times and require a larger amount of data, resources, time to be trained and, most importantly, present a more complex behaviour <ref type="bibr" target="#b28">[29]</ref>. However, the introduction of different architecture modules like bidirectional LSTM could improve the detection of prolongation disfluencies. This will be part of future developments focused on performance and increased neural network complexity.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: E2E ASRs are based on an encoder-decoder architecture. The speech signal is fed to the encoder, producing an encoded representation that contains the information needed by the decoder to provide the sequence of words/characters/subwords and build the transcription.</figDesc><graphic coords="1,312.79,339.12,183.02,122.81" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>•</head><label></label><figDesc>A sequence of intermediate layer emissions/embedding representing the input segment in the layer's (a) Average Dynamic time warping distance measured between sequences of labels with standard error (shade). (b) Average Weighted F1 measure measured between sequences of labels with standard error (shade).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Dynamic Time Warping distance (figure a) and Weighted F1 (figure b) for all the trained classifers. The x-axis indicates the index (starting from index 0) of the intermediate layer from which the distilled features have been extracted to train the corresponding classifier. vectorial space. Each emission in the sequence represents a portion of 40 milliseconds of the input signal due to the considered model's characteristics. • A sequence of labels associated with each sequence of emissions, indicating whether an intermediate emission belongs to a particular class of disfluencies (1 for FP and 2 for PRL) or not (label 0 if the segment does not belong to a disfluency).</figDesc><graphic coords="3,89.29,229.73,416.69,151.83" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) CTC-based classifier with hidden size 640 trained on distilled features from layer 18 (index 17 in F1,DTW plots). (b) RNN-T-based classifier with hidden size 640 trained on distilled features from layer 16 (index 15 in F1,DTW plots).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Confusion matrix for the best classifiers obtained for each of the considered decoding approaches.comprises three gates processing one vector at a time and combining it with information extracted from previous vectors. One of the most crucial parameters for an LSTM is the hidden layer, therefore we investigate the impact of three different layer sizes (hidden-layer size, 𝑛), namely 160, 320 and 640. So, an LSTM-based classifier processes a sequence of {𝑒 𝑙,𝑚 } emission vectors (each of length ℎ) and produces a new sequence of vectors with size 𝑛. The two sequences are aligned over time. At each time step 𝑡, the FFNN produces a label indicating whether the considered input represents a specific disfluency segment (label 1 for filled pause or 2 for prolongation) or not (with label 0) based on the LSTM hidden-layer output. In summary, we train and evaluate many different LSTM-based disfluencies classifiers/detectors (𝐿 𝑛,𝑚,𝑙 ) for all possible 𝑛, 𝑚, and 𝑙 combinations to search for the evidence of disfluencies-related properties in the models' decisions.The goal is to explore which of the considered pretrained E2E ASR models, based on different decoding systems, better encodes characteristics associated with disfluent speech segments to perform a fillers and prolongations detection task. To this end, two publicly available<ref type="bibr" target="#b22">[23]</ref> Conformer-based models<ref type="bibr" target="#b23">[24]</ref> with 120 million parameters each, built with the NVIDIA Nemo toolkit and differing only in the decoding strategy, were selected. On the one hand, a Conformer-based model with a Connectionist Temporal Classification (CTC)<ref type="bibr" target="#b24">[25]</ref> decoder has been considered, as the CTC is one of the most popular decoding techniques. 
Such a decoding technique is a nonauto-regressive speech transcription technique that collapses consecutive, all-equal, transcription labels (character, word piece, etc.) to one label unless a special label separates these. The result is a sequence of labels shorter or equal to the input vector sequence length. Being nonauto-regressive, it is also considered computationally effective as it requires less time and resources for training and inference phases. On the other hand, a Conformerbased model with the Recurrent Neural Network Transducer (RNN-T), commonly known as Transducer has been</figDesc><graphic coords="4,89.29,61.51,187.51,140.63" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: The plots in (a) for CTC and (b) for RNN-T report the F1 measure related to the frequency of FP (yellow) and PRL (purple). Scatterplots for CTC (c) and RNN-T (d) compare the duration of the PRL segments with the respective F1 measure.</figDesc><graphic coords="5,89.29,199.37,197.91,117.16" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of end-to-end speech recognition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Journal of Physics: Conference Series</title>
				<imprint>
			<publisher>IOP Publishing</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1187</biblScope>
			<biblScope unit="page">52068</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">End-to-end accented speech recognition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Viglino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Motlicek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cernak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Interspeech</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="2140" to="2144" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">How accents confound: Probing for accent information in end-to-end speech recognition systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Prasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jyothi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3739" to="3753" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Layer-wise analysis of a self-supervised speech representation model</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Chou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Livescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="914" to="921" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Domain-informed probing of wav2vec 2.0 embeddings for phonetic features</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">C</forename><surname>English</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kelleher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carson-Berndsen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology</title>
				<meeting>the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="83" to="91" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Probing self-supervised speech models for phonetic and phonemic information: A case study in aspiration</title>
		<author>
			<persName><forename type="first">K</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gauthier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Breiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Levy</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2023-2359</idno>
	</analytic>
	<monogr>
		<title level="m">INTERSPEECH 2023</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="251" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Encoding of lexical tone in selfsupervised models of spoken language</title>
		<author>
			<persName><forename type="first">G</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Watkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alishahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bisazza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chrupała</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.naacl-long.239</idno>
		<ptr target="https://aclanthology.org/2024.naacl-long.239" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Gomez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<meeting>the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4250" to="4261" />
		</imprint>
	</monogr>
	<note>: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">On the encoding of gender in transformer-based asr representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">M</forename><surname>Abdullah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Klakow</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2024-2209</idno>
	</analytic>
	<monogr>
		<title level="m">Interspeech 2024</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="3090" to="3094" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">N</forename><surname>Vitale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Origlia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Coro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computing and Applications</title>
		<imprint>
			<biblScope unit="page" from="1" to="27" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers&apos; ability to model hesitation phenomena</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">N</forename><surname>Vitale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schettino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2024-2029</idno>
	</analytic>
	<monogr>
		<title level="m">Interspeech 2024</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="222" to="226" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Detection of filled pauses in spontaneous conversational speech</title>
		<author>
			<persName><forename type="first">M</forename><surname>Gabrea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>O&apos;Shaughnessy</surname></persName>
		</author>
		<idno type="DOI">10.21437/ICSLP.2000-626</idno>
		<ptr target="https://www.isca-archive.org/icslp_2000/gabrea00_icslp.html" />
	</analytic>
	<monogr>
		<title level="m">6th International Conference on Spoken Language Processing (ICSLP 2000)</title>
				<imprint>
			<publisher>ISCA</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="678" to="681" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Spontaneous speech: how people really talk and why engineers should care</title>
		<author>
			<persName><forename type="first">E</forename><surname>Shriberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">INTERSPEECH</title>
				<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="1781" to="1784" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Semiautomatic support of speech fluency assessment by detecting filler particles and determining speech tempo</title>
		<author>
			<persName><forename type="first">V</forename><surname>Kany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Trouvain</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on prosodic features of language learners&apos; fluency</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">An audiovisual corpus of guided tours in cultural sites: Data collection protocols in the CHROME project</title>
		<author>
			<persName><forename type="first">A</forename><surname>Origlia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Savy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Poggi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Alfano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>D'errico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vincze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Cataldo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 AVI-CH Workshop on Advanced Visual Interfaces for Cultural Heritage</title>
				<meeting>the 2018 AVI-CH Workshop on Advanced Visual Interfaces for Cultural Heritage</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Sarro</surname></persName>
		</author>
		<title level="m">The Manner encoding in an Italian corpus collected with Modokit</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
		<respStmt>
			<orgName>Università degli Studi dell&apos;Aquila</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Master&apos;s thesis</note>
	<note>The many ways to search for an Italian frog</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Diatopic, diamesic and diaphasic variations in spoken Italian</title>
		<author>
			<persName><forename type="first">R</forename><surname>Savy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CL2009, The 5th Corpus Linguistics Conference</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Mahlberg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>González-Díaz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Smith</surname></persName>
		</editor>
		<meeting>CL2009, The 5th Corpus Linguistics Conference<address><addrLine>Liverpool, UK</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009-07-23">20-23 July 2009</date>
			<biblScope unit="page" from="20" to="23" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Eklund</surname></persName>
		</author>
		<title level="m">Disfluency in Swedish Human-Human and Human-Machine travel booking dialogues</title>
				<imprint>
			<publisher>Electronic Press</publisher>
			<date type="published" when="2004">2004</date>
		</imprint>
		<respStmt>
			<orgName>Linköping University</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Hesitations in Spoken Dialogue Systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Betz</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>Universität Bielefeld</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The Role of Disfluencies in Italian Discourse</title>
		<author>
			<persName><forename type="first">L</forename><surname>Schettino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Modelling and Speech Synthesis Applications</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
		<respStmt>
			<orgName>Università degli Studi di Salerno</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Fluency and disfluency</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Lickley</surname></persName>
		</author>
		<idno type="DOI">10.1002/9781118584156.ch20</idno>
		<ptr target="https://doi.org/10.1002/9781118584156.ch20" />
	</analytic>
	<monogr>
		<title level="m">The handbook of speech production</title>
				<editor>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Redford</surname></persName>
		</editor>
		<imprint>
			<publisher>Wiley Online Library</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="445" to="474" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The measurement of observer agreement for categorical data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Landis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">G</forename><surname>Koch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics</title>
		<imprint>
			<biblScope unit="page" from="159" to="174" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">NVIDIA catalog for pre-trained Conformer models</title>
		<ptr target="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_{transducer|ctc}_large" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Gulati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.08100</idno>
		<title level="m">Conformer: Convolution-augmented transformer for speech recognition</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd international conference on Machine learning</title>
				<meeting>the 23rd international conference on Machine learning</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="369" to="376" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Sequence transduction with recurrent neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1211.3711</idno>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Dynamic time warping, Information retrieval for music and motion</title>
		<author>
			<persName><forename type="first">M</forename><surname>Müller</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="69" to="84" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Prolongation in Italian</title>
		<author>
			<persName><forename type="first">L</forename><surname>Schettino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Eklund</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Disfluency in Spontaneous Speech Workshop 2023</title>
				<meeting>Disfluency in Spontaneous Speech Workshop 2023<address><addrLine>DiSS; Bielefeld, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023-08">August 2023</date>
			<biblScope unit="page" from="81" to="85" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">The performance of LSTM and BiLSTM in forecasting time series</title>
		<author>
			<persName><forename type="first">S</forename><surname>Siami-Namini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tavakoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Namin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2019 IEEE International conference on big data (Big Data)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3285" to="3292" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
