<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards a responsible usage of AI-based Large Acoustic Models for Automatic Speech Recognition: on the importance of data in the self-supervised era</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Vincenzo</forename><forename type="middle">Norman</forename><surname>Vitale</surname></persName>
							<email>vitale@unina.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Naples</orgName>
								<address>
									<addrLine>Federico II, Corso Umberto I, 40</addrLine>
									<postCode>80138</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">UrbanECO Research Center</orgName>
								<orgName type="institution">University of Naples</orgName>
								<address>
									<addrLine>Federico II, via Tarsia, 31</addrLine>
									<postCode>80134</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Emilia</forename><surname>Tanda</surname></persName>
							<email>e.tanda@studenti.unina.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Naples</orgName>
								<address>
									<addrLine>Federico II, Corso Umberto I, 40</addrLine>
									<postCode>80138</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Francesco</forename><surname>Cutugno</surname></persName>
							<email>cutugno@unina.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Naples</orgName>
								<address>
									<addrLine>Federico II, Corso Umberto I, 40</addrLine>
									<postCode>80138</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">UrbanECO Research Center</orgName>
								<orgName type="institution">University of Naples</orgName>
								<address>
									<addrLine>Federico II, via Tarsia, 31</addrLine>
									<postCode>80134</postCode>
									<settlement>Naples</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards a responsible usage of AI-based Large Acoustic Models for Automatic Speech Recognition: on the importance of data in the self-supervised era</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">A0F7744AC80619D024C5BE56D44D0F82</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>End-to-End ASR</term>
					<term>self-supervised</term>
					<term>quality of data</term>
					<term>communication style</term>
					<term>responsible AI</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evolution of artificial intelligence models has made them everyday tools in many fields. However, the enormous capabilities demonstrated by these models come, on the one hand, with apparent costs in terms of money, computational resources, or data. On the other hand, there are hidden costs for end users who rely on models trained by third parties, sacrificing awareness and control of the tool while trying to evaluate its performance in their specific contexts. This is the case for supervised End-to-End (E2E) ASR systems and self-supervised E2E-ASR, also referred to as Large Acoustic Models (LAMs). On the one hand, they provide an important starting point for building information systems oriented to speech interaction; on the other hand, they are complex to evaluate, use, and adapt in specific contexts.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Modern Automatic Speech Recognition (ASR) systems, like other Natural Language Processing (NLP) systems, achieve remarkable performance thanks to the computing potential enabled by Deep Neural Networks (DNNs). Indeed, over the last decade, the automatic speech recognition community has made great strides <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>, moving from traditional hybrid modelling (Acoustic Model + Language Model) to end-to-end (E2E) modelling, which directly translates an input speech sequence into a sequence of output tokens using a single network, and then to self-supervised E2E models, also referred to as Large Acoustic Models (LAMs), that can model speech without the aid of labelled data. These revolutionary innovations have completely subverted the traditional architectures of ASR systems used in previous decades. They have also strongly impacted the cost-effectiveness and democratization of ASR systems. On the one hand, the change in architecture has made it more economical to collect and create the datasets necessary for training, which previously required a large number of speech-analysis experts engaged in long and expensive processes of manual labelling. On the other hand, it has allowed the creation of a large number of freely available, open-source, general-purpose ASRs, bringing these systems within the reach of a greater number of institutions and companies. However, their use remains limited due to the lack of benchmarks oriented towards specific contexts and communication styles. In this work, we analyze the evolution of ASR systems, how the nature of the data used for their training has changed, and the limitations of modern ASR systems. Finally, we propose an initiative aimed at collecting high-quality Italian data for both performance verification and training on specific communicative styles.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The evolution of ASR systems</head><p>ASR systems have undergone several revolutions, which have impacted their internal architecture and the nature of the data employed for their training. Traditional ASR systems rely on two separate components <ref type="bibr" target="#b8">[9]</ref>: the Acoustic Model (AM), which converts the voice signal into a sequence of phones, and the Language Model (LM), which transforms the sequence of phones received from the AM into the most likely and reliable transcription. These two models were initially realised with techniques such as Hidden Markov Models (HMMs) or Gaussian Mixture Models (GMMs). Then, with the advent of Deep Neural Networks (DNNs), both came to be realised as supervised DNNs. Still, the output of both components remained the same: the AM produces the most likely sequence of phones given the input voice signal, while the LM provides the most reliable transcription given the input sequence of phones. This means that the two components had separate objectives and relied on different kinds of high-quality and costly datasets. On the one hand, the AM needs well-aligned sound-to-phone transcriptions. On the other hand, the LM needs a statistically representative set of phone-to-word samples in order to provide meaningful transcriptions. In both cases, providing adequate-quality data requires highly specialised professionals for hand-labelling. This type of ASR system requires tens to hundreds of hours of speech to train the AM and a few million words to train an LM (depending on the context). The aim is to transcribe fairly long sentences with an accuracy tied to specific application contexts.</p></div>
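The two-stage decomposition described above can be illustrated with a minimal sketch. All names, lexicon entries, and probability tables below are hypothetical, purely for exposition; real hybrid systems use HMM/GMM or DNN components rather than lookup tables.

```python
# Toy illustration of the traditional hybrid AM + LM pipeline
# (all tables and names are hypothetical, for exposition only).

# AM stand-in: maps per-frame phone scores to the most likely phone sequence.
def acoustic_model(frames):
    return [max(frame, key=frame.get) for frame in frames]

# LM stand-in: a lexicon of phone-sequence -> (word, prior) candidates.
LEXICON = {
    ("k", "a", "t"): [("cat", 0.9), ("cut", 0.1)],
    ("d", "o", "g"): [("dog", 1.0)],
}

def language_model(phones):
    # Pick the most probable word for the phone sequence, if known.
    candidates = LEXICON.get(tuple(phones), [("<unk>", 1.0)])
    return max(candidates, key=lambda wp: wp[1])[0]

# Each "frame" holds per-phone scores produced by an acoustic front-end.
frames = [{"k": 0.8, "g": 0.2}, {"a": 0.7, "o": 0.3}, {"t": 0.6, "d": 0.4}]
phones = acoustic_model(frames)   # ["k", "a", "t"]
word = language_model(phones)     # "cat"
```

The key point the sketch makes concrete is that the two stages have separate objectives and separate data needs: the AM table requires frame-to-phone alignments, while the LM table requires phone-to-word samples, each produced by distinct (and costly) labelling efforts.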
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The turning point that led to the recent End-to-End ASR (E2E-ASR) <ref type="bibr" target="#b1">[2]</ref> was the introduction of the Transformer <ref type="bibr" target="#b9">[10]</ref> network architecture, on which most current AI models rely. Compared to traditional systems, in E2E-ASRs the voice signal is directly converted into its corresponding transcription without any intermediate, human-readable format. This evolution results in systems with a single objective that need only one, cheaper, dataset for training, since the intermediate phone transcription and alignment steps have been removed. The Transformer architecture <ref type="bibr" target="#b9">[10]</ref> opens up the possibility of building a combination of AM and LM, now referred to as the Encoder and Decoder, which directly maps an unaligned sequence of sounds to its transcription. With a few hundred hours of non-aligned transcribed speech and a supervised learning process, E2E-ASR systems outperform the previous generation on average, reaching error rates down to 5% with pure Transformer-Encoder systems, or down to 4% with Conformer-Encoder systems <ref type="bibr" target="#b4">[5]</ref> (see table <ref type="table">1</ref> for performance). Clearly, the choice of Decoder module implementation strongly impacts E2E-ASR performance; such a module is usually implemented as a Connectionist Temporal Classification (CTC) model <ref type="bibr" target="#b10">[11]</ref> or as a Recurrent Neural Network Transducer (RNN-T) <ref type="bibr" target="#b11">[12]</ref>. CTC is a non-auto-regressive speech transcription technique which collapses consecutive identical transcription labels (characters, word pieces, etc.) into one label, unless a special blank label separates them. The result is a label sequence whose length is shorter than or equal to that of the input vector sequence. CTC is one of the most widespread decoding techniques.
Being non-auto-regressive, it is also considered computationally efficient, as it requires less time and fewer resources for the training and inference phases. Conversely, the RNN-T (also named Transducer) is an auto-regressive speech transcription technique which overcomes CTC's limitations, namely its non-auto-regressive nature and its bound on label sequence length. An RNN-T can produce label sequences longer than the input vector sequence and models long-term inter-dependencies among transcription elements. A Transducer typically comprises two sub-decoding modules: one that predicts the next transcription label based on the previous ones (prediction network), and another that combines the encoder and prediction-network outputs to produce a new transcription label (joiner network). These features improve transcription speed and performance with respect to CTC, at the expense of more training and computational resources <ref type="bibr" target="#b11">[12]</ref>.</p><p>Finally, the most recent advancement is the employment of self-supervised training techniques, giving rise to what could be defined as the first truly End-to-End ASR, namely Wav2Vec2 <ref type="bibr" target="#b5">[6]</ref> and, shortly after, HuBERT <ref type="bibr" target="#b6">[7]</ref>, both also referred to as Large Acoustic Models (LAMs) <ref type="bibr" target="#b12">[13]</ref> because of their training process, which usually involves two main phases. The first is the pre-training phase, during which vast amounts of untranscribed speech data are used to recognize and discretize hidden acoustic-unit representations, employing processes such as quantization directly on the raw audio samples (Wav2Vec2 <ref type="bibr" target="#b5">[6]</ref>) or clustering on MFCC features (HuBERT <ref type="bibr" target="#b6">[7]</ref>).
Then, during the last phase, a transcription module can be trained on smaller datasets (a few hours) in order to obtain an error rate of about 2% (see table <ref type="table">1</ref>).</p></div>
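The CTC collapsing rule described above admits a compact sketch. The function name and the blank symbol below are our own illustrative choices, not taken from any specific toolkit; real CTC decoders operate on per-frame probability distributions rather than hard labels.

```python
BLANK = "_"  # illustrative choice for CTC's special separator label

def ctc_collapse(labels):
    """Greedy CTC post-processing: merge runs of identical labels,
    then drop blanks. Output is never longer than the input."""
    out = []
    prev = None
    for lab in labels:
        if lab != prev:          # merge consecutive identical labels
            out.append(lab)
        prev = lab
    return [lab for lab in out if lab != BLANK]  # remove separators

# Repeated frame labels collapse, but the blank between the two 'l'
# frames keeps the double letter of "hello" intact.
print("".join(ctc_collapse(list("hheel_llo"))))  # hello
```

Note how the blank makes repeated output symbols expressible: without the separator, `"ll"` would collapse to a single `"l"`, which is exactly the ambiguity the special label exists to resolve.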
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Self-supervised E2E Solutions (?) to data shortages</head><p>Undeniably, by committing the model to learn all parts automatically, E2E-ASRs overcome the difficulties and cost-ineffectiveness of the data preparation and modelling phases of conventional systems, while requiring far more training data <ref type="bibr" target="#b13">[14]</ref>. This shift significantly impacted ASR systems: on the one hand, it markedly reduced training data costs while increasing data volume, as shown by the availability of plenty of general-purpose training datasets <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b2">3]</ref>. On the other hand, thanks to the cheapness of training data, ASR systems are now accessible to a wider public. Clearly, these innovations carry some expenses, which in this case consist of higher computational costs, longer training times, and loss of modularity <ref type="bibr" target="#b2">[3]</ref> compared to traditional ASR systems. Indeed, adapting such a general-purpose E2E-ASR to specific contexts means, in some cases, updating the Decoder (LM) to a special-purpose field or updating the Encoder (AM) to handle a special type of speech, which requires fine-tuning and, in the worst cases, training the model from scratch.</p><p>Then, the advent of self-supervised systems impacted the adaptability of general-purpose E2E ASR, giving rise to Large Acoustic Models (LAMs), which are basically Encoders trained on vast amounts of cheaper, non-transcribed datasets (compared to the data needed by plain E2E-ASR), then combined with a Decoder part trained on small quantities of language-specific transcribed data. The result is a large, general-purpose model that can be easily deployed in most contexts.
Although they are publicly available and, therefore, freely adaptable, the necessary computational resources are so prohibitive that they are within the reach of only a few companies and institutions, even for simple fine-tuning.</p><p>A further point to consider is that the advantages of both plain E2E ASR and self-supervised systems come at the expense of lower interpretability of the systems' internals, making it difficult to diagnose errors and limiting their usage in critical contexts <ref type="bibr" target="#b2">[3]</ref>. However, some studies in the field of eXplainable AI (XAI) <ref type="bibr" target="#b14">[15]</ref> try to provide explanations and methodologies for analysing the behaviours and phenomena modelled by various E2E ASR systems, aiming to make them more interpretable <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19]</ref>, though still relying on special-purpose data.</p><p>To summarize, although the innovations introduced by E2E and self-supervised E2E systems have allowed their fast diffusion, their industrial and institutional deployment remains subject to limitations <ref type="bibr" target="#b2">[3]</ref> which, in some cases, are strongly related to special-purpose data availability. Indeed, employing a general-purpose E2E ASR system in a specific domain requires evaluation and potentially fine-tuning or training on domain-specific data, which is usually unavailable. Another aspect to consider is how, and to what extent, the democratisation of ASR systems has been affected: if, on the one hand, it is now possible to obtain much more data for the same cost, on the other hand, the same quantity of resources is no longer sufficient, especially for training purposes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">High-quality data for context-specific assessment</head><p>Clearly, the availability of good-quality, well-categorized data is paramount in the current application landscape. On the one hand, such data is essential to evaluate pre-trained systems in specific contexts, with speaking styles related to different communication situations. On the other hand, such data is crucial for training and fine-tuning modern supervised and self-supervised E2E ASR. To this end, the Phoné consortium was born as a voluntary initiative to collect, verify, and distribute transcribed and non-transcribed Italian speech datasets in various application contexts. Table <ref type="table" target="#tab_1">2</ref> shows the current amount of data collected and verified by the consortium, intended to provide Italian institutions and companies with adequate instruments to evaluate these promising tools, which are otherwise assessed in contexts and communication styles that do not reflect the target ones. Currently, the data is divided into two macro-categories, Transcribed and Untranscribed, to enable the future training of self-supervised E2E-ASR. Datasets are further divided into specific communication styles <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref>:</p><p>• Monologic speech involves only one person speaking without interacting with an interlocutor. This type of speech is characterized by consistency and structure, as it typically consists of lectures, speeches, or other situations that require preliminary preparation. As a result, the speech appears cohesive and well-organized, and the language register tends to be higher and more formal. • Dialogic speech involves two or more people in a conversation, characterized by exchanges of messages and information. It is thus configured as a communicative act with a dynamic structure.
Unlike monologic speech, dialogic speech does not involve prior preparation; therefore, the speech tends to be syntactically simpler, the articulation of words tends to be less precise (hypoarticulation), and expression tends to be more concise. • In Read speech, the speaker reads a written text aloud (as in the case of audiobooks); this type of speech is therefore characterized by clear pronunciation (with a tendency towards hyperarticulation), complete syntax, and greater coherence and cohesion of the text. A further feature is the modulation of reading speed and the use of strategic pauses and intonation to improve communicative effectiveness. Beyond ASR-related aspects, the consortium's purposes also extend to other voice-related tasks, including, but not limited to, Text-To-Speech (TTS), Speaker Identification (SI), and Speaker Verification (SV).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this work, we present the current panorama of E2E ASR systems, how their data usage evolved along with technological improvements, and the issues that these improvements solved or introduced. First, we observe the significant improvement in model performance, while pointing out issues connected to assessing model capacity in specific communication styles and domains. We also observe the shift in model training costs: data has become cheaper and easier to collect, while computing resources have grown in quantity and cost. We then observe how the advantages introduced by modern E2E (supervised and self-supervised) ASRs come at the expense of increased complexity, which consequently reduces their interpretability. Finally, we propose a voluntary, high-quality data collection initiative to evaluate and train systems across various speech communication styles, to enable more informed use and greater accessibility of E2E-ASR systems.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>List of material collected and verified for the evaluation and training of E2E-ASR systems (both supervised and self-supervised) in specific contexts for the Italian language.</figDesc><table><row><cell>Type</cell><cell>Speech Type</cell><cell>Minutes</cell></row><row><cell>Transcribed</cell><cell>Monologic</cell><cell>500</cell></row><row><cell>Transcribed</cell><cell>Dialogic</cell><cell>400</cell></row><row><cell>Transcribed</cell><cell>Read</cell><cell>120</cell></row><row><cell>Untranscribed</cell><cell>Monologic</cell><cell>10000</cell></row><row><cell>Untranscribed</cell><cell>Dialogic</cell><cell>500</cell></row><row><cell>Untranscribed</cell><cell>Read</cell><cell>2200</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Automatic speech recognition: a survey</title>
		<author>
			<persName><forename type="first">M</forename><surname>Malik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Malik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mehmood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Makhdoom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applications</title>
		<imprint>
			<biblScope unit="volume">80</biblScope>
			<biblScope unit="page" from="9411" to="9457" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Recent advances in end-to-end automatic speech recognition</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">APSIPA Transactions on Signal and Information Processing</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">End-to-end speech recognition: A survey</title>
		<author>
			<persName><forename type="first">R</forename><surname>Prabhavalkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Sainath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schlüter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Watanabe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tripathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mcdermott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7829" to="7833" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Conformer: Convolution-augmented transformer for speech recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gulati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">wav2vec 2.0: A framework for self-supervised learning of speech representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="12449" to="12460" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">HuBERT: Self-supervised speech representation learning by masked prediction of hidden units</title>
		<author>
			<persName><forename type="first">W.-N</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bolte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-H</forename><forename type="middle">H</forename><surname>Tsai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakhotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="3451" to="3460" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training</title>
		<author>
			<persName><forename type="first">Y.-A</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Chiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="244" to="250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A review on automatic speech recognition architecture and approaches</title>
		<author>
			<persName><forename type="first">S</forename><surname>Karpagavalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chandra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Signal Processing, Image Processing and Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="393" to="404" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd international conference on Machine learning</title>
		<meeting>the 23rd international conference on Machine learning</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="369" to="376" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Sequence transduction with recurrent neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1211.3711</idno>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Large scale acoustic models: A new perspective</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Giordano Orsini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">N</forename><surname>Vitale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
		<idno type="DOI">10.1422/108137</idno>
	</analytic>
	<monogr>
		<title level="j">Sistemi intelligenti</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Psycho-acoustics inspired automatic speech recognition</title>
		<author>
			<persName><forename type="first">G</forename><surname>Coro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">V</forename><surname>Massoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Origlia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers &amp; Electrical Engineering</title>
		<imprint>
			<biblScope unit="volume">93</biblScope>
			<biblScope unit="page">107238</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Gunning</surname></persName>
		</author>
		<title level="m">Explainable artificial intelligence (XAI)</title>
		<imprint>
			<publisher>Defense Advanced Research Projects Agency (DARPA)</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">How accents confound: Probing for accent information in end-to-end speech recognition systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Prasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jyothi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3739" to="3753" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Probing acoustic representations for phonetic properties</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liberman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="311" to="315" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Layer-wise analysis of a self-supervised speech representation model</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Chou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Livescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="914" to="921" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">N</forename><surname>Vitale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Origlia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Coro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Computing and Applications</title>
		<imprint>
			<biblScope unit="page" from="1" to="27" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nakamura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Iwano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Furui</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="171" to="184" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Linguistic analysis and learning of dialogical speech in literary texts</title>
		<author>
			<persName><forename type="first">P</forename><surname>Azizova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">JETT</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="86" to="94" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
