<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards an ASR System for Documenting Endangered Languages: A Preliminary Study on Sardinian</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ilaria</forename><surname>Chizzoni</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Free University of Bozen-Bolzano</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Vietti</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Free University of Bozen-Bolzano</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Towards an ASR System for Documenting Endangered Languages: A Preliminary Study on Sardinian</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4960D7D301049506E26D640B86A5A356</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Speech recognition</term>
					<term>Campidanese Sardinian</term>
					<term>Resource and evaluation</term>
					<term>Spoken language documentation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Speech recognition systems are still highly dependent on textual orthographic resources, posing a challenge for low-resource languages. Recent research leverages self-supervised learning on unlabeled data or employs multilingual models pre-trained on high-resource languages for fine-tuning on the target low-resource language. These approaches are effective when the target language has a shared writing tradition, but when we are confronted with primarily spoken languages, whether endangered minority languages, dialects, or regional varieties, we lack not only labeled data but also a shared metric to assess speech recognition performance. We first provide a research background on ASR for low-resource languages and describe the specific linguistic situation of Campidanese Sardinian; we then evaluate five multilingual ASR models using traditional evaluation metrics and an exploratory linguistic analysis. The paper addresses key challenges in developing a tool for researchers to document and analyze the phonetics and phonology of spoken (endangered) languages.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The growing interest in understudied languages has led to categorizing them on the basis of resource availability, defining them as high-, low-, or zero-resource languages. In the narrowest sense, zero- and low-resource languages are those lacking sufficient data to train statistical and machine learning models <ref type="bibr" target="#b0">[1]</ref> [2] <ref type="bibr" target="#b2">[3]</ref>. However, such a technical definition is not adequate to account for the different linguistic scenarios of world languages. As a matter of fact, the terms low- and zero-resource language are still used inconsistently in the literature. Sometimes they describe standard, widely spoken languages with a shared orthography that cannot rely on many hours of transcribed or annotated speech; see Afrikaans, Icelandic, and Swahili in <ref type="bibr" target="#b3">[4]</ref>. Sometimes they describe non-standard, widely spoken languages lacking a shared orthography (either no orthography or multiple proposed orthographies), as for Swiss German dialects <ref type="bibr" target="#b4">[5]</ref> or Nasal and Besemah <ref type="bibr" target="#b5">[6]</ref>. And sometimes they refer to non-standard, endangered languages lacking a shared orthography, like Bribri, Mi'kmaq and Veps <ref type="bibr" target="#b2">[3]</ref>.</p><p>These scenarios are mainly being addressed with two approaches. The first leverages self-supervised learning and uses unlabeled data from the target language to learn linguistic structures <ref type="bibr" target="#b6">[7]</ref>.<note place="foot">CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. These authors contributed equally. Corresponding author: ilaria.chizzoni@unibz.it (I. Chizzoni); alessandro.vietti@unibz.it (A. Vietti). ORCID: 0009-0009-9936-1220 (I. Chizzoni); 0000-0002-4166-540X (A. Vietti).</note> Self-supervised learning is an optimal choice in low-resource settings because it only requires gathering more audio data. However, it can be costly and prone to catastrophic forgetting <ref type="bibr" target="#b5">[6]</ref> <ref type="bibr" target="#b3">[4]</ref>. The second approach involves training a multilingual model on labeled data from highly resourced languages and then applying the trained model to transcribe unseen target languages. This retains the benefits of a supervised learning setting and has proved effective <ref type="bibr" target="#b7">[8]</ref>. Pre-trained multilingual models can then be fine-tuned on a smaller dataset of labeled data in the target language. Since fine-tuning is a straightforward, efficient approach, it is the preferred one for low-resource languages <ref type="bibr" target="#b5">[6]</ref>. However, the success of this approach still depends on the amount of available labeled data in the target language, or on whether more can be generated, e.g., via data augmentation.</p><p>Several data augmentation approaches for low-resource languages are currently being explored, including self-learning <ref type="bibr" target="#b5">[6]</ref>, text-to-speech (TTS) <ref type="bibr" target="#b5">[6]</ref>, and optimized dataset creation <ref type="bibr" target="#b8">[9]</ref>. Bartelds and colleagues <ref type="bibr" target="#b5">[6]</ref> propose data augmentation techniques to develop ASR for minority languages, regional languages, and dialects. They employ a self-training method on Besemah and Nasal, two Austronesian languages spoken in Indonesia. In self-training, a teacher XLS-R model is fine-tuned on manually transcribed data; the teacher model is then used to transcribe unlabeled speech, and a student model is fine-tuned on the combined datasets of manually and automatically transcribed data. 
Since the four hours of manually transcribed speech collected for Besemah and Nasal followed different orthography conventions, the transcriptions were first normalized to working orthographies and then used for fine-tuning. In the same framework, they leveraged a pre-existing TTS system available for Gronings, a Low Saxon language variety spoken in the province of Groningen in the Netherlands, to generate additional synthetic training data from textual sources, achieving strong results <ref type="bibr" target="#b5">[6]</ref>.</p><p>While fine-tuning paired with data augmentation techniques works for low-resource, widely spoken languages, developing a speech recognition system for endangered spoken languages also involves ethical considerations towards the local community. More participatory research is required to understand the native speakers' relationship with the written form of their language, as well as with language technologies. In their position paper <ref type="bibr" target="#b2">[3]</ref>, Liu and colleagues emphasize the importance of creating language technologies in consultation with speakers, activists, and community language workers. They present a case study on Cayuga, an endangered indigenous language of Canada with approximately 50 native elder speakers and an increasing number of young L2 speakers. After gaining insights from the community, they began collaborating on a morphological parser. This tool aids teachers and young L2 students in language learning while gradually providing morphological annotations and segmentations useful for developing ASR systems for researchers. Blaschke and colleagues <ref type="bibr" target="#b9">[10]</ref> surveyed 327 native speakers of German dialects and regional varieties, finding that respondents prefer tools that process speech over text and favor language technology that handles dialect speech as input rather than producing it as output. 
Understanding the needs of the speech community, and differentiating them from those of linguistic researchers, can guide research more effectively.</p><p>This paper outlines the first steps towards a speech recognition system for researchers to aid the systematic analysis of the phonetics and phonology of Campidanese, an endangered language spoken in southern Sardinia. To achieve this goal, we first describe the situation of the speech community of the target language; we then select five multilingual, inference-ready speech recognition models and evaluate them on Campidanese Sardinian. When no multilingual model was available for the speech recognition task, we chose multilingual models fine-tuned on Italian, which we assume to be a relatively close language both genealogically and structurally. We assess the quality of the models' inferences, first by computing the traditional evaluation metrics, i.e., average Word Error Rate (WER) and Character Error Rate (CER), and then by carrying out a qualitative linguistic analysis to gain better insight into which model best meets the needs of language documentation and research. This work is part of "New Perspectives on Diphthong Dynamics (DID)", a joint project between the Free University of Bozen-Bolzano and the Ludwig-Maximilians-Universität München that focuses on diphthong dynamics in two understudied languages, Campidanese Sardinian and Tyrolean, and aims to build a corpus for the linguistic documentation of these two languages.</p></div>
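The five-model inference setup described above can be sketched as follows. The HuggingFace and NeMo model identifiers below are our assumptions inferred from the model names in the paper, not identifiers the paper itself provides, and the loading code is a minimal illustration rather than the authors' pipeline.

```python
# Hypothetical model ids for the five systems named in the paper.
HF_MODELS = {
    "whisper-large-v2": "openai/whisper-large-v2",
    "wav2vec2-xlsr-it": "facebook/wav2vec2-large-xlsr-53-italian",
    "wav2vec2-phoneme": "facebook/wav2vec2-xlsr-53-espeak-cv-ft",
}
NEMO_MODELS = {
    "fc-multilingual": "nvidia/stt_multilingual_fastconformer_hybrid_large_pc",
    "fc-italian": "nvidia/stt_it_fastconformer_hybrid_large_pc",
}

def transcribe_hf(audio_path: str, model_id: str) -> str:
    """Run a transformers ASR pipeline on a local audio file."""
    from transformers import pipeline  # imported lazily; heavy dependency
    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]

def transcribe_nemo(audio_path: str, model_name: str) -> str:
    """Load a NeMo FastConformer checkpoint and transcribe one file.
    The return type of .transcribe() may vary across NeMo versions."""
    from nemo.collections.asr.models import ASRModel
    model = ASRModel.from_pretrained(model_name=model_name)
    return model.transcribe([audio_path])[0]
```

Running everything locally, as the paper does for privacy reasons, only requires looping `transcribe_hf` / `transcribe_nemo` over the audio files of each subset.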
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Campidanese Sardinian</head><p>Sardinian is a Romance language spoken on the island of Sardinia in Italy <ref type="bibr" target="#b10">[11]</ref>; it is recognized as an official minority language and is protected by National Law n. 482/1999 and Regional Law n. 26/1997, but it does not have a written standard <ref type="bibr" target="#b11">[12]</ref>. Sardinia has high internal linguistic diversity, but the two main macro-varieties are Logudorese (ISO 639-3 code src), spoken in the northern sub-region, and Campidanese (ISO 639-3 code sro), spoken in the southern sub-region of Sardinia <ref type="bibr" target="#b11">[12]</ref>. To date, there are no quantitative studies on the actual number of Sardinian speakers. The first sociolinguistic survey <ref type="bibr" target="#b12">[13]</ref>, carried out by Regione Sardegna in 2007 on 2437 speakers, states that 68.4% of the respondents claim to know and speak a variety of the local languages. However, the survey was based on the speakers' self-assessment. As far as Campidanese Sardinian is concerned, Ethnologue lists it as an endangered indigenous language <ref type="bibr" target="#b13">[14]</ref>, and research <ref type="bibr" target="#b11">[12]</ref> claims it is used as a first language only by some older adults in the ethnic community and is no longer taught to children. In 2017, Rattu <ref type="bibr" target="#b14">[15]</ref> carried out a sociolinguistic survey on 310 Cagliari speakers, in which a self-assessment questionnaire was followed by a language test (mostly translation tasks from Italian to Sardinian); only a minority of respondents over the age of 45 achieved good or excellent results. 
The Sardinian Regional Administration presented two proposals for an official standard language: the first in 2001, presented as a linguistic compromise but actually over-representative of Logudorese (Limba Sarda Unificada, LSU), and the second in 2006, mainly based on the central regional variety (Limba Sarda Comuna, LSC) <ref type="bibr" target="#b11">[12]</ref>. The latter remains the one used for communication by the Regional Administration, while in the Cagliari Province a proposal of orthographic rules for Campidanese, called Sa Norma Campidanesa, was put forward in 2009 by the Comitau Scientìficu po sa normalisadura de sa bariedadi campidanesa de sa lìngua sarda <ref type="bibr" target="#b15">[16]</ref>. Without discussing the issue of the orthographic norm, which is inherently political, we would like to point out that these proposals do not seem to have become part of everyday language use by the speech community <ref type="bibr" target="#b16">[17]</ref>. This is primarily because they were not based on any official data regarding the linguistic and sociolinguistic situation or language use <ref type="bibr" target="#b17">[18]</ref>. Therefore, these standards have remained limited to administrative communications. Some tendencies in the speakers' linguistic attitudes emerged from the DID project data collection fieldwork conducted in 2023 in the town of Sinnai. Native speakers of Campidanese are often unfamiliar with the written version of their language. Older native speakers had no way or need to write the language, except in the last decade through social networks. The few young people who do use the language in its written form, to communicate with friends and family via messaging apps, do not use Sa Norma Campidanesa, but rather a spelling that intuitively approximates their pronunciation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Campidanese Sardinian dataset</head><p>We decided to evaluate the speech recognition models on a small sample of highly controlled Sardinian data, in order to carry out a qualitative linguistic analysis of the output transcriptions. The dataset includes short audio clips of read speech with an average length of 3.5 seconds (read_short), long audio clips of read speech with an average length of 23 seconds (read_long), and short audio clips of spontaneous speech with an average length of 5.3 seconds (spontaneous). The read speech is a subset of the corpus gathered during the DID project fieldwork in Sinnai. For read_short, participants were asked to read aloud short sentences developed by the research group, written in an orthography close to Sa Norma Campidanesa. In particular, twenty audio clips from four native speakers (2F and 2M) were selected. Two longer audio clips were selected from the same corpus: one of a female speaker reading a self-authored poem, and another of a male speaker reading an excerpt of a self-authored story. To ensure speech-style variability, chunks of spontaneous speech from ethnographic interviews collected by Mereu <ref type="bibr" target="#b18">[19]</ref> in Cagliari in 2016 were included. Twelve audio chunks were extracted from two of the interviews, conducted with two male native speakers of Campidanese. The orthographic transcripts followed different Campidanese conventions and were either written or validated by native speakers.</p></div>
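The composition of the evaluation set just described can be summarized programmatically. The counts and average durations below are those reported in the text; the manifest layout itself is our own illustration, not the project's actual data format.

```python
# Illustrative manifest of the evaluation dataset (hypothetical structure;
# counts and average durations are those reported in the paper).
SUBSETS = [
    {"name": "read_short", "style": "read", "n_clips": 20, "avg_len_s": 3.5},
    {"name": "read_long", "style": "read", "n_clips": 2, "avg_len_s": 23.0},
    {"name": "spontaneous", "style": "spontaneous", "n_clips": 12, "avg_len_s": 5.3},
]

def total_audio_seconds(subsets) -> float:
    """Approximate total duration from per-subset counts and averages."""
    return sum(s["n_clips"] * s["avg_len_s"] for s in subsets)
```

With the figures above, the whole evaluation set amounts to roughly 20*3.5 + 2*23 + 12*5.3, i.e. about 180 seconds of highly controlled audio, which is why the paper pairs the metrics with a qualitative analysis.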
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Methods</head><p>From HuggingFace's Open ASR Leaderboard <ref type="bibr" target="#b19">[20]</ref>, ready-to-test models with low Real-Time Factor (RTF) values were selected. Of the five tested models, two are multilingual models containing at least one Romance language in their training dataset, i.e., whisper-large-v2 and multilingual-fastconformer-hybrid-large; the other three are multilingual models fine-tuned on Italian datasets and ready for inference, namely it-fastconformer-hybrid-large from NVIDIA and wav2vec2-large-xlsr-53-italian and wav2vec2-xlsr-53-espeak-cv-ft from Facebook.</p><p>OpenAI Whisper is a Transformer sequence-to-sequence multilingual and multitask model trained to perform multilingual speech recognition, speech translation, spoken language identification, and voice activity detection <ref type="bibr" target="#b20">[21]</ref>. We tested it without specifying a language.</p><p>The multilingual FastConformer Hybrid Transducer-CTC model was developed by NVIDIA and combines the FastConformer architecture with a hybrid Transducer-CTC approach <ref type="bibr" target="#b21">[22]</ref>. NVIDIA FastConformers are very competitive in terms of efficiency and computational speed. We tested both the multilingual model, version 1.20.0, trained on Belarusian, German, English, Spanish, French, Croatian, Italian, Polish, Russian, and Ukrainian <ref type="bibr" target="#b21">[22]</ref>, and the Italian model, version 1.20.0, trained specifically on Italian (Mozilla Common Voice 12, Multilingual LibriSpeech and VoxPopuli) <ref type="bibr" target="#b22">[23]</ref>.</p><p>From Facebook we chose Wav2Vec 2.0 XLSR, a model that learns cross-lingual speech representations from the raw waveform of speech in multiple languages during pre-training <ref type="bibr" target="#b23">[24]</ref>. 
We use wav2vec2-large-xlsr-53-italian, the Wav2Vec 2.0 model pre-trained on multilingual data from Multilingual LibriSpeech, Mozilla Common Voice and BABEL and fine-tuned on Italian <ref type="bibr" target="#b24">[25]</ref>.</p><p>To attempt an automatic phonetic transcription, we used wav2vec2-xlsr-53-espeak-cv-ft, the same Wav2Vec 2.0 Large XLSR model fine-tuned on the multilingual Common Voice dataset to recognize phonetic labels <ref type="bibr" target="#b7">[8]</ref>.</p><p>In order to have a standard reference, traditional evaluation metrics for speech recognition systems, i.e., WER and CER, were computed via HuggingFace's evaluate library <ref type="bibr" target="#b25">[26]</ref>. Since the output text was normalized differently by the different models, both reference and hypothesis transcriptions were normalized: all special (non-alphanumeric) characters were removed before computing WER, and special characters as well as whitespace (tabs, spaces and newlines) were removed before computing CER. We made no additional changes to the inferences, and no default model parameters were modified. All tests were run locally to comply with data privacy policies.</p></div>
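The normalization and metric computation just described can be sketched in pure Python. The paper uses HuggingFace's evaluate library; the self-contained re-implementation below, based on a standard Levenshtein distance, is our own illustration of the same procedure, not the authors' script.

```python
import re

def normalize(text: str, for_cer: bool = False) -> str:
    """Strip special (non-alphanumeric) characters; for CER, also strip
    whitespace (tabs, spaces, newlines), as described in the paper."""
    text = re.sub(r"[^\w\s]", "", text)
    if for_cer:
        return re.sub(r"\s+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def edit_distance(ref, hyp) -> int:
    """Levenshtein distance over arbitrary token sequences."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
        prev = cur
    return prev[n]

def wer(reference: str, hypothesis: str) -> float:
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    ref = normalize(reference, for_cer=True)
    hyp = normalize(hypothesis, for_cer=True)
    return edit_distance(ref, hyp) / len(ref)
```

Removing whitespace before CER matters for Sardinian output: a prediction like savida for sa vida counts as two word errors under WER but zero character errors under CER.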
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Models evaluation</head><p>Regarding the WER metric, we assume the models perform possible-word recognition based on their inventory of multilingual or Italian tokens, since no model has been trained or fine-tuned on any Sardinian data. This is why, in our case, average WER is of limited significance. We therefore evaluate performance mainly by looking at CER.</p><p>In Table <ref type="table" target="#tab_0">1</ref> we can see that there is little difference in performance between Whisper medium and large-v2. Surprisingly, however, Whisper medium performs better on long read-speech data, reaching a CER of 0.22 versus 0.36 for Whisper large-v2. This could be due to a better performance of the translation task in Whisper large-v2. However, the larger model performs better on spontaneous speech (CER 0.39) than the medium model (CER 0.52). As shown in Table <ref type="table">2</ref>, both NVIDIA FastConformer models achieve low values on long audios of read speech. While the multilingual FastConformer reaches the best values overall, Wav2Vec XLSR fine-tuned on Italian performs better than the FastConformer fine-tuned on Italian (see Table <ref type="table">3</ref>). Overall, CER is relatively low on long read speech, which is intuitively understandable, considering that the selected models have all been trained mainly on read speech (Mozilla Common Voice data and audiobooks). Poor performance on short audios was also expected, since all the tested models were pre-trained on longer audio chunks, ranging from 20 to 30 seconds <ref type="bibr" target="#b26">[27]</ref> [21] <ref type="bibr" target="#b6">[7]</ref>. Given the similar average length of the audio inputs, it is surprising that every model performs better on short spontaneous speech than on short read speech.</p><p>The relatively low CER values suggest promising potential, particularly for the multilingual models. Therefore, we decided to obtain more phonetically informative outputs to evaluate how well these models generalize beyond word boundaries and language-specific spelling conventions. 
We select wav2vec2-xlsr-53-espeak-cv-ft, a Wav2Vec 2.0 XLSR model fine-tuned on the multilingual Common Voice dataset to recognize phonetic labels <ref type="bibr" target="#b27">[28]</ref>.</p><p>While using the exact same architecture as Wav2Vec2, Wav2Vec2Phoneme maps phonemes of the training languages to the target language using articulatory features <ref type="bibr" target="#b7">[8]</ref>. Since the model outputs a string of tab-separated phonetic labels, we computed only the CER metric. As reference, we used the story Sa tramuntana e su soli, for which phonemic and phonetic transcriptions were provided by Mereu <ref type="bibr" target="#b11">[12]</ref>. The input file is a single 43-second audio of a young female native speaker of Campidanese Sardinian. When comparing the Wav2Vec2Phoneme predictions with the human phonemic transcription we obtain a Phoneme Error Rate (PER) of 0.28, while comparison with the human phonetic transcription lowers the PER to 0.23. These results suggest that automatic transcription into phonemes rather than characters is a path worth exploring, allowing a systematic description of the phonetics and phonology of endangered spoken languages while bypassing the orthography issue. These results align with recent work on cross-lingual transfer <ref type="bibr" target="#b28">[29]</ref> proposing a very similar solution to develop a multilingual phoneme recognizer.</p></div>
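Because wav2vec2-xlsr-53-espeak-cv-ft emits tab-separated phone labels, the error rate above is an edit distance over phone tokens rather than characters. The sketch below is our own re-implementation of that computation, under the assumption of tab-delimited label strings; it is not the study's exact evaluation script.

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance over arbitrary token sequences."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (ref[i - 1] != hyp[j - 1]))
        prev = cur
    return prev[n]

def per(reference: str, hypothesis: str, sep: str = "\t") -> float:
    """Phone Error Rate over sep-delimited phone label strings:
    edit distance over phone tokens / number of reference phones."""
    ref = [p for p in reference.split(sep) if p]
    hyp = [p for p in hypothesis.split(sep) if p]
    return levenshtein(ref, hyp) / len(ref)
```

Tokenizing before alignment is what distinguishes PER from CER here: a multi-character label such as an affricate counts as one unit, not several.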
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Exploratory Linguistic Analysis</head><p>In this section, we present an exploratory linguistic analysis to evaluate to what extent the orthographic transcriptions from the tested ASR models capture the phonetic events present in the speech signal. The analysis is based on the inventory of phonological phenomena described for the Campidanese Sardinian spoken in Cagliari <ref type="bibr" target="#b11">[12]</ref>.</p><p>In the multilingual FastConformer's predictions, some known phonological processes of Campidanese can be recognized. For instance, in Campidanese Sardinian the alveolar tap [ɾ] is an allophone of /r/ in word-medial intervocalic position and a sociophonetic variant of /t/ and /d/ in the Cagliari variety <ref type="bibr" target="#b11">[12]</ref>. In examples 1 and 4, the intervocalic /t/ across word boundaries is transcribed as l (si lui, ma lui), which can be considered a good orthographic approximation of an alveolar tap. Following a process of lenition of voiceless plosives and fricatives, the intervocalic labiodental fricative /f/ across word boundaries is also consistently transcribed as its voiced counterpart v, see asivato in example 1 and con savorza and deno vusti in example 4. Voiceless plosives /p/, /t/, and /k/ in word-medial intervocalic position are expected to be realized with a long duration, and in the predictions they are mostly rendered as geminates, see deppidi and mascetti in example 5, yet not always, see depidi in example 1. We also notice the insertion of paragogic vowels, which in Campidanese are inserted after a final consonant to avoid consonants in word-final coda position <ref type="bibr" target="#b11">[12]</ref>, as in depidi and zinotenesi in example 1 or a rosasa in example 3. 
An exception is esaminat in example 1, where the paragogic vowel was expected and was actually produced in the audio.</p><p>Although this model seems to propose an orthographic transcription close enough to the phonetic one, it sometimes makes systematic choices that are unfaithful to the acoustic signal. We provide an example: /u/, both in word-medial and word-final position, is generally transcribed as o, not only when there is an Italian equivalent or a phonetically close lexical item, e.g., antunietta&gt;antonietta; coru&gt;coro; su&gt;suo; cun&gt;con, but also when the item is unknown to the model, e.g., ollastu&gt;ollasto; dentradura&gt;dentradora. This gives reason to believe that the model might have information about the phonotactic constraints of Italian, e.g., no [u] in word-final position.</p><p>1. esaminat si tui as fatu su percursu cumenti si depit <ref type="foot" target="#foot_0">1</ref>examina si lui asivato subercurso come zi depidi</p><p>2. e si non tenis atrus problemas in sa vida in foras<ref type="foot" target="#foot_1">2</ref> e zinotenesi a tus problema in savira in forez 3. sa vida no es stettia tuttu arrosas<ref type="foot" target="#foot_2">3</ref> savidano e stetti a dotto a rosasa 4. ma tui con sa forza de unu fusti di ollastu<ref type="foot" target="#foot_3">4</ref> ma lui con savorza deno vusti di ollasto 5. no si deppiti imperai ma sceti castiai <ref type="foot" target="#foot_4">5</ref>nosi deppidi imperai mascetti gastiai</p><p>Regarding Whisper large-v2, we notice in some cases a near-perfect Italian translation of the Sardinian input audios, see examples 5 and 6 below; in other cases, a poorer Italian translation with the deletion of repetitions, as in 7. Surprisingly, in examples 8 and 9 we see how the tentative translations (or identifications with the phonetically most similar lexical items in a known language) also target Portuguese. 
Similar behavior is observed in Whisper medium: tentative Italian and Portuguese translations, and hallucinations in both spontaneous and short read input audios.</p><p>5. esaminat si tui as fatu su percursu cumenti si depit esamina se lui ha fatto il suo percorso come si deve</p><p>Similarly to the multilingual FastConformer, Wav2Vec XLSR accounts for many of the phonological phenomena of Campidanese. The voiceless plosives /k/ and /p/, lenited to the voiced fricatives [ɣ] and [β] when found in intervocalic environment across word boundaries <ref type="bibr" target="#b11">[12]</ref>, are transcribed as g and v in gusta vingiara and sugauli in example 13. In the Wav2Vec model, on the other hand, the alveolar tap [ɾ] is rendered as r rather than l, see sirui in example 10.</p><p>10. esaminat si tui as fatu su percursu cumenti si depit einasidu sirui ha sivato su bercursu come zi deperi</p><p>11. e si non tenis atrus problemas in sa vida in foras esino tenesi atosproblema sainsavvira in forese 12. su boi est un animali de meda importantzia <ref type="foot" target="#foot_8">9</ref>su boe e un animale de meda importanza 13. su cauli coit mellus in custa pingiada <ref type="foot" target="#foot_9">10</ref>sugauli coi melusu in gusta vingiara 14. ma tui con sa forza de unu fusti di ollastu madoi con savorza de unovusti diolastu</p><p>Unlike Whisper large-v2, Wav2Vec XLSR never performs translations and, unlike the FastConformer fine-tuned on Italian, does not seem to respect Italian phonotactic constraints, see diolastu in example 14.</p></div>
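Systematic substitution patterns of the kind discussed above (e.g. final u transcribed as o, as in ollastu&gt;ollasto) can be tallied semi-automatically by aligning reference and predicted strings and counting substitution pairs. The alignment helper below is our own sketch of such a check, not part of the study's pipeline.

```python
from collections import Counter

def align(ref: str, hyp: str):
    """Levenshtein alignment with backtrace; returns (op, ref_ch, hyp_ch) triples."""
    m, n = len(ref), len(hyp)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "match" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    return list(reversed(ops))

def substitution_counts(ref: str, hyp: str) -> Counter:
    """Count (reference_char, predicted_char) substitution pairs."""
    return Counter((r, h) for op, r, h in align(ref, hyp) if op == "sub")
```

Aggregating these counts over a whole test set would turn the qualitative observations of this section into frequency tables, helping to avoid cherry-picking.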
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future steps</head><p>The preliminary analysis carried out in this paper provided insight into how various speech recognition models transcribe data in a Romance language not encountered in model training. All evaluated models improve their performance as the audio length increases. The best CER values are achieved on read-speech audio longer than 20 seconds. However, short audios of spontaneous speech with an average length of 5.3 seconds achieved a remarkably low CER, indicating higher accuracy than on the similarly short (3.5-second) read-speech chunks. These results suggest that speech style might also play a role. To investigate whether the models are sensitive to speech style or to other linguistic, speaker-specific, or technical variables, factors such as the topic, the age and gender of the speaker, or the acoustic quality of the audio data should be taken into account. For example, both datasets of spontaneous speech were produced by males over 45, and the models might be biased toward an adult male speaker profile. For the time being, we attribute this to the limited representativeness of the dataset and will investigate it in future work.</p><p>A controlled yet diverse dataset facilitated a qualitative linguistic analysis of the predictions. Interestingly, some models seem to follow the phonotactic constraints of the languages they have been trained on, but at the same time they generalize well to unfamiliar languages, providing fairly accurate, phonetically informed orthographic transcriptions of Campidanese Sardinian. These initial considerations should be validated with tests on a larger corpus, to eliminate data bias, and with a more systematic linguistic analysis, to avoid cherry-picking. 
We also plan to look in detail at the speech recognition models' architectures in order to make an informed choice at the fine-tuning stage.</p><p>In conclusion, it seems that state-of-the-art transcription models, especially multilingual ones, produce phonetically accurate orthographic transcriptions of Campidanese Sardinian and thus provide a promising basis for fine-tuning. Specifically, Wav2Vec2 large XLSR-53 and STT Multilingual FastConformer Hybrid proved to be the best models according to the evaluation metrics and the preliminary linguistic analysis. STT Multilingual FastConformer Hybrid was the best and the most efficient in terms of computational resources, which makes it our first choice for further testing and fine-tuning. However, it is worth noting that speech recognition systems with orthographic output can be costly in terms of human and computational resources, poorly informative for speech researchers, and of little interest to native speakers; recent work on multilingual automatic phonemic recognition therefore seems a viable alternative worth exploring for documenting endangered spoken languages.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Whisper Models</figDesc><table><row><cell>Model</cell><cell>Style</cell><cell>Length (s)</cell><cell>CER</cell><cell>WER</cell></row><row><cell>large-v2</cell><cell>read_short</cell><cell>3.5</cell><cell>0.69</cell><cell>1.02</cell></row><row><cell>large-v2</cell><cell>read_long</cell><cell>23.5</cell><cell>0.36</cell><cell>0.76</cell></row><row><cell>large-v2</cell><cell>spontaneous</cell><cell>5.3</cell><cell>0.39</cell><cell>0.90</cell></row><row><cell>medium</cell><cell>read_short</cell><cell>3.5</cell><cell>0.70</cell><cell>1.00</cell></row><row><cell>medium</cell><cell>read_long</cell><cell>23.5</cell><cell>0.22</cell><cell>0.79</cell></row><row><cell>medium</cell><cell>spontaneous</cell><cell>5.3</cell><cell>0.52</cell><cell>1.12</cell></row></table></figure><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>FastConformer NVIDIA Models</figDesc><table><row><cell>Model</cell><cell>Style</cell><cell>Length (s)</cell><cell>CER</cell><cell>WER</cell></row><row><cell>FC-ML</cell><cell>read_short</cell><cell>3.5</cell><cell>0.69</cell><cell>1.00</cell></row><row><cell>FC-ML</cell><cell>read_long</cell><cell>23.5</cell><cell>0.22</cell><cell>0.79</cell></row><row><cell>FC-ML</cell><cell>spontaneous</cell><cell>5.3</cell><cell>0.34</cell><cell>0.88</cell></row><row><cell>FC-IT</cell><cell>read_short</cell><cell>3.5</cell><cell>0.69</cell><cell>1.00</cell></row><row><cell>FC-IT</cell><cell>read_long</cell><cell>23.5</cell><cell>0.28</cell><cell>0.83</cell></row><row><cell>FC-IT</cell><cell>spontaneous</cell><cell>5.3</cell><cell>0.41</cell><cell>0.97</cell></row></table></figure><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Wav2Vec XLSR Italian</figDesc><table><row><cell>Model</cell><cell>Style</cell><cell>Length (s)</cell><cell>CER</cell><cell>WER</cell></row><row><cell>W2V-IT</cell><cell>read_short</cell><cell>3.5</cell><cell>0.68</cell><cell>1.00</cell></row><row><cell>W2V-IT</cell><cell>read_long</cell><cell>23.5</cell><cell>0.25</cell><cell>0.81</cell></row><row><cell>W2V-IT</cell><cell>spontaneous</cell><cell>5.3</cell><cell>0.36</cell><cell>0.90</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">[He/she] makes sure you have done the proper training.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">And if you have no other problems in your life in general.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">Life has not been all roses.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">Yet you, with the strength of a wild olive trunk.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">It is not to be used but only looked at.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">That it is either one or two years long, and so on and so forth -that it has to -that it has to</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">February sees the start of cheese making.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">The eel hides in the mud.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">The ox is a very important animal.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9">The cabbage cooks best in this pan.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Work funded by the New Perspectives on Diphthong Dynamics (DID) project #I83C22000390005.</p><p>We would like to extend our gratitude to Daniela Mereu for providing the essential data for this research and for her invaluable perspective. We also thank Loredana Schettino and Aleese Block for their support and helpful insights.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Magueresse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Carles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Heetderks</surname></persName>
		</author>
		<idno>arXiv</idno>
		<ptr target="https://arxiv.org/abs/2006.07264" />
		<title level="m">Low-resource languages: A review of past work and future challenges</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">The state and fate of linguistic diversity and inclusion in the NLP world</title>
		<author>
			<persName><forename type="first">P</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Budhiraja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Choudhury</surname></persName>
		</author>
		<idno>CoRR abs/2004.09095</idno>
		<ptr target="https://arxiv.org/abs/2004.09095" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Not always about you: Prioritizing community needs when developing endangered language technology</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Richardson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Hatcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">T</forename><surname>Prud&apos;hommeaux</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:248118721" />
	</analytic>
	<monogr>
		<title level="m">Annual Meeting of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Exploration of Whisper fine-tuning strategies for low-resource ASR</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Qu</surname></persName>
		</author>
		<idno type="DOI">10.1186/s13636-024-00349-3</idno>
		<ptr target="https://doi.org/10.1186/s13636-024-00349-3" />
	</analytic>
	<monogr>
		<title level="j">EURASIP Journal on Audio, Speech, and Music Processing</title>
		<imprint>
			<biblScope unit="page">29</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Spaiche: Extending state-of-the-art ASR models to Swiss German dialects</title>
		<author>
			<persName><forename type="first">C</forename><surname>Sicard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Pyszkowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Gillioz</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2304.11075</idno>
		<idno type="arXiv">arXiv:2304.11075</idno>
		<ptr target="https://arxiv.org/abs/2304.11075" />
	</analytic>
	<monogr>
		<title level="m">Swiss Text Analytics Conference</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Making more of little data: Improving low-resource automatic speech recognition using data augmentation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bartelds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>San</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mcdonnell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Wieling</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.10951</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:258762740" />
	</analytic>
	<monogr>
		<title level="m">Annual Meeting of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">wav2vec 2.0: A framework for self-supervised learning of speech representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2006.11477</idno>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="12449" to="12460" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Simple and effective zero-shot cross-lingual phoneme recognition</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<idno type="DOI">10.21437/interspeech.2022-60</idno>
		<ptr target="https://arxiv.org/abs/2109.11680" />
	</analytic>
	<monogr>
		<title level="m">Interspeech</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Yeroyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Karpov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.01446</idno>
		<ptr target="https://arxiv.org/abs/2406.01446" />
		<title level="m">Enabling ASR for low-resource languages: A comprehensive dataset creation approach</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Blaschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Purschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2402.11968</idno>
		<idno type="arXiv">arXiv:2402.11968</idno>
		<title level="m">What do dialect speakers want? A survey of attitudes towards language technology for German dialects</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Mensching</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E.-M</forename><surname>Remberger</surname></persName>
		</author>
		<idno type="DOI">10.1093/acprof:oso/9780199677108.003.0017</idno>
		<ptr target="https://doi.org/10.1093/acprof:oso/9780199677108.003.0017" />
		<title level="m">The Oxford Guide to the Romance Languages</title>
				<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="270" to="291" />
		</imprint>
	</monogr>
	<note>Sardinian</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Cagliari Sardinian</title>
		<author>
			<persName><forename type="first">D</forename><surname>Mereu</surname></persName>
		</author>
		<idno type="DOI">10.1017/S0025100318000385</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of the International Phonetic Association</title>
		<imprint>
			<biblScope unit="volume">50</biblScope>
			<biblScope unit="page" from="389" to="405" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Oppo</surname></persName>
		</author>
		<title level="m">Le lingue dei sardi. una ricerca sociolinguistica</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Sardinian</title>
		<ptr target="https://www.ethnologue.com/language/sro/" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Rattu</surname></persName>
		</author>
		<title level="m">Repertorio Plurilingue e Variazione Linguistica a Cagliari: I Quartieri di Castello, Marina, Villanova, Stampace, Bonaria e Monte Urpinu</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
		<respStmt>
			<orgName>Università degli Studi di Cagliari</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Master&apos;s thesis</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">F</forename><surname>Eduardo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Amos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stefano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nicola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Massimo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Michele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francesco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ivo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Oreste</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Antonella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Paola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Marco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Paolo</surname></persName>
		</author>
		<title level="m">Arrègulas po ortografia, fonètica, morfologia e fueddàriu de sa Norma Campidanesa de sa Lìngua Sarda</title>
				<imprint>
			<publisher>ALFA EDITRICE</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Efforts to standardise minority languages: The case of Sardinian</title>
		<author>
			<persName><forename type="first">D</forename><surname>Mereu</surname></persName>
		</author>
		<idno type="DOI">10.35998/ejm-2021-0004</idno>
	</analytic>
	<monogr>
		<title level="j">European Journal of Minority Studies</title>
		<imprint>
			<biblScope unit="page" from="76" to="95" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">La distribuzione delle parti del discorso nel parlato e nello scritto campidanese e fenomeni del parlato in una lingua minoritaria di contatto</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gunsch</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
		<respStmt>
			<orgName>Free University of Bozen-Bolzano</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Master&apos;s thesis</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Mereu</surname></persName>
		</author>
		<title level="m">Il sardo parlato a Cagliari: una ricerca sociofonetica</title>
				<meeting><address><addrLine>Milano</addrLine></address></meeting>
		<imprint>
			<publisher>FrancoAngeli</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Srivastav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Majumdar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Koluguri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moumen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gandhi</surname></persName>
		</author>
		<ptr target="https://huggingface.co/spaces/hf-audio/open_asr_leaderboard" />
		<title level="m">Open automatic speech recognition leaderboard</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Robust speech recognition via large-scale weak supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Brockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>McLeavey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2212.04356</idno>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="28492" to="28518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">STT Multilingual FastConformer Hybrid Large PC</title>
		<ptr target="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">STT It FastConformer Hybrid Large PC</title>
		<ptr target="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_it_fastconformer_hybrid_large_pc" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<orgName>Hugging Face</orgName>
		</author>
		<ptr target="https://huggingface.co/docs/transformers/en/model_doc/xlsr_wav2vec2" />
		<title level="m">XLSR-Wav2Vec2 model documentation</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<orgName>Hugging Face</orgName>
		</author>
		<ptr target="https://huggingface.co/facebook/wav2vec2-large-xlsr-53-italian" />
		<title level="m">wav2vec2-large-xlsr-53-italian</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Evaluate: A library for evaluation in machine learning</title>
		<author>
			<orgName>Hugging Face</orgName>
		</author>
		<ptr target="https://github.com/huggingface/evaluate" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Fast conformer with linearly scalable attention for efficient speech recognition</title>
		<author>
			<persName><forename type="first">D</forename><surname>Rekesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kriman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Majumdar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Noroozi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Juang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Hrinchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ginsburg</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:258564901" />
	</analytic>
	<monogr>
		<title level="m">IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<orgName>Hugging Face</orgName>
		</author>
		<ptr target="https://huggingface.co/facebook/wav2vec2-xlsr-53-espeak-cv-ft" />
		<title level="m">wav2vec2-xlsr-53-espeak-cv-ft</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Allophant: Cross-lingual phoneme recognition with articulatory attributes</title>
		<author>
			<persName><forename type="first">K</forename><surname>Glocker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herygers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Georges</surname></persName>
		</author>
		<idno type="DOI">10.21437/interspeech.2023-772</idno>
		<ptr target="https://doi.org/10.21437/Interspeech.2023-772" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Interspeech</title>
				<meeting>Interspeech</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
