<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Speech Recognition for Documenting Endangered Languages: A Preliminary Study on Sardinian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilaria Chizzoni</string-name>
          <email>ilaria.chizzoni@unibz.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Vietti</string-name>
          <email>alessandro.vietti@unibz.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen-Bolzano</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Speech recognition systems are still highly dependent on textual orthographic resources, posing a challenge for low-resource languages. Recent research leverages self-supervised learning on unlabeled data or employs multilingual models pre-trained on high-resource languages for fine-tuning on the target low-resource language. These are effective approaches when the target language has a shared writing tradition; but when we are confronted with mainly spoken languages, be they endangered minority languages, dialects, or regional varieties, we lack not only labeled data but also a shared metric to assess speech recognition performance. We first provide a research background on ASR for low-resource languages and describe the specific linguistic situation of Campidanese Sardinian; we then evaluate five multilingual ASR models using traditional evaluation metrics and an exploratory linguistic analysis. The paper addresses key challenges in developing a tool for researchers to document and analyze the phonetics and phonology of spoken (endangered) languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Speech recognition</kwd>
        <kwd>Campidanese Sardinian</kwd>
        <kwd>Resource and evaluation</kwd>
        <kwd>Spoken language documentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>The growing interest in understudied languages has led
to categorizing them on the basis of resource availability,
defining them as high-, low-, or zero-resource languages.
In the narrowest sense, zero- and low-resource languages
are those lacking sufficient data to train statistical and
machine learning models [1] [2] [3]. However, such a
technical definition is not adequate to account for the
different linguistic scenarios of world languages. As a
matter of fact, in the literature, the terms low- and
zero-resource are used in different ways. Sometimes, they
describe standard, widely spoken languages with a shared
orthography that cannot rely on many hours of transcribed
or annotated speech, see Afrikaans, Icelandic, and Swahili
in [4]. Sometimes, they are used for non-standard, widely
spoken languages lacking a shared orthography (no
orthography or multiple proposed orthographies), as for
Swiss German dialects [5] or Nasal and Besemah [6]. And
sometimes they refer to non-standard, endangered languages
lacking a shared orthography, like Bribri, Mi’kmaq and Veps [3].</p>
      </sec>
      <sec id="sec-1-2">
        <p>These scenarios are mainly being addressed with two
approaches. The first leverages self-supervised learning,
using unlabeled data from the target language to learn
linguistic structures [7]. The second fine-tunes multilingual
pre-trained models, combining fine-tuning with techniques
such as self-training [6], text-to-speech (TTS) [6] or
optimized dataset creation approaches [9]. Bartelds and
colleagues [6] propose data augmentation techniques to
develop ASR for minority languages, regional languages
or dialects. They employ a self-training method on
Besemah and Nasal, two Austronesian languages spoken
in Indonesia. In self-training, a teacher XLS-R model
is fine-tuned on manually transcribed data; the teacher
model is used to transcribe unlabeled speech, and then
a student model is fine-tuned on the combined datasets
of manually and automatically transcribed data. Since
the collected 4 hours of manually transcribed speech
for Besemah and Nasal followed different orthographic
conventions, the transcriptions were first normalized to
working orthographies and then used for fine-tuning.</p>
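<p>The teacher-student loop described above can be sketched as follows. Everything here is a toy stand-in (lookup-table "models" and invented file names, none of which come from the paper), meant only to show the data flow of self-training, not real ASR fine-tuning:</p>

```python
# Toy stand-in for the self-training (teacher-student) loop: "fine-tuning"
# just memorizes audio -> transcript pairs, and unseen clips get a crude
# fallback guess, so that the data flow can run end to end.
def fine_tune(pairs):
    table = dict(pairs)
    def model(audio):
        # Memorized clips are transcribed exactly; unseen ones get a guess.
        return table.get(audio, audio.removesuffix(".wav").replace("_", " "))
    return model

# 1. Fine-tune a teacher model on manually transcribed speech.
labeled = [("clip_01.wav", "sa vida"), ("clip_02.wav", "su boi")]
teacher = fine_tune(labeled)

# 2. Use the teacher to transcribe unlabeled speech (pseudo-labels).
unlabeled = ["sa_forza.wav"]
pseudo_labeled = [(audio, teacher(audio)) for audio in unlabeled]

# 3. Fine-tune a student on manual + automatic transcriptions combined.
student = fine_tune(labeled + pseudo_labeled)
print(student("sa_forza.wav"))
```

<p>In the actual setup of Bartelds and colleagues [6], both teacher and student are XLS-R models and step 3 trains on the union of the 4 hours of manual transcriptions and the pseudo-labeled speech.</p>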
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Campidanese Sardinian</title>
      <p>They also used a TTS system available for Gronings, a Low-Saxon
language variant spoken in the province of Groningen in
the Netherlands, to generate more synthetic training data
from textual sources, achieving great results [6].</p>
      <p>While fine-tuning paired with data augmentation
techniques works for low-resource, widely-spoken languages,
developing a speech recognition system for endangered
spoken languages also involves ethical considerations
towards the local community. More participatory
research is required to understand the native speakers’
relationship with the written form of their language, as
well as with language technologies. In their position
paper [3], Liu and colleagues emphasize the importance
of creating language technologies in consultation with
speakers, activists, and community language workers.</p>
      <p>They present a case study on Cayuga, an endangered
indigenous language of Canada with approximately 50
native elder speakers and an increasing number of young
L2 speakers. After gaining insights from the
community, they began collaborating on a morphological parser.</p>
      <p>This tool aids teachers and young L2 students in language
learning while gradually providing morphological
annotations and segmentations useful for developing ASR
systems for researchers. Blaschke and colleagues [10]
surveyed over 327 native speakers of German dialects
and regional varieties, finding that respondents prefer
tools that process speech over text and favor language
technology that handles dialect speech input rather than
output. Understanding the needs of the speech
community and differentiating them from those of linguistic
researchers can guide research more effectively.</p>
      <p>This paper outlines the first steps towards a speech
recognition system for researchers to aid the systematic
analysis of the phonetics and phonology of Campidanese,
an endangered language spoken in southern Sardinia.</p>
      <p>To achieve this goal, we first describe the situation of
the speech community of the target language; we then
select five multilingual, ready-for-inference speech
recognition models and evaluate them on Campidanese
Sardinian. When multilingual models were not available for
the speech recognition task, we chose multilingual models
fine-tuned on Italian, which we assume to be a relatively
close language both genealogically and structurally. We
assess the goodness of the models’ inferences, first by
computing the traditional evaluation metrics, i.e., average
Word Error Rate (WER) and Character Error Rate (CER),
and then carrying out a qualitative linguistic analysis to
gain better insight into which model best meets the needs
of language documentation and research. This work
is part of “New Perspectives on Diphthong Dynamics
(DID)”, a joint project between the University of Bozen
and the Ludwig-Maximilians-Universität München,
which focuses on the study of diphthong dynamics in two
understudied languages, i.e., Campidanese Sardinian and
Tyrolean, and aims to build a corpus for the linguistic
documentation of these two languages.</p>
      <p>Sardinian is a Romance language spoken on the island
of Sardinia in Italy [11]; it is considered an official minority
language and is protected by National Law n.482/1999
and Regional Law n.26/1997 but does not have a written
standard [12]. Sardinia has a high internal linguistic
diversity, but the two main macro-varieties are Logudorese
(ISO 639-3 src), spoken in the northern sub-region,
and Campidanese (ISO 639-3 sro), spoken in the
southern sub-region of Sardinia [12]. To date, there are
no quantitative studies on the real number of Sardinian
speakers. The first sociolinguistic survey [13], carried
out by Regione Sardegna in 2007 on 2437 speakers, states
that 68.4% of the respondents claim to know and speak
a variety of the local languages. However, the survey
was based on the speakers’ self-assessment. As far as
Campidanese Sardinian is concerned, Ethnologue lists it
as an endangered indigenous language [14], and research
[12] claims it is used as a first language just by some
elder adults in the ethnic community and is no longer
taught to children. In 2017, Rattu [15] carried out a
sociolinguistic survey on 310 speakers in Cagliari, where a
self-assessment questionnaire was followed by a language
test (mostly translation tasks from Italian to Sardinian),
and only a minority of respondents over the age of 45
achieved good or excellent results.</p>
      <p>The Sardinian Regional Administration presented two
proposals for an official standard language: the first in
2001, presented as a linguistic compromise but actually
over-representative of Logudorese (Limba Sarda
Unificada, LSU), and the second in 2006, mainly based on the
central regional variety (Limba Sarda Comuna, LSC) [12].</p>
      <p>The latter remains the one used for communication by the
Regional Administration, while in the Cagliari Province a
proposal of orthographic rules for Campidanese, called Sa
Norma Campidanesa, was put forward in 2009 by the
Comitau Scientìficu po sa normalisadura de sa bariedadi
campidanesa de sa lìngua sarda [16]. Without discussing
the issue of the orthographic norm, which is inherently
political, we would like to point out that these proposals
do not seem to have become part of everyday language
use by the speech community [17]. This is primarily
because they were not based on any official data regarding
the linguistic and sociolinguistic situation or language
use [18]. Therefore, these standards remained limited to
administrative communications.</p>
      <p>Some tendencies in the speakers’ linguistic attitudes
emerged from the DID project data collection fieldwork
conducted in 2023 in the city of Sinnai. Native speakers
of Campidanese are often unfamiliar with the written
version of their language. Elder native speakers had
no way or need to write the language, except in the last
decade through social networks. In contrast, the few young
people who use the language even in its written version,
to communicate with friends and family via message
service apps, do not use Sa Norma Campidanesa, but
rather use a transcription that intuitively approximates
their pronunciation.</p>
    </sec>
    <sec id="sec-exp">
      <title>3. Experiments</title>
      <sec id="sec-exp-1">
        <title>3.1. Campidanese Sardinian dataset</title>
        <p>We decided to evaluate the speech recognition models
on a small sample of highly controlled Sardinian data, in
order to carry out a qualitative linguistic analysis of the
output transcriptions. The dataset includes short audios
of read speech with an average length of 3.5 seconds
(read_short), long audios of read speech with an average
length of 23 seconds (read_long), and short audios of
spontaneous speech with an average length of 5.3 seconds
(spontaneous). Read speech is a subset of the corpus
gathered during the DID project fieldwork in Sinnai. For
the read_short, participants were asked to read aloud
short sentences developed by the research group, using
an orthography close to Sa Norma Campidanesa. In
particular, twenty audio clips of four native speakers (2F and
2M) were selected. Two longer audio clips were selected
from the same corpus: one of a female speaker reading an
autograph poem, and another of a male speaker reading
an excerpt of an autograph story. To have speech style
variability, chunks of spontaneous speech from
ethnographic interviews collected by Mereu [19] in Cagliari
in 2016 were included. Twelve audio chunks were
extracted from two of the interviews conducted with two
male native speakers of Campidanese. The orthographic
transcripts followed different Campidanese conventions,
either being written or validated by native speakers.</p>
      </sec>
      <sec id="sec-exp-2">
        <title>3.2. Methods</title>
        <p>The multilingual FastConformer Hybrid
Transducer-CTC model is developed by NVIDIA,
combining the FastConformer architecture with a hybrid
Transducer-CTC approach [22]. NVIDIA FastConformers
come across as very competitive for their efficiency
and computational speed. We tested both the
multilingual model version 1.20.0, trained on Belarusian, German,
English, Spanish, French, Croatian, Italian, Polish,
Russian, and Ukrainian [22], and the Italian model version
1.20.0, trained specifically on Italian (Mozilla Common
Voice 12, Multilingual LibriSpeech and VoxPopuli) [23].
From Facebook we chose Wav2Vec 2.0 XLSR, a model
that learns cross-lingual speech representations from the
raw waveform of speech in multiple languages during
pre-training [24]. We use wav2vec2-large-xlsr-53-italian,
the Wav2Vec 2.0 model pre-trained on
multilingual data from Multilingual LibriSpeech, Mozilla
Common Voice and BABEL and fine-tuned on Italian [25].
To attempt an automatic phonetic transcription we used
wav2vec2-xlsr-53-espeak-cv-ft, the same Wav2Vec
2.0 Large XLSR model, fine-tuned on the multilingual
Common Voice dataset to recognize phonetic labels [8].</p>
        <p>In order to have a standard reference, traditional
evaluation metrics for speech recognition systems like WER
and CER were computed via the evaluate HuggingFace
library [26]. Since the output text was normalized
differently by the different models, a text normalization was
done on both reference and hypothesis transcriptions,
removing every special character (non-alphanumeric
characters) before computing WER, and removing special
characters and spaces (tabs, spaces and new lines)
before computing CER. We made no additional changes to
the inferences, and no default parameters of the models
were modified. All tests were run locally to respect data
privacy policies.</p>
      </sec>
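<p>The text normalization and metric computation used for evaluation can be sketched as follows. This is a minimal stand-in implemented with a plain Levenshtein distance rather than the evaluate library; the sentence pair is example 1 from the analysis below, and the exact regular expressions are our own assumption about what "removing special characters" amounts to:</p>

```python
import re

def levenshtein(a, b):
    # Edit distance between two sequences (of words or characters).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def normalize_for_wer(text):
    # Strip every non-alphanumeric character; whitespace is kept,
    # since it still separates words.
    return re.sub(r"[^\w\s]", "", text)

def normalize_for_cer(text):
    # Strip special characters AND all whitespace (tabs, spaces, new lines).
    return re.sub(r"[^\w]", "", text)

def wer(reference, hypothesis):
    ref = normalize_for_wer(reference).split()
    hyp = normalize_for_wer(hypothesis).split()
    return levenshtein(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    ref = normalize_for_cer(reference)
    hyp = normalize_for_cer(hypothesis)
    return levenshtein(ref, hyp) / len(ref)

# Reference vs. a model hypothesis (example 1 from the linguistic analysis).
ref = "esaminat si tui as fatu su percursu cumenti si depit"
hyp = "examina si lui asivato subercurso come zi depidi"
print(f"WER={wer(ref, hyp):.2f} CER={cer(ref, hyp):.2f}")
```

<p>On pairs like this one, CER comes out far below WER: almost every word differs from the reference, yet most characters are shared, which is why the paper evaluates performance mainly by CER.</p>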
      <sec id="sec-2-2">
        <title>3.3. Models evaluation</title>
        <p>From HuggingFace’s Open ASR Leaderboard [20],
ready-to-test models with low Real-Time-Factor (RTF)
values were selected. Out of the five tested models, two are
multilingual models containing at least one Romance
language in their training dataset, i.e., whisper-large-v2
and multilingual-fastconformer-hybrid-large;
and three were multilingual models fine-tuned on
Italian datasets and ready for inference, this is the case
for it-fastconformer-hybrid-large from NVIDIA
and wav2vec2-large-xlsr-53-italian and
wav2vec2-xlsr-53-espeak-cv-ft from Facebook.</p>
        <p>OpenAI Whisper is a Transformer
sequence-to-sequence multilingual and multitask model trained to
perform multilingual speech recognition, speech
translation, spoken language identification, and voice
activity detection [21]. We tested it without passing a
specific language.</p>
        <p>Regarding the WER metric, we assume the models
perform possible word recognition based on the inventory
of multilingual or Italian tokens, since they have not
been trained or fine-tuned on any Sardinian data. This
is why, in our case, average WER is poorly significant.
We therefore evaluate performance mainly by looking at
CER.</p>
        <p>In Table 1 we can see there is little difference in the
performance between Whisper medium and large-v2.
Surprisingly, however, Whisper medium performs better on
long read-speech data, reaching a CER of 0.22 versus
Whisper large-v2 only achieving 0.36. This could be due
to a better performance of the translation task in Whisper
large-v2. However, the larger model performs better on
spontaneous speech (CER 0.39) than the medium model
(CER 0.52).</p>
        <sec id="sec-2-2-1">
          <p>While using the exact same architecture as Wav2Vec2,
Wav2Vec2Phoneme maps phonemes of the training
languages to the target language using articulatory features
[8]. Since the model outputs a string of tab-separated
phonetic labels, we computed the CER metric only. As a
reference, we used the story Sa tramuntana e su soli, for
which phonemic and phonetic transcriptions were
provided by Mereu [12]. The input file is a single 43-second
audio of a young female native speaker of Campidanese
Sardinian. When comparing the Wav2VecPhoneme
predictions with the human phonemic transcription we get a
Phoneme Error Rate (PER) of 0.28, while when comparing
it with the phonetic human transcription, PER decreases
to 0.23. These results suggest that an automatic
transcription into phonemes rather than characters would be a
path worth exploring, allowing a systematic description
of the phonetics and phonology of endangered spoken
languages, while bypassing the orthography issue. These
results align with recent work on cross-lingual
transfer [29] proposing a very similar solution to develop a
multilingual phoneme recognizer.</p>
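<p>A PER like the one above can be computed from the model’s tab-separated label output in the same way as CER, only over label sequences instead of characters. A minimal sketch follows; the label strings below are illustrative toy sequences, not the actual transcriptions of Sa tramuntana e su soli:</p>

```python
def edit_distance(a, b):
    # Levenshtein distance over arbitrary sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def per(reference, hypothesis, sep="\t"):
    # Split on the label separator so that multi-character phone labels
    # (e.g. affricates) are counted as single units, unlike in CER.
    ref = reference.split(sep)
    hyp = hypothesis.split(sep)
    return edit_distance(ref, hyp) / len(ref)

# Illustrative tab-separated phone labels: one substitution out of twelve.
reference = "s\ta\tt\tr\ta\tm\tu\tn\tt\ta\tn\ta"
hypothesis = "s\ta\t\u027e\tr\ta\tm\tu\tn\tt\ta\tn\ta"
print(round(per(reference, hypothesis), 3))
```

<p>Treating each label as one unit is the reason PER, rather than CER over the raw string, is the appropriate metric for phonetic output.</p>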
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Exploratory Linguistic Analysis</title>
      <sec id="sec-3-1">
        <p>In this section, we present an exploratory linguistic
analysis to evaluate to what extent the orthographic
transcriptions from the tested ASR models capture the phonetic
events present in the speech signal. The analysis is based
on the inventory of phonological phenomena described
for Campidanese Sardinian spoken in Cagliari [12].</p>
        <table-wrap id="tab-w2v-it">
          <caption><p>wav2vec2-large-xlsr-53-italian (W2V-IT): average audio length, CER, and WER per dataset.</p></caption>
          <table>
            <thead>
              <tr><th>Dataset</th><th>Length (s)</th><th>CER</th><th>WER</th></tr>
            </thead>
            <tbody>
              <tr><td>read_short</td><td>3.5</td><td>0.68</td><td>1.00</td></tr>
              <tr><td>read_long</td><td>23.5</td><td>0.25</td><td>0.81</td></tr>
              <tr><td>spontaneous</td><td>5.3</td><td>0.36</td><td>0.90</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As shown in Table 2, both NVIDIA FastConformer
models achieve low values on long audios of read
speech. While multilingual FastConformer reaches the
best values overall, Wav2Vec XLSR fine-tuned on
Italian performs better than the multilingual FastConformer
fine-tuned on Italian (see Table 3).</p>
        <p>Overall, CER is relatively low on long read speech,
which is intuitively understandable, considering the
selected models have all been trained mainly on read speech
(Mozilla Common Voice data and audio books). Poor
performance on short audios was also expected, since all the
tested models were pre-trained on longer audio chunks,
ranging from 20 to 30 seconds [27] [21] [7]. Given the
similar average length of the audio inputs, it is surprising
that every model performs better on short spontaneous
speech than on short read speech.</p>
        <p>The relatively low CER values suggest promising
potential, particularly for the multilingual models.
Therefore, we decided to get more phonetically informative
outputs to evaluate how well these models generalize
beyond word boundaries and language-specific spelling
conventions. We select wav2vec2-xlsr-53-espeak-cv-ft,
a Wav2Vec 2.0 XLSR model fine-tuned on the multilingual
Common Voice dataset to recognize phonetic labels [28].</p>
        <p>In multilingual FastConformer’s predictions some
known phonological processes of Campidanese can be
recognized. For instance, in Campidanese Sardinian
the alveolar tap [ɾ] is an allophone of /r/ in word-medial
intervocalic position and a sociophonetic variant of /t/
and /d/ in the Cagliari variety [12]. In examples 1 and
4, the intervocalic /t/ across word boundaries (si lui and
ma lui) is transcribed as /l/, which can be considered
a good orthographic approximation to an alveolar tap.
Following a process of lenition of voiceless plosives and
fricatives, the intervocalic labiodental fricative /f/ across
word boundaries is also consistently transcribed as its
voiced counterpart /v/, see example 1 asivato, example 4
con savorza and deno vusti. Voiceless plosives /p/, /t/, and
/k/ in word-medial intervocalic position are expected to
be realized with a long duration; in the predictions they are
recognized as geminate sounds, see example 5 in deppidi
and mascetti, yet not always, see example 1 depidi. We
also notice the insertion of paragogic vowels, which in
Campidanese are inserted after a final consonant to avoid
a consonant in word-final coda position [12], as in example
1 depidi and zinotenesi or a rosasa in example 3. An
exception is esaminat in example 1, where the vowel was
expected and actually produced in the audio.</p>
        <p>Although this model seems to propose an orthographic
transcription close enough to the phonetic one, it
sometimes makes systematic choices that are unfaithful to the
acoustic signal. We provide an example where /u/, both
in word-medial and final position, is generally transcribed
as /o/, not only when there is an Italian equivalent or
phonetically close lexical item, e.g., antunietta&gt;antonietta;
incoru&gt;coro; su&gt;suo; cun&gt;con, but also when the item is
unknown to the model, ollastu&gt;ollasto; dentradura&gt;dentradora,
giving reason to believe that the model might
have information about the phonotactic constraints of
Italian, e.g., no [u] in word-final position.</p>
        <p>9. sanguidda si cuat in mesu e su ludu 8
sanguidas igual em mesa sulado açuludo</p>
        <p>Similarly to multilingual FastConformer, Wav2Vec
XLSR accounts for many of the phonological
phenomena of Campidanese. The voiceless plosives /k/ and /p/,
lenited to voiced fricatives [ɣ] and [β] when found in
intervocalic environment across word boundaries [12], are
transcribed as /g/ and /v/ in gusta vingiara and sugauli
in example 13, while in the Wav2Vec model the alveolar
tap [ɾ] is rendered as /r/ instead of /l/, see sirui in
example 10.</p>
        <p>Translations (footnote numbers as marked in the examples):
1. [He/she] makes sure you have done the proper training.
2. And if you have no other problems in your life in general.
3. Life has not been all roses.
4. Yet you, with the strength of a wild olive trunk.
5. It is not to be used but only looked at.
6. That it is either one or two years long, and so on and so forth – that
it has to – that it has to.
7. February sees the start of cheese making.
8. The eel hides in the mud.
9. The ox is a very important animal.
10. The cabbage cooks best in this pan.</p>
        <p>1. esaminat si tui as fatu su percursu cumenti si depit 1</p>
        <p>examina si lui asivato subercurso come zi depidi
2. e si non tenis atrus problemas in sa vida in foras 2</p>
        <p>e zinotenesi a tus problema in savira in forez</p>
      </sec>
      <sec id="sec-3-2">
        <p>3. sa vida no es stettia tuttu arrosas 3
savidano e stetti a dotto a rosasa</p>
      </sec>
      <sec id="sec-3-3">
        <p>4. ma tui con sa forza de unu fusti di ollastu 4
ma lui con savorza deno vusti di ollasto</p>
      </sec>
      <sec id="sec-3-4">
        <p>5. no si deppiti imperai ma sceti castiai 5
nosi deppidi imperai mascetti gastiai</p>
        <p>Regarding Whisper large-v2, we notice in some cases
a near-perfect Italian translation of the Sardinian input
audios, see examples 5 and 6 below; in other cases, a
poorer Italian translation with the deletion of repetitions,
as in 7. Surprisingly, in examples 8 and 9 we see how
tentative translations (or identifications with the
phonetically most similar lexical items in a known language)
also happen into Portuguese. Similar behavior is observed
in Whisper medium: tentative Italian and Portuguese
translations, and hallucinations both in spontaneous and
read short input audios.</p>
        <p>5. esaminat si tui as fatu su percursu cumenti si depit
esamina se lui ha fatto il suo percorso come si deve
6. e si non tenis atrus problemas in sa vida in foras
se non ha altri problemi in vita in forza
7. chi est o de un annu o de duus annus eccetera eccetera
chi depis chi depis 6
chi e di un anno o di due anni chi deve essere
8. in su mesi e friaxu si cumentzat a fai su casu 7
em cima das evriagens o segundo mes ate faz sucesso
10. esaminat si tui as fatu su percursu cumenti si depit
einasidu sirui ha sivato su bercursu come zi deperi
11. e si non tenis atrus problemas in sa vida in foras
esino tenesi atosproblema sainsavvira in forese
12. su boi est un animali de meda importantzia 9
su boe e un animale de meda importanza
13. su cauli coit mellus in custa pingiada 10</p>
        <p>sugauli coi melusu in gusta vingiara
14. ma tui con sa forza de unu fusti di ollastu</p>
        <p>madoi con savorza de unovusti diolastu</p>
      </sec>
      <sec id="sec-3-5">
        <p>Unlike Whisper large-v2, Wav2Vec XLSR never
performs translations and, unlike the FastConformer
fine-tuned on Italian, does not seem to respect the Italian
phonotactic constraints, see diolastu in example 14.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions and Future steps</title>
      <sec id="sec-4-1">
        <p>The preliminary analysis carried out in this paper
provided insight into how various speech recognition models
transcribe data in a Romance language not encountered
in the model training. All evaluated models improve
their performance as the audio length increases. Best
CER values are achieved on audio of read speech longer
than 20 seconds. However, short audios of spontaneous
speech with an average length of 5.3 seconds achieved a
remarkably low CER, meaning better precision compared
to the similarly short (3.5 seconds) read speech chunks.
These results suggest that speech style might also play a
role. To investigate whether the models are sensitive to
speech style, other linguistic, speaker-specific, or
technical variables, such as the topic, age, gender of the speaker,
or the acoustic quality of the audio data, should be taken
into account. For example, both datasets of spontaneous
speech are produced by males over 45, and models might
be biased toward an adult male speaker profile. For the
time being, we attribute it to the poor representativeness
of the dataset and will investigate it in future work.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>Work funded by the New Perspectives on Diphthong
Dynamics (DID) project #I83C22000390005.</p>
        <p>We would like to extend our gratitude to Daniela
Mereu for providing the essential data for this research
and for her invaluable perspective. We also thank
Loredana Schettino and Aleese Block for their support
and helpful insights.</p>
        <p>A controlled yet diverse dataset facilitated a
qualitative linguistic analysis of the predictions. Interestingly,
some models seem to follow the phonotactic constraints
of the languages they have been trained on, but at the
same time they generalize well to unfamiliar languages,
providing quite accurate phonetically-like orthographic
transcription of Campidanese Sardinian. These initial
considerations should be validated with tests on a larger
corpus to eliminate data bias and a more systematic
linguistic analysis to avoid cherry-picking. We also plan
to look in detail at the speech recognition models’
architectures in order to make an informed choice at the
fine-tuning phase.</p>
        <p>In conclusion, it seems that state-of-the-art
transcription models, especially multilingual ones, produce a
phonetically accurate orthographic transcription of
Campidanese Sardinian and thus provide a promising basis for
fine-tuning. Specifically, Wav2Vec2 large XLSR-53 and
STT Multilingual FastConformer Hybrid proved to
be the best models according to the evaluation metrics
and preliminary linguistic analysis. STT Multilingual
FastConformer Hybrid was the best and most efficient
in terms of computational resources, which makes it our
first choice for further testing and fine-tuning.
However, it is worth noting that speech recognition systems
with orthographic output can be costly in terms of
human and computational resources, poorly informative for
speech researchers, and uninteresting to native speakers;
whereas recent work on multilingual automatic
phonemic recognition seems a viable alternative worth
exploring for documenting endangered spoken languages.
</p>
        <p>[14] Ethnologue, Sardinian, Campidanese, 2024. URL: https://www.ethnologue.com/language/sro/.</p>
        <p>[15] R. Rattu, Repertorio Plurilingue e Variazione Linguistica a Cagliari: I Quartieri di Castello, Marina, Villanova, Stampace, Bonaria e Monte Urpinu, Master’s thesis, Università degli Studi di Cagliari, 2017.</p>
        <p>[16] B. F. Eduardo, C. Amos, C. Stefano, D. Nicola, M. Massimo, M. Michele, M. Francesco, M. Ivo, P. Pietro, P. Oreste, R. Antonella, S. Paola, S. Marco, Z. Paolo, Arrègulas po ortografia, fonètica, morfologia e fueddàriu de sa Norma Campidanesa de sa Lìngua Sarda, ALFA EDITRICE, 2009.</p>
        <p>[17] D. Mereu, Efforts to standardise minority languages: The case of Sardinian, Europäisches Journal für Minderheitenfragen. European Journal of Minority Studies (2021) 76–95. doi:10.35998/ejm-2021-0004.</p>
        <p>[18] S. Gunsch, La distribuzione delle parti del discorso nel parlato e nello scritto campidanese e fenomeni del parlato in una lingua minoritaria di contatto, Master’s thesis, Free University of Bozen-Bolzano, 2022.</p>
        <p>[19] D. Mereu, Il sardo parlato a Cagliari: una ricerca sociofonetica, FrancoAngeli, Milano, 2019.</p>
        <p>[20] V. Srivastav, S. Majumdar, N. Koluguri, A. Moumen, S. Gandhi, et al., Open automatic speech recognition leaderboard, https://huggingface.co/spaces/hf-audio/open_asr_leaderboard, 2023.</p>
        <p>[21] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518. doi:10.48550/arXiv.2212.04356.</p>
        <p>[22] NVIDIA, STT Multilingual FastConformer Hybrid Large PC, 2023. URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc.</p>
        <p>[23] NVIDIA, STT It FastConformer Hybrid Large PC, 2023. URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_it_fastconformer_hybrid_large_pc.</p>
        <p>[24] Hugging Face, XLS-R Wav2Vec2 model documentation, 2024. URL: https://huggingface.co/docs/transformers/en/model_doc/xlsr_wav2vec2.</p>
        <p>[25] Hugging Face, wav2vec2-large-xlsr-53-italian, 2021. URL: https://huggingface.co/facebook/wav2vec2-large-xlsr-53-italian.</p>
        <p>[26] Hugging Face, Evaluate: A library for evaluation in machine learning, 2024. URL: https://github.com/huggingface/evaluate.</p>
        <p>[27] D. Rekesh, S. Kriman, S. Majumdar, V. Noroozi, H. Juang, O. Hrinchuk, A. Kumar, B. Ginsburg, Fast Conformer with linearly scalable attention for efficient speech recognition, 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2023) 1–8. URL: https://api.semanticscholar.org/CorpusID:258564901.</p>
        <p>[28] Hugging Face, wav2vec2-xlsr-53-espeak-cv-ft, 2021. URL: https://huggingface.co/facebook/wav2vec2-xlsr-53-espeak-cv-ft.</p>
        <p>[29] K. Glocker, A. Herygers, M. Georges, Allophant: Cross-lingual phoneme recognition with articulatory attributes, in: Proceedings of Interspeech, 2023. doi:10.21437/interspeech.2023-772.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>