Towards an ASR System for Documenting Endangered Languages: A Preliminary Study on Sardinian

Ilaria Chizzoni, Alessandro Vietti
Free University of Bozen-Bolzano

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author. † These authors contributed equally.
ilaria.chizzoni@unibz.it (I. Chizzoni); alessandro.vietti@unibz.it (A. Vietti)
ORCID: 0009-0009-9936-1220 (I. Chizzoni); 0000-0002-4166-540X (A. Vietti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract

Speech recognition systems are still highly dependent on textual orthographic resources, posing a challenge for low-resource languages. Recent research leverages self-supervised learning on unlabeled data or employs multilingual models pre-trained on high-resource languages for fine-tuning on the target low-resource language. These are effective approaches when the target language has a shared writing tradition, but when we are confronted with mainly spoken languages, be they endangered minority languages, dialects, or regional varieties, we lack not only labeled data but also a shared metric to assess speech recognition performance. We first provide a research background on ASR for low-resource languages and describe the specific linguistic situation of Campidanese Sardinian; we then evaluate five multilingual ASR models using traditional evaluation metrics and an exploratory linguistic analysis. The paper addresses key challenges in developing a tool for researchers to document and analyze the phonetics and phonology of spoken (endangered) languages.

Keywords

Speech recognition, Campidanese Sardinian, Resource and evaluation, Spoken language documentation

1. Introduction

The growing interest in understudied languages has led to categorizing them on the basis of resource availability, defining them as high-, low-, or zero-resource languages. In the narrowest sense, zero- and low-resource languages are those lacking sufficient data to train statistical and machine learning models [1] [2] [3]. However, such a technical definition is not adequate to account for the different linguistic scenarios of world languages. As a matter of fact, the terms low- and zero-resource language are still used inconsistently in the literature. Sometimes they describe standard, widely spoken languages with a shared orthography that cannot rely on many hours of transcribed or annotated speech, see Afrikaans, Icelandic, and Swahili in [4]. Sometimes they are used for non-standard, widely spoken languages lacking a shared orthography (no orthography or multiple proposed orthographies), as for Swiss German dialects [5] or Nasal and Besemah [6]. And sometimes they refer to non-standard, endangered languages lacking a shared orthography, like Bribri, Mi'kmaq and Veps [3].

These scenarios are mainly being addressed with two approaches. The first leverages self-supervised learning and uses unlabeled data from the target language to learn linguistic structures [7]. Self-supervised learning is an optimal choice in low-resource settings because it only requires gathering more audio data; however, it seems costly and prone to catastrophic forgetting [6] [4]. The second approach involves training a multilingual model on labeled data from highly-resourced languages and then applying the trained model to transcribe unseen target languages. This retains the benefits of a supervised learning setting and has proved effective [8]. Pre-trained multilingual models can then be fine-tuned on just a smaller dataset of labeled data in the target language. Since fine-tuning is a straightforward, efficient approach, it is the preferred one to address the problem of low-resource languages [6]. However, the success of this approach still depends on the amount of available labeled data in the target language, or on whether it is possible to generate more, e.g., via data augmentation.

Several data augmentation approaches for low-resource languages are currently being explored, including self-learning [6], text-to-speech (TTS) [6] or optimized dataset creation approaches [9]. Bartelds and colleagues [6] propose data augmentation techniques to develop ASR for minority languages, regional languages or dialects. They employ a self-training method on Besemah and Nasal, two Austronesian languages spoken in Indonesia. In self-training, a teacher XLS-R model is fine-tuned on manually transcribed data, the teacher model is used to transcribe unlabeled speech, and a student model is then fine-tuned on the combined set of manually and automatically transcribed data. Since the collected 4 hours of manually transcribed speech for Besemah and Nasal followed different orthography conventions, the transcriptions were first normalized to working orthographies and then used for fine-tuning. In the same framework, they leveraged a pre-existing TTS system available for Gronings, a Low-Saxon language variant spoken in the province of Groningen in the Netherlands, to generate more synthetic training data from textual sources, and they achieved great results [6].

While fine-tuning paired with data augmentation techniques works for low-resource, widely-spoken languages, developing a speech recognition system for endangered spoken languages also involves ethical considerations towards the local community. More participatory research is required to understand the native speakers' relationship with the written form of their language, as well as with language technologies. In their position paper [3], Liu and colleagues emphasize the importance of creating language technologies in consultation with speakers, activists, and community language workers. They present a case study on Cayuga, an endangered indigenous language of Canada with approximately 50 native elder speakers and an increasing number of young L2 speakers. After gaining insights from the community, they began collaborating on a morphological parser. This tool aids teachers and young L2 students in language learning while gradually providing morphological annotations and segmentations useful for developing ASR systems for researchers. Blaschke and colleagues [10] surveyed over 327 native speakers of German dialects and regional varieties, finding that respondents prefer tools that process speech over text and favor language technology that handles dialect speech input rather than output. Understanding the needs of the speech community and differentiating them from those of linguistic researchers can guide research more effectively.

This paper outlines the first steps towards a speech recognition system for researchers to aid the systematic analysis of the phonetics and phonology of Campidanese, an endangered language spoken in southern Sardinia. To achieve this goal, we first describe the situation of the speech community of the target language; we then select five multilingual, ready-for-inference speech recognition models and evaluate them on Campidanese Sardinian. When multilingual models were not available for the speech recognition task, we chose multilingual models fine-tuned on Italian, which we assume to be a relatively close language both genealogically and structurally. We assess the goodness of the models' inferences, first by computing the traditional evaluation metrics, i.e., average Word Error Rate (WER) and Character Error Rate (CER), and then by carrying out a qualitative linguistic analysis to gain better insight into which model best meets the needs of language documentation and research. This work is part of "New Perspectives on Diphthong Dynamics (DID)", a joint project between the University of Bozen and the Ludwig-Maximilians-Universität München, which focuses on the study of diphthong dynamics in two understudied languages, i.e., Campidanese Sardinian and Tyrolean, and aims to build a corpus for the linguistic documentation of these two languages.

2. Campidanese Sardinian

Sardinian is a Romance language spoken on the island of Sardinia in Italy [11]; it is considered an official minority language and is protected by National Law n.482/1999 and Regional Law n.26/1997, but does not have a written standard [12]. Sardinia has a high internal linguistic diversity, but the two main macro-varieties are Logudorese (ISO 639-3 code src), spoken in the northern sub-region, and Campidanese (ISO 639-3 code sro), spoken in the southern sub-region of Sardinia [12]. To date, there are no quantitative studies on the real number of Sardinian speakers. The first sociolinguistic survey [13], carried out by Regione Sardegna in 2007 on 2437 speakers, states that 68.4% of the respondents claim to know and speak a variety of the local languages. However, the survey was based on the speakers' self-assessment. As far as Campidanese Sardinian is concerned, Ethnologue lists it as an endangered indigenous language [14], and research [12] claims it is used as a first language just by some elder adults in the ethnic community and is not taught to children anymore. In 2017, Rattu [15] carried out a sociolinguistic survey on 310 Cagliari speakers, where a self-assessment questionnaire was followed by a language test (mostly translation tasks from Italian to Sardinian), and only a minority of respondents over the age of 45 achieved good or excellent results.

The Sardinian Regional Administration presented two proposals for an official standard language: the first in 2001, presented as a linguistic compromise but actually over-representative of Logudorese (Limba Sarda Unificada, LSU), and the second in 2006, mainly based on the central regional variety (Limba Sarda Comuna, LSC) [12]. The latter remains the one used for communication by the Regional Administration, while in the Cagliari Province a proposal of orthographic rules for Campidanese called Sa Norma Campidanesa was put forward in 2009 by the Comitau Scientìficu po sa normalisadura de sa bariedadi campidanesa de sa lìngua sarda [16]. Without discussing the issue of the orthographic norm, which is inherently political, we would like to point out that these proposals do not seem to have become part of everyday language use by the speech community [17]. This is primarily because they were not based on any official data regarding the linguistic and sociolinguistic situation or language use [18]. Therefore, these standards remained limited to administrative communications.

Some tendencies in the speakers' linguistic attitudes emerged from the DID project data collection fieldwork conducted in 2023 in the city of Sinnai. Native speakers of Campidanese are often unfamiliar with the written version of their language. Elder native speakers had no way or need to write the language, except in the last decade through social networks. Meanwhile, the few young people who use the language even in its written form, to communicate with friends and family via messaging apps, do not use Sa Norma Campidanesa, but rather a transcription that intuitively approximates their pronunciation.
3. Experiments

3.1. Campidanese Sardinian dataset

We decided to evaluate the speech recognition models on a small sample of highly controlled Sardinian data, in order to carry out a qualitative linguistic analysis of the output transcriptions. The dataset includes short audios of read speech with an average length of 3.5 seconds (read_short), long audios of read speech with an average length of 23 seconds (read_long), and short audios of spontaneous speech with an average length of 5.3 seconds (spontaneous). The read speech is a subset of the corpus gathered during the DID project fieldwork in Sinnai. For read_short, participants were asked to read aloud short sentences developed by the research group, using an orthography close to Sa Norma Campidanesa. In particular, twenty audio clips of four native speakers (2F and 2M) were selected. Two longer audio clips were selected from the same corpus: one of a female speaker reading an autograph poem, and another of a male speaker reading an excerpt of an autograph story. To have speech style variability, chunks of spontaneous speech from ethnographic interviews collected by Mereu [19] in Cagliari in 2016 were included. Twelve audio chunks were extracted from two of the interviews, conducted with two male native speakers of Campidanese. The orthographic transcripts followed different Campidanese conventions, either being written or validated by native speakers.
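The composition of the evaluation set lends itself to a simple programmatic summary. Below is a minimal sketch; the manifest entries (file names and individual durations) are invented stand-ins, not the actual DID corpus files.

```python
from collections import defaultdict

# Hypothetical manifest of the evaluation set: file names and durations
# are illustrative placeholders only.
manifest = [
    {"file": "read_short_01.wav", "style": "read_short", "seconds": 3.0},
    {"file": "read_short_02.wav", "style": "read_short", "seconds": 4.0},
    {"file": "read_long_01.wav", "style": "read_long", "seconds": 23.0},
    {"file": "spont_01.wav", "style": "spontaneous", "seconds": 5.3},
]

def average_length(entries):
    """Average clip duration in seconds, grouped by speech style."""
    totals, counts = defaultdict(float), defaultdict(int)
    for e in entries:
        totals[e["style"]] += e["seconds"]
        counts[e["style"]] += 1
    return {style: totals[style] / counts[style] for style in totals}

print(average_length(manifest))
# {'read_short': 3.5, 'read_long': 23.0, 'spontaneous': 5.3}
```

A per-style summary of this kind makes it easy to keep the length statistics reported in the tables below reproducible as the corpus grows.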
3.2. Methods

From HuggingFace's Open ASR Leaderboard [20], ready-to-test models with low Real-Time-Factor (RTF) values were selected. Out of the five tested models, two are multilingual models containing at least one Romance language in their training dataset, i.e., whisper-large-v2 and multilingual-fastconformer-hybrid-large; the other three are multilingual models fine-tuned on Italian datasets and ready for inference, namely it-fastconformer-hybrid-large from NVIDIA, and wav2vec2-large-xlsr-53-italian and wav2vec2-xlsr-53-espeak-cv-ft from Facebook.

OpenAI Whisper is a Transformer sequence-to-sequence multilingual and multitask model trained to perform multilingual speech recognition, speech translation, spoken language identification, and voice activity detection [21]. We tested it without passing a specific language.

The multilingual FastConformer Hybrid Transducer-CTC model is developed by NVIDIA, combining the FastConformer architecture with a hybrid Transducer-CTC approach [22]. NVIDIA FastConformers come across as very competitive for their efficiency and computational speed. We tested both the multilingual model version 1.20.0, trained on Belarusian, German, English, Spanish, French, Croatian, Italian, Polish, Russian, and Ukrainian [22], and the Italian model version 1.20.0, trained specifically on Italian (Mozilla Common Voice 12, Multilingual LibriSpeech and VoxPopuli) [23].

From Facebook we chose Wav2Vec 2.0 XLSR, a model that learns cross-lingual speech representations from the raw waveform of speech in multiple languages during pre-training [24]. We use wav2vec2-large-xlsr-53-italian, the Wav2Vec 2.0 model pre-trained on multilingual data from Multilingual LibriSpeech, Mozilla Common Voice and BABEL and fine-tuned on Italian [25]. To attempt an automatic phonetic transcription, we used wav2vec2-xlsr-53-espeak-cv-ft, the same Wav2Vec 2.0 Large XLSR model fine-tuned on the multilingual Common Voice dataset to recognize phonetic labels [8].

In order to have a standard reference, traditional evaluation metrics for speech recognition systems, i.e., WER and CER, were computed via the evaluate HuggingFace library [26]. Since the output text was normalized differently by the different models, a text normalization was applied to both reference and hypothesis transcriptions, removing every special character (non-alphanumeric characters) before computing WER, and removing special characters and whitespace (tabs, spaces and new lines) before computing CER. We made no additional changes to the inferences, and no default parameters of the models were modified. All tests were run locally to respect data privacy policies.
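Concretely, the evaluation just described reduces to an edit distance over normalized token sequences: words for WER, characters for CER. The following is a minimal pure-Python sketch of that computation, a stand-in for the evaluate library rather than its actual implementation, with the normalization rules as described (special characters stripped before WER; special characters and all whitespace stripped before CER).

```python
import re

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def normalize(text, remove_spaces=False):
    text = re.sub(r"[^\w\s]", "", text)   # drop special (non-alphanumeric) characters
    if remove_spaces:
        text = re.sub(r"\s", "", text)    # additionally drop tabs, spaces, newlines
    return text

def wer(reference, hypothesis):
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    ref = normalize(reference, remove_spaces=True)
    hyp = normalize(hypothesis, remove_spaces=True)
    return edit_distance(ref, hyp) / len(ref)

# e.g. one substituted word out of three reference words:
print(wer("su boi est", "su boe est"))  # 0.3333...
```

Both rates are normalized by the length of the reference, which is why values above 1.0 (as in some tables below) are possible when the hypothesis contains many insertions.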
3.3. Models evaluation

Regarding the WER metric, we assume the models perform approximate word recognition based on their inventory of multilingual or Italian tokens, since no model has been trained or fine-tuned on any Sardinian data. This is why average WER is poorly informative in our case. We therefore evaluate performance mainly by looking at CER.

In Table 1 we can see that there is little difference in performance between Whisper medium and large-v2. Surprisingly, however, Whisper medium performs better on long read-speech data, reaching a CER of 0.22 versus Whisper large-v2 only achieving 0.36. This could be due to a better performance of the translation task in Whisper large-v2. However, the larger model performs better on spontaneous speech (CER 0.39) than the medium model (CER 0.52). As shown in Table 2, both NVIDIA FastConformer models achieve low values on long audios of read speech. While the multilingual FastConformer reaches the best values overall, Wav2Vec XLSR fine-tuned on Italian performs better than the FastConformer fine-tuned on Italian (see Table 3).

Table 1
Whisper Models

Model      Style        Length (s)  CER   WER
large-v2   read_short   3.5         0.69  1.02
large-v2   read_long    23.5        0.36  0.76
large-v2   spontaneous  5.3         0.39  0.90
medium     read_short   3.5         0.70  1.00
medium     read_long    23.5        0.22  0.79
medium     spontaneous  5.3         0.52  1.12

Table 2
NVIDIA FastConformer Models

Model  Style        Length (s)  CER   WER
FC-ML  read_short   3.5         0.69  1.00
FC-ML  read_long    23.5        0.22  0.79
FC-ML  spontaneous  5.3         0.34  0.88
FC-IT  read_short   3.5         0.69  1.00
FC-IT  read_long    23.5        0.28  0.83
FC-IT  spontaneous  5.3         0.41  0.97

Table 3
Wav2Vec XLSR Italian

Model   Style        Length (s)  CER   WER
W2V-IT  read_short   3.5         0.68  1.00
W2V-IT  read_long    23.5        0.25  0.81
W2V-IT  spontaneous  5.3         0.36  0.90

Overall, CER is relatively low on long read speech, which is intuitively understandable, considering that the selected models have all been trained mainly on read speech (Mozilla Common Voice data and audiobooks). Poor performance on short audios was also expected, since all the tested models were pre-trained on longer audio chunks, ranging from 20 to 30 seconds [27] [21] [7]. Given the similar average length of the audio inputs, it is surprising that every model performs better on short spontaneous speech than on short read speech.

The relatively low CER values suggest promising potential, particularly for the multilingual models. Therefore, we decided to obtain more phonetically informative outputs to evaluate how well these models generalize beyond word boundaries and language-specific spelling conventions. We selected wav2vec2-xlsr-53-espeak-cv-ft, a Wav2Vec 2.0 XLSR model fine-tuned on the multilingual Common Voice dataset to recognize phonetic labels [28]. While using the exact same architecture as Wav2Vec2, Wav2Vec2Phoneme maps phonemes of the training languages to the target language using articulatory features [8]. Since the model outputs a string of tab-separated phonetic labels, we computed the CER metric only. As a reference, we used the story Sa tramuntana e su soli, which was phonemically and phonetically transcribed by Mereu [12]. The input file is a single 43-second audio of a young female native speaker of Campidanese Sardinian. When comparing the Wav2Vec2Phoneme predictions with the human phonemic transcription, we get a Phoneme Error Rate (PER) of 0.28, while when comparing them with the phonetic human transcription, PER decreases to 0.23. These results suggest that an automatic transcription into phonemes rather than characters would be a path worth exploring, allowing a systematic description of the phonetics and phonology of endangered spoken languages while bypassing the orthography issue. These results align with recent work on cross-lingual transfer [29] proposing a very similar solution to develop a multilingual phoneme recognizer.
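The PER comparison reported above can be reproduced with the same edit-distance machinery used for WER and CER, treating each tab-separated label in the model output as one symbol. A minimal sketch, where the phone sequences are invented stand-ins rather than the actual Sa tramuntana e su soli transcriptions:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]

def per(reference_phones, hypothesis_labels):
    """Phoneme Error Rate: edit distance over phoneme symbols, normalized
    by the length of the human reference transcription. `hypothesis_labels`
    is a tab-separated label string, the output format of
    wav2vec2-xlsr-53-espeak-cv-ft."""
    hyp = [p for p in hypothesis_labels.split("\t") if p]
    return edit_distance(reference_phones, hyp) / len(reference_phones)

# Invented stand-in data: two substitutions over six reference phones.
reference = ["s", "a", "β", "i", "ð", "a"]
model_output = "s\ta\tv\ti\td\ta"
print(round(per(reference, model_output), 2))  # 0.33
```

Working on symbol sequences rather than raw strings matters here, since IPA labels can span multiple characters and a character-level CER would over-penalize them.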
4. Exploratory Linguistic Analysis

In this section, we present an exploratory linguistic analysis to evaluate to what extent the orthographic transcriptions from the tested ASR models capture the phonetic events present in the speech signal. The analysis is based on the inventory of phonological phenomena described for Campidanese Sardinian spoken in Cagliari [12].

In the multilingual FastConformer's predictions, some known phonological processes of Campidanese can be recognized. For instance, in Campidanese Sardinian the alveolar tap [ɾ] is an allophone of /r/ in word-medial intervocalic position and a sociophonetic variant of /t/ and /d/ in the Cagliari variety [12]. In examples 1 and 4, the intervocalic /t/ across word boundaries (si tui and ma tui) is transcribed as /l/ (si lui, ma lui), which can be considered a good orthographic approximation of an alveolar tap. Following a process of lenition of voiceless plosives and fricatives, the intervocalic labiodental fricative /f/ across word boundaries is also consistently transcribed as its voiced counterpart /v/, see asivato in example 1, and con savorza and deno vusti in example 4. Voiceless plosives /p/, /t/, and /k/ in word-medial intervocalic position are expected to be realized with a long duration; in the predictions they are recognized as geminate sounds, see deppidi and mascetti in example 5, yet not always, see depidi in example 1. We also notice the insertion of paragogic vowels, which in Campidanese are inserted after a final consonant to avoid consonants in word-final coda position [12], as in example 1 depidi and zinotenesi, or a rosasa in example 3. An exception is esaminat in example 1, where the paragogic vowel was expected and actually produced in the audio.

Although this model seems to propose an orthographic transcription close enough to the phonetic one, it sometimes makes systematic choices that are unfaithful to the acoustic signal. We provide an example where /u/, both in word-medial and final position, is generally transcribed as /o/, not only when there is an Italian equivalent or phonetically close lexical item, e.g., antunietta > antonietta, coru > coro, su > suo, cun > con, but also when the item is unknown to the model, ollastu > ollasto, dentradura > dentradora, giving reason to believe that the model might have information about the phonotactic constraints of Italian, e.g., no [u] in word-final position.

1. esaminat si tui as fatu su percursu cumenti si depit
   examina si lui asivato subercurso come zi depidi
   ('[He/she] makes sure you have done the proper training.')

2. e si non tenis atrus problemas in sa vida in foras
   e zinotenesi a tus problema in savira in forez
   ('And if you have no other problems in your life in general.')

3. sa vida no es stettia tuttu arrosas
   savidano e stetti a dotto a rosasa
   ('Life has not been all roses.')

4. ma tui con sa forza de unu fusti di ollastu
   ma lui con savorza deno vusti di ollasto
   ('Yet you, with the strength of a wild olive trunk.')

5. no si deppiti imperai ma sceti castiai
   nosi deppidi imperai mascetti gastiai
   ('It is not to be used but only looked at.')

Regarding Whisper large-v2, we notice in some cases a near-perfect Italian translation of the Sardinian input audios, see examples 5 and 6 below; in other cases, a poorer Italian translation with the deletion of repetitions, as in 7. Surprisingly, in examples 8 and 9 we see how the tentative translations (or identifications with the phonetically most similar lexical items in a known language) also occur into Portuguese. Similar behavior is observed in Whisper medium: tentative Italian and Portuguese translations, and hallucinations both in spontaneous and read short input audios.

5. esaminat si tui as fatu su percursu cumenti si depit
   esamina se lui ha fatto il suo percorso come si deve

6. e si non tenis atrus problemas in sa vida in foras
   se non ha altri problemi in vita in forza

7. chi est o de un annu o de duus annus eccetera eccetera chi depis chi depis
   chi e di un anno o di due anni chi deve essere
   ('That it is either one or two years long, and so on and so forth – that it has to – that it has to.')

8. in su mesi e friaxu si cumentzat a fai su casu
   em cima das evriagens o segundo mes ate faz sucesso
   ('February sees the start of cheese making.')

9. sanguidda si cuat in mesu e su ludu
   sanguidas igual em mesa sulado açuludo
   ('The eel hides in the mud.')

Similarly to the multilingual FastConformer, Wav2Vec XLSR accounts for many of the phonological phenomena of Campidanese. The voiceless plosives /k/ and /p/, lenited to the voiced fricatives [ɣ] and [β] when found in intervocalic environment across word boundaries [12], are transcribed as /g/ and /v/ in gusta vingiara and sugauli in example 13. In the Wav2Vec model, however, the alveolar tap [ɾ] is rendered as /r/ instead of /l/, see sirui in example 10.

10. esaminat si tui as fatu su percursu cumenti si depit
    einasidu sirui ha sivato su bercursu come zi deperi

11. e si non tenis atrus problemas in sa vida in foras
    esino tenesi atosproblema sainsavvira in forese

12. su boi est un animali de meda importantzia
    su boe e un animale de meda importanza
    ('The ox is a very important animal.')

13. su cauli coit mellus in custa pingiada
    sugauli coi melusu in gusta vingiara
    ('The cabbage cooks best in this pan.')

14. ma tui con sa forza de unu fusti di ollastu
    madoi con savorza de unovusti diolastu

Unlike Whisper large-v2, Wav2Vec XLSR never performs translations and, unlike the FastConformer fine-tuned on Italian, it does not seem to respect Italian phonotactic constraints, see diolastu in example 14.

5. Conclusions and Future steps

The preliminary analysis carried out in this paper provided insight into how various speech recognition models transcribe data in a Romance language not encountered in model training. All evaluated models improve their performance as audio length increases. The best CER values are achieved on read speech longer than 20 seconds. However, short audios of spontaneous speech with an average length of 5.3 seconds achieved a remarkably low CER, meaning better precision compared to the similarly short (3.5 seconds) read speech chunks. These results suggest that speech style might also play a role. To investigate whether the models are sensitive to speech style, other linguistic, speaker-specific, or technical variables, such as the topic, the age and gender of the speaker, or the acoustic quality of the audio data, should be taken into account. For example, both datasets of spontaneous speech are produced by males over 45, and the models might be biased toward an adult male speaker profile. For the time being, we attribute this to the poor representativeness of the dataset and will investigate it in future work.

A controlled yet diverse dataset facilitated a qualitative linguistic analysis of the predictions. Interestingly, some models seem to follow the phonotactic constraints of the languages they have been trained on, but at the same time they generalize well to unfamiliar languages, providing quite accurate, phonetically close orthographic transcriptions of Campidanese Sardinian. These initial considerations should be validated with tests on a larger corpus, to eliminate data bias, and with a more systematic linguistic analysis, to avoid cherry-picking. We also plan to look in detail at the speech recognition models' architectures in order to make an informed choice at the fine-tuning phase.

In conclusion, it seems that state-of-the-art transcription models, especially multilingual ones, produce a phonetically accurate orthographic transcription of Campidanese Sardinian and thus provide a promising basis for fine-tuning. Specifically, Wav2Vec2 large XLSR-53 and STT Multilingual FastConformer Hybrid proved to be the best models according to the evaluation metrics and the preliminary linguistic analysis. STT Multilingual FastConformer Hybrid was the best and most efficient in terms of computational resources, which makes it our first choice for further testing and fine-tuning. However, it is worth noting that speech recognition systems with orthographic output can be costly in terms of human and computational resources, poorly informative for speech researchers, and uninteresting to native speakers; whereas recent work on multilingual automatic phonemic recognition seems a viable alternative worth exploring for documenting endangered spoken languages.

Acknowledgments

Work funded by the New Perspectives on Diphthong Dynamics (DID) project #I83C22000390005. We would like to extend our gratitude to Daniela Mereu for providing the essential data for this research and for her invaluable perspective. We also thank Loredana Schettino and Aleese Block for their support and helpful insights.

References

[1] A. Magueresse, V. Carles, E. Heetderks, Low-resource languages: A review of past work and future challenges, arXiv preprint arXiv:2006.07264 (2020).
[2] P. Joshi, S. Santy, A. Budhiraja, K. Bali, M. Choudhury, The state and fate of linguistic diversity and inclusion in the NLP world, CoRR abs/2004.09095 (2020).
[3] Z. Liu, C. Richardson, R. J. Hatcher, E. T. Prudhommeaux, Not always about you: Prioritizing community needs when developing endangered language technology, in: Annual Meeting of the Association for Computational Linguistics, 2022.
[4] Y. Liu, X. Yang, D. Qu, Exploration of Whisper fine-tuning strategies for low-resource ASR, EURASIP Journal on Audio, Speech, and Music Processing 2024 (2024) 29. doi:10.1186/s13636-024-00349-3.
[5] C. Sicard, K. Pyszkowski, V. Gillioz, SPAICHE: Extending state-of-the-art ASR models to Swiss German dialects, in: Swiss Text Analytics Conference, 2023. arXiv:2304.11075.
[6] M. Bartelds, N. San, B. McDonnell, D. Jurafsky, M. B. Wieling, Making more of little data: Improving low-resource automatic speech recognition using data augmentation, in: Annual Meeting of the Association for Computational Linguistics, 2023. doi:10.48550/arXiv.2305.10951.
[7] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.
[8] Q. Xu, A. Baevski, M. Auli, Simple and effective zero-shot cross-lingual phoneme recognition, in: Interspeech, 2021. doi:10.21437/interspeech.2022-60.
[9] A. Yeroyan, N. Karpov, Enabling ASR for low-resource languages: A comprehensive dataset creation approach, arXiv preprint arXiv:2406.01446 (2024).
[10] V. Blaschke, C. Purschke, H. Schütze, B. Plank, What do dialect speakers want? A survey of attitudes towards language technology for German dialects, arXiv preprint arXiv:2402.11968 (2024).
[11] G. Mensching, E.-M. Remberger, Sardinian, in: The Oxford Guide to the Romance Languages, Oxford University Press, 2016, pp. 270–291. doi:10.1093/acprof:oso/9780199677108.003.0017.
[12] D. Mereu, Cagliari Sardinian, Journal of the International Phonetic Association 50 (2020) 389–405. doi:10.1017/S0025100318000385.
[13] A. Oppo, Le lingue dei sardi. Una ricerca sociolinguistica, 2007.
[14] Ethnologue, Sardinian, Campidanese, 2024. URL: https://www.ethnologue.com/language/sro/.
[15] R. Rattu, Repertorio Plurilingue e Variazione Linguistica a Cagliari: I Quartieri di Castello, Marina, Villanova, Stampace, Bonaria e Monte Urpinu, Master's thesis, Università degli Studi di Cagliari, 2017.
[16] B. F. Eduardo, C. Amos, C. Stefano, D. Nicola, M. Massimo, M. Michele, M. Francesco, M. Ivo, P. Pietro, P. Oreste, R. Antonella, S. Paola, S. Marco, Z. Paolo, Arrègulas po ortografia, fonètica, morfologia e fueddàriu de sa Norma Campidanesa de sa Lìngua Sarda, Alfa Editrice, 2009.
[17] D. Mereu, Efforts to standardise minority languages: The case of Sardinian, Europäisches Journal für Minderheitenfragen / European Journal of Minority Studies (2021) 76–95. doi:10.35998/ejm-2021-0004.
[18] S. Gunsch, La distribuzione delle parti del discorso nel parlato e nello scritto campidanese e fenomeni del parlato in una lingua minoritaria di contatto, Master's thesis, Free University of Bozen-Bolzano, 2022.
[19] D. Mereu, Il sardo parlato a Cagliari: una ricerca sociofonetica, FrancoAngeli, Milano, 2019.
[20] V. Srivastav, S. Majumdar, N. Koluguri, A. Moumen, S. Gandhi, et al., Open Automatic Speech Recognition Leaderboard, https://huggingface.co/spaces/hf-audio/open_asr_leaderboard, 2023.
[21] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.
[22] NVIDIA, STT Multilingual FastConformer Hybrid Large PC, 2023. URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc.
[23] NVIDIA, STT IT FastConformer Hybrid Large PC, 2023. URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_it_fastconformer_hybrid_large_pc.
[24] Hugging Face, XLS-R Wav2Vec2 model documentation, 2024. URL: https://huggingface.co/docs/transformers/en/model_doc/xlsr_wav2vec2.
[25] Hugging Face, wav2vec2-large-xlsr-53-italian, 2021. URL: https://huggingface.co/facebook/wav2vec2-large-xlsr-53-italian.
[26] Hugging Face, Evaluate: A library for evaluation in machine learning, 2024. URL: https://github.com/huggingface/evaluate.
[27] D. Rekesh, S. Kriman, S. Majumdar, V. Noroozi, H. Juang, O. Hrinchuk, A. Kumar, B. Ginsburg, Fast Conformer with linearly scalable attention for efficient speech recognition, in: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
[28] Hugging Face, wav2vec2-xlsr-53-espeak-cv-ft, 2021. URL: https://huggingface.co/facebook/wav2vec2-xlsr-53-espeak-cv-ft.
[29] K. Glocker, A. Herygers, M. Georges, Allophant: Cross-lingual phoneme recognition with articulatory attributes, in: Proceedings of Interspeech, 2023. doi:10.21437/interspeech.2023-772.