<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Nesciun Lengaz Lascià Endò: Machine Translation for Fassa Ladin⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giovanni Valer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolò Penzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo Staiano</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Despite the remarkable success recently obtained by Large Language Models, a significant gap in performance still exists when dealing with low-resource languages, which are often poorly supported by off-the-shelf models. In this work we focus on Fassa Ladin, a Rhaeto-Romance linguistic variety spoken by fewer than ten thousand people in the Dolomitic regions, and set out to build the first bidirectional Machine Translation system supporting Italian, English, and Fassa Ladin. To this end, we collected a small though representative corpus comprising 1135 parallel sentences in these three languages, spanning five domains. We evaluated several models, including the open (Meta AI's No Language Left Behind, NLLB-200) and commercial (OpenAI's gpt-4o) state-of-the-art, and indeed found that both obtain unsatisfactory performance. We therefore proceeded to fine-tune the NLLB-200 model on the collected data, using different approaches. We report a comparative analysis of the results obtained, showing that 1) jointly training for multilingual translation (Ladin-Italian and Ladin-English) significantly improves the performance, and 2) knowledge transfer is highly effective (e.g., leveraging similarities between Ladin and Friulian), highlighting the importance of targeted data collection and model adaptation in the context of low-resource/endangered languages for which little textual data is available.</p>
      </abstract>
      <kwd-group>
<kwd>Machine Translation</kwd>
        <kwd>Low Resource Languages</kwd>
        <kwd>Dialects</kwd>
        <kwd>Ladin</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1. Introduction e.g., wrong translations or mixed up Ladin varieties.1
Further, previous works have mainly focused on the
The growing scale of Large Language Models, based on two South Tyrolean varieties, Gherdëina and Badiot [6]:
the Transformer architecture, has led to models with despite having a standardized written form and being
surprising capabilities in a number of tasks, including oficially recognized as a minority language, the Fassa
Machine Translation (MT). However, most of the NLP variety (Fascian) has been mostly overlooked [7], while
community efort is focused on high-resource standard- its speakers rightfully expect access to the same digital
ized languages, leaving behind the vast majority of local tools available for other languages [8].
under-resourced languages. Recent works have demon- We introduce the first dataset of parallel Fassa
Ladinstrated the utility of creating language-specific datasets Italian-English sentences, spanning over multiple
dofor MT [1] and the efectiveness of relatively small quan- mains: literature, news, laws, brochures, and game rules.
tities of high-quality translation data to teach a new lan- We evaluate several out-of-the-box translation systems,
guage to pre-trained LLMs [2, 3]. To date, little work has including the open (Meta AI’s No Language Left Behind,
addressed the Ladin language: even the most recent mod- NLLB-200) and commercial (OpenAI’s gpt-4o)
state-ofels that have included a great number of languages have the-art models, and experiment with both zero-shot
pivotnot been trained with Ladin data [4], due to the scarcity of based and multilingual strategies to obtain satisfactory
freely available parallel corpora (to our knowledge, only performances in bidirectional translation between Fassa
the OPUS corpora [5]), which are also poorly curated – Ladin and Italian/English. Figure 1 provides a schematic
overview of our experiments, which are thoroughly
deCLiC-it 2024: Tenth Italian Conference on Computational Linguistics, scribed in Section 4.</p>
      <p>Dec 04 – 06, 2024, Pisa, Italy Our results show how the collection of small quantities
⋆ iNnoFLaasnsaguLaagdeinL.eft Behind translates to Nesciun Lengaz Lascià Endò of parallel data is very efective in ‘adding’ support for
* Corresponding author. a previously unsupported language to existing
state-of$ giovanni.valer@studenti.unitn.it (G. Valer); the-art models. More specifically, we find that the
NLLBnicolo.penzo@unitn.it (N. Penzo); jacopo.staiano@unitn.it 200 model fine-tuned using a multilingual strategy can
(J. Staiano) outperform even the most capable commercial LLMs (e.g.,
httphstt:/p/sn:i/c/goiltohpuebn.zcoo.mgi/tjhou-vba.iloer(N( G..PVenazleor));;https://www.staiano.net OpenAI gpt-4o).
(J. Staiano) For reproducibility purposes, we make the dataset and
0009-0002-2145-9497 (G. Valer); 0009-0006-8648-3307 (N. Penzo);
0000-0002-1260-4640 (J. Staiano)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License 1See Appendix A.</p>
      <p>Attribution 4.0 International (CC BY 4.0).
g
fi
u
n
gnin lld lld en
t-en en it
ni lld lld lld en
ut-enfi en it lld it</p>
      <p>Fassa Ladin
Parallel Corpora
it en
fra en
fur en
fur
fur
Tain Dev tsTe
r</p>
      <p>lld</p>
    </sec>
    <sec id="sec-2">
      <title>Domains</title>
      <p>Laws
Games
Literature
News
Brochure
ID
OOD
4
instruction
lld en
lld it</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
    </sec>
    <sec id="sec-4">
      <title>BLEU chrF++ BERTscore</title>
      <sec id="sec-4-1">
        <title>2. Linguistic background</title>
        <p>Ladin (ISO 639-3 code: lld) is a Rhaeto-Romance language. (The term ‘Ladin’ can refer to multiple languages; in this paper we use it only in reference to the Ladin of the Dolomites, spoken in the so-called Ladinia brissino-tirolese, across the provinces of Trento, Bolzano, and Belluno.) It has numerous varieties, each one spoken in a different valley: Anpezan (Cortina d’Ampezzo), Badiot (Badia Valley), Fascian (Fassa Valley), Fodom and Col (Upper Cordevole Valley), and Gherdëina (Gardena Valley) [9]. This paper focuses on Fassa Ladin, which is spoken by approximately 8000 people and is further divided into three local varieties: Cazét (upper valley), Brach (lower valley), and Moenat (Moena). However, a standard variety for Fassa Ladin (named Ladin fascian) was established in 1999 and is currently used in official contexts; this is the variety considered in our work.</p>
        <p>From a linguistic standpoint, Fassa Ladin is related to Italian. It also shares some linguistic phenomena with French, such as the fronting of Latin /a/ to /E/, e.g., pater &gt; fr. and lad. père (notice that both Ladin and French are Western Romance languages). Ladin is also closely related to Friulian, another Rhaeto-Romance language [9]. For these reasons we consider Italian, French and Friulian in our experiments. We report in Table 1 an example of a sentence in Ladin, Italian and English.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3. Data</title>
        <sec id="sec-4-2-1">
          <title>Ladin</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>Italian</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>English</title>
          <p>L porta dant azions per didèr dò la medema
oportunità anter eles e ic.</p>
          <p>Promuove azioni per favorire pari
opportunità tra donne e uomini.</p>
          <p>It promotes actions to foster equal
opportunities between women and men.
ature, news, games, laws, and brochures. The literature
subset is an excerpt of a collection of poems and stories
by Galante et al. [10].</p>
          <p>News are sourced from the Province of Trento press
ofice releases 4 and from social networks’ news.5 The
games subset contains parallel sentences from an online
game.6 Laws come from the Statuto del Comune di Moena
(Statute of the Municipality of Moena)7 and the Statuto
del Comun general de Fascia (Statute of the ‘Comun
general de Fascia’).8 Finally, the brochures subset consists in
promotional documents for tourists.9 The latter exhibits
distinct linguistic characteristics, and is characterized by
poorly aligned sentences and more ‘creative’ translations;
an example is provided in Table 2.</p>
          <p>Thus, we used it for out-of-domain testing (see
Section 4.3.1). The dataset compounds to 1135 parallel
sentences, unevenly distributed across domains (see Table 3).</p>
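        <p>For concreteness, the in-domain/out-of-domain split described in Sections 4.1 and 4.3 could be reproduced along the following lines; this is a minimal sketch, and the file name and column layout are our own assumptions, not the released format.</p>
        <preformat>
import random

# Hypothetical TSV layout: one parallel sentence per line, "lld \t it \t en \t domain".
def load_corpus(path="corpus.tsv"):
    with open(path, encoding="utf-8") as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    return [dict(zip(("lld", "it", "en", "domain"), r)) for r in rows]

corpus = load_corpus()                                       # 1135 parallel sentences
ood_test = [r for r in corpus if r["domain"] == "brochure"]  # 57 sentences, out-of-domain test
in_domain = [r for r in corpus if r["domain"] != "brochure"]

random.seed(0)                                # the seed itself is arbitrary here
random.shuffle(in_domain)
valid, test = in_domain[:108], in_domain[108:216]  # two held-out sets, ~10% each
train = in_domain[216:]                            # 862 training sentences
        </preformat>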
          <p>We built the first Fassa Ladin-Italian-English parallel cor- 4https://www.uficiostampa.provincia.tn.it/
pus drawing from multiple resources in 5 domains: liter- 5https://www.facebook.com/UalUnionAutonomistaLadina/
6http://avventuresuimontipallidi.it/
2https://github.com/jo-valer/machine-translation-ladin-fascian 7https://it.wikisource.org/wiki/Comun_de_Moena_-_Statut
3The term ‘Ladin’ can refer to multiple languages. In this paper we 8https://www.consiglio.provincia.tn.it/_layouts/15/dispatcher/doc_
use it only in reference to the Ladin of the Dolomites, spoken in the dispatcher.aspx?app=clex&amp;at_id=21177
so called Ladinia brissino-tirolese, across the provinces of Trento,
9https://www.giornaletrentino.it/cronaca/fiemme-e-fassa/il-libroBolzano, and Belluno. sui-ladini-di-fascia-spacca-presto-altre-4-mila-copie-1.2242774
en: Especially in winter, when work in the fields was less intense.
it: Questi riti venivano celebrati soprattutto in inverno, quando il lavoro nei campi era meno intenso.
(These rites were celebrated mainly in winter, when work in the fields was less intense.)
lld: Soraldut via per l’invern, ajache zacan l’era na sajon de paussa dal lurier te ciamp.</p>
          <p>(Especially during the winter, as it used to be a season of respite from work in the field.)
When English translations were not available we used
DeepL10 to translate Italian into English.</p>
          <p>We chose BLEU and chrF++ metrics in line with previous
work by Haberland et al. [1]. Although Multilingual
BERT does not explicitly support the Ladin language, we
4. Models and Methods assessed during preliminary analyses its alignment with
human similarity judgments on Ladin sentences. For this
In our experiments we used the following machine trans- reason we include it as reference for future work.
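          <p>As an illustration, the three metrics can be computed with the sacrebleu and bert_score packages linked above; the following is a minimal sketch, with placeholder example strings.</p>
          <preformat>
import sacrebleu
from bert_score import score as bertscore

hyps = ["It promotes actions to foster equal opportunities between women and men."]
refs = ["It promotes actions to foster equal opportunities between women and men."]

bleu = sacrebleu.corpus_bleu(hyps, [refs])                # BLEU
chrf = sacrebleu.corpus_chrf(hyps, [refs], word_order=2)  # word_order=2 yields chrF++
# BERTscore backed by Multilingual BERT; Ladin is not officially supported,
# so these scores serve only as a reference.
P, R, F1 = bertscore(hyps, refs, model_type="bert-base-multilingual-cased")
print(bleu.score, chrf.score, F1.mean().item())
          </preformat>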
        </sec>
        <sec id="sec-4-2b-2">
          <title>4.2. Preliminary Experiments</title>
          <p>Firstly, we evaluate the performance of the pre-trained models in translating between Italian and English (it → en and en → it), in order to have a reference for subsequent experiments. The evaluation is performed using our in-domain test set. We also evaluate the performance of the models in translating from Ladin to English, considering the Ladin sentences as if they were written in Italian, French, or Friulian. Such a test gives us a measure of how much a given model is ‘prepared’ to transfer knowledge across these languages. NLLB-200 is the only model pre-trained with Friulian data, thus comparing the models on this language is not possible. Nevertheless, this preliminary experiment is a viable way to investigate which language has the highest similarity to Ladin from the model’s perspective.</p>
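          <p>A minimal sketch of this probing procedure, assuming the Hugging Face facebook/nllb-200-distilled-600M checkpoint (cf. Appendix C) and the Table 1 example sentence:</p>
          <preformat>
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Feed the same Ladin sentence as if it were Italian, French, or Friulian,
# and translate it into English each time.
sentence = "L porta dant azions per didèr dò la medema oportunità anter eles e ic."
for src in ("ita_Latn", "fra_Latn", "fur_Latn"):
    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src)
    inputs = tokenizer(sentence, return_tensors="pt")
    output = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_length=64,
    )
    print(src, "->", tokenizer.batch_decode(output, skip_special_tokens=True)[0])
          </preformat>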
          <p>Preliminary Results. The results presented in Table 4 show how M2M-100 has lower scores for all metrics, and suggest that the best model for our experiments is NLLB-200; for this reason, in the following we consider this model only. We notice a lower performance in en → it compared to it → en according to the untrained metrics, while BERTscore provides comparable verdicts for the two tasks. This is an important finding and has to be recalled when evaluating subsequent experiments. Moreover, Friulian proves to be the most promising language for our fine-tuning purposes, even though Italian also obtains good scores (BLEU score 21.76 vs. 18.52).</p>
          <p>[Table 4: Preliminary results (BLEU, chrF++, BERTscore) of OPUS-MT, M2M-100, and NLLB-200.]</p>
        </sec>
        <sec id="sec-4-2b-3">
          <title>4.3. Transfer Learning Experiments</title>
          <p>The training set consists of 862 parallel Fassa Ladin-Italian-English sentences (i.e., those remaining of the original 1135 sentences after excluding 108 for validation, 108 for in-domain testing, and 57 for out-of-domain testing). As Ladin is not included in the pre-trained NLLB-200 model, we assign it the language code of Friulian, to leverage the similarities between these two languages. In this work we use our dataset for model fine-tuning, a relatively affordable strategy in terms of computational costs (nonetheless, the increasing input context length of current LLMs allows for many-shot in-context learning [17], an approach we leave to future works). We experiment with the following approaches to add Fassa Ladin to the NLLB-200 model:</p>
          <p>Zero-shot Pivot-based Transfer Learning. We fine-tune the model to only translate from English to Ladin (and vice versa), thus ignoring the Italian data. The pivot-based approach has proven to be effective for several languages [18]. We adopt a zero-shot pivot-based approach, meaning we do not fine-tune the model to perform it ⇄ lld, as we assume not to have the data: we investigate whether such a model performs well in it ⇄ lld even though it is not trained with Italian-Ladin pairs. We refer to the model fine-tuned with this approach as ‘NLLB-pivot’.</p>
          <p>Multilingual Translation. We fine-tune the model for joint Ladin-Italian and Ladin-English bidirectional translation. Each batch includes a randomly selected pair of languages, in a single direction. We refer to the model fine-tuned with this approach as ‘NLLB-multi’.</p>
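          <p>A sketch of how the two fine-tuning regimes could share one batching routine, reusing the corpus rows from the sketch in Section 3 (the helper name and direction lists are ours for illustration):</p>
          <preformat>
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Ladin (lld) is absent from NLLB-200, so we reuse the Friulian code for it.
LANG_CODE = {"lld": "fur_Latn", "it": "ita_Latn", "en": "eng_Latn"}
PIVOT_DIRECTIONS = [("en", "lld"), ("lld", "en")]
MULTI_DIRECTIONS = PIVOT_DIRECTIONS + [("it", "lld"), ("lld", "it")]

def make_batch(rows, directions, batch_size=16):
    src, tgt = random.choice(directions)   # one language pair/direction per batch
    sample = random.sample(rows, batch_size)
    tokenizer.src_lang = LANG_CODE[src]
    tokenizer.tgt_lang = LANG_CODE[tgt]
    return tokenizer(
        [r[src] for r in sample],
        text_target=[r[tgt] for r in sample],
        return_tensors="pt", padding=True, truncation=True,
    )
          </preformat>
          <p>Under this framing, NLLB-pivot and NLLB-multi differ only in the direction list passed to make_batch.</p>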
          <sec id="sec-4-2b-3-1">
            <title>4.3.1. Transfer Learning Across Domains</title>
            <p>We evaluated the models’ ability to generalize across different domains by testing them on our out-of-domain test set: the brochures subset (excluded from the training set), amounting to ∼ 5% of the sentences in our entire dataset.</p>
          </sec>
          <sec id="sec-4-2b-3-2">
            <title>4.3.2. Forgetting of Previous Knowledge</title>
            <p>Finally, we investigate whether the fine-tuned models suffer a performance drop in translating Italian to English (and vice versa), thus exploring whether we encounter catastrophic forgetting [19]. We re-evaluate the models on our test set, and compare the results with the scores obtained in the preliminary experiments.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>5. Results</title>
        <p>4.3. Transfer Learning Experiments The performances obtained by the fine-tuned models, for
each translation task and for each test set, are reported
The training set consists of 862 parallel Fassa Ladin- in Table 5. As a strong baseline, we used gpt-4o.
Italian-English sentences (i.e., those remaining of the
original 1135 sentences after excluding 108 for validation, 5.1. Fine-tuning Approaches
108 for in-domain test and 57 for out-of-domain test). As
Ladin is not included in the pre-trained NLLB-200 model, The results show that both fine-tuning approaches are
efwe assign it the language code of Friulian, to leverage the fective in adding Fassa Ladin to the pre-trained NLLB-200
similarities between these two languages. In this work model, increasing the BLEU score baseline of  →
we use our dataset for model fine-tuning, a relatively  from 21.76 to 40+, and outperforming gpt-4o (28.19).
afordable strategy in terms of computational costs. 15 We The two approaches achieve also similar results in  →
experiment with the following approaches to add Fassa . Table 6 provides some examples of translated
senLadin to the NLLB-200 model: tences.</p>
        <p>We do not observe consistently higher scores by
usZero-shot Pivot-based Transfer Learning We fine- ing the zero-shot pivot-based transfer learning approach.
tune the model to only translate from English to Ladin This might be due to the little amount of data used for
(and viceversa), thus ignoring the Italian data. The pivot- fine-tuning, so that training also with Italian-Ladin
parbased approach has proven to be efective for several allel sentences helps by providing more data and higher
languages [18]. We adopt a zero-shot pivot-based ap- diversity. Since we fixed the number of training steps for
proach, meaning we do not fine-tune the model to per- NLLB-pivot and NLLB-multi, the NLLB-multi model has
form  ⇄ , as we assume not to have the data: we seen about half of the Ladin-English batches compared
to NLLB-pivot (the other half being Ladin-Italian).</p>
        <p>This suggests that the multilingual translation
ap15Nonetheless, the increasing input context length of current LLMs proach might be preferable in the context of endangered
ianllothwescfoorncuusirnrgenmt awnoyr-kshooft Aing-acrowntaelxettleaal.rn[1in7g], awphpircohacwheaslesahvoewtno languages for which little data is available, since it acts
future works. as a regularization method during training.
 → 
 → 
 → 
 → 
gpt-4o
NLLB-pivot
NLLB-multi
gpt-4o
NLLB-pivot
NLLB-multi
gpt-4o
NLLB-pivot
NLLB-multi
gpt-4o
NLLB-pivot
NLLB-multi</p>
        <p>Turning to gpt-4o performances, it proves to perform better in lld → en than in lld → it. Its scores are lower compared to our models, but the most significant finding is that it cannot generate text in Fassa Ladin (it/en → lld). NLLB-multi performance in en → lld is much higher than in it → lld (BLEU score 39.75 vs. 32.23), a finding that calls for further analysis, left to future works, to be interpreted. We also observe NLLB-pivot performing poorly in lld → it, but not in it → lld: the zero-shot pivot-based approach appears to work in only one direction, a behavior we discuss in Section 5.3.</p>
      </sec>
      <sec id="sec-4-3-2">
        <title>5.2. Domain Transfer</title>
        <p>Unsurprisingly, a relatively lower performance is observed on the out-of-domain test set, since the original data presents less literal translations; as a consequence, the metrics matching the model output against the ground truth tend towards lower scores. Still, especially when translating into Ladin, both NLLB-based models produce acceptable out-of-domain translations (BLEU scores 21+). The strong out-of-domain performance of gpt-4o, better than our models in understanding out-of-domain Ladin (lld → it/en), shows how the scarcity of fine-tuning data, and its lack of linguistic diversity, has a negative impact on our models’ performance. Another interpretation concerns the robustness of gpt-4o in handling grammatical errors: it may implicitly cast the source lld sentences to another, similar language known by the model, and then correctly translate into the it/en targets (e.g., treating Ladin words as if they were misspelled Italian words).</p>
      </sec>
      <sec id="sec-4-3-3">
        <title>5.3. Forgetting of Previous Knowledge</title>
        <p>Finally, we present the performance shift in it → en and en → it of our fine-tuned models compared to the pre-trained NLLB-200 (Table 7). The idea is to evaluate the catastrophic forgetting phenomenon [20] after adding Fassa Ladin to the model, via the difference in BLEU scores. NLLB-multi produces slightly better translations after fine-tuning: this is expected, as it is better fitted to our domain. NLLB-pivot, however, shows a strong drop in en → it (−32.41), but not in it → en (+1.78).</p>
        <p>[Table 7: BLEU score difference (Δ) of NLLB-pivot and NLLB-multi with respect to the pre-trained NLLB-200, on it → en and en → it.]</p>
        <p>This suggests that after fine-tuning, the model’s encoder retained the ability to handle Italian inputs, while the decoder ‘forgot’ how to generate Italian outputs. This also explains NLLB-pivot’s low performance in lld → it, but relatively high scores in it → lld.</p>
        <p>The problem of ‘forgetting’ could be mitigated by including English-Italian sentence pairs during fine-tuning.</p>
      </sec>
      </sec>
      <sec id="sec-4-4">
        <title>6. Limitations</title>
        <p>A major limitation of this work lies in the small amount of data used for fine-tuning, and in its lack of linguistic variety (most of the sentences are drawn from laws). This has a considerable impact on our MT model, which struggles with out-of-domain translations.</p>
        <p>In general, as suggested by Ramponi [8], it would be important to assess the needs of the local community, in order to focus the efforts towards the most useful domains of application.</p>
      </sec>
      <sec id="sec-4-5">
        <title>7. Conclusions</title>
        <p>In this work, we show that it is possible to add a specific language variety to a pre-trained MT model using a small amount of fine-tuning data (fewer than 900 parallel sentences). To add Fassa Ladin, we fine-tune the model using as a starting point a similar language included in NLLB-200: Friulian.</p>
        <p>This approach significantly improves the performance. Moreover, in such conditions, fine-tuning with parallel sentences in more than two languages proves to help regularization and to improve translations, with respect to a zero-shot pivot-based transfer learning approach.</p>
        <p>Future work includes extending the dataset with new resources and domains, improving the alignment quality, and including human evaluation of translation quality. Adding data from other Ladin varieties might be a viable solution to improve the low performance caused by unknown words. Moreover, experimenting with translated words from vocabulary entries could be beneficial for Fassa Ladin, a language variety that has scarce parallel data but various publicly accessible vocabularies.</p>
      </sec>
      <sec id="sec-4-6">
        <title>A. Previous Ladin corpora</title>
        <p>Three datasets from the OPUS corpora, namely Wikipedia, QED, and Ubuntu, contain parallel Ladin-Italian data. Unfortunately, none of these provide information about the language variety of the sentences (e.g., the ones mentioned in Section 2). Some of them also present non-aligned sentences (see examples in Table 8).</p>
        <p>[Table 8: Examples of non-aligned sentence pairs in the OPUS corpora. Wikipedia – it: Sono usciti complessivamente tre numeri. (A total of three issues were released.) / lld: Ie la prima plata ladina[1]. (It’s the first ladin page[1].) QED – it: E gli uomini delà , Meli esponilo Holly mise San , in estat’ teston’ (And the men delà , Meli expose it Holly put San , in estat’ teston’ (sic)) / lld: Si te serf demò la lum canche la se n va , te mencia l soreie demò canche l taca a fiochèr (If you only need light when it goes out , you only miss the sun when it starts snowing)]</p>
      </sec>
      <sec id="sec-4-7">
        <title>B. Prompt for gpt-4o</title>
        <p>###INTRODUCTION###
You are an expert translator specialized in
low-resource languages and dialects.</p>
        <p>Your core competence is bidirectional translation
between italian (IT), english (EN), and fassa
ladin (LLD) languages.
###INSTRUCTIONS###
You will be provided with information on the
source language (SOURCE_LANG), a textual input
(SOURCE_TEXT), and a target language (TARGET_LANG).
Your task is to accurately translate SOURCE_TEXT
from language SOURCE_LANG to language TARGET_LANG,
producing TARGET_TEXT.</p>
        <p>Your output is a JSON file with exactly the
following schema:
{
"SOURCE_LANG": str, \\the value of SOURCE_LANG.
"TARGET_LANG": str, \\the value of TARGET_LANG.
"TARGET_TEXT": str, \\the translation output.
}</p>
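        <p>The prompt might be wired to the API along the following lines; this is a sketch under our own assumptions, since the paper does not report the request parameters, and the JSON response mode is our choice.</p>
        <preformat>
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = "..."  # the prompt shown above, verbatim

def translate(source_lang, source_text, target_lang):
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # our assumption
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps({
                "SOURCE_LANG": source_lang,
                "SOURCE_TEXT": source_text,
                "TARGET_LANG": target_lang,
            })},
        ],
    )
    return json.loads(response.choices[0].message.content)["TARGET_TEXT"]

print(translate("IT", "Promuove azioni per favorire pari opportunità.", "EN"))
        </preformat>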
      </sec>
      <sec id="sec-4-8">
        <title>C. Implementation details</title>
        <p>All experiments were conducted on Google Colab using a single NVIDIA T4 15GB GPU; the fine-tuning process required approximately 1 hour.</p>
        <p>We fine-tune NLLB-200’s distilled 600M variant using the Adafactor optimizer [21], with a learning rate of 1.5 · 10−4 and 500 warm-up iterations. We use a batch size of 16 sentences.</p>
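        <p>For reference, these hyper-parameters map onto the Hugging Face Adafactor implementation roughly as follows; this is a sketch, and the flag settings are our assumption (an explicit learning rate requires disabling Adafactor’s relative step sizes, and the warm-up schedule type is not specified in the paper).</p>
        <preformat>
from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor, get_constant_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

optimizer = Adafactor(
    model.parameters(),
    lr=1.5e-4,             # learning rate reported above
    relative_step=False,   # needed when passing an explicit lr
    scale_parameter=False,
    warmup_init=False,
)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=500)
        </preformat>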
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="r16"><mixed-citation>[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation></ref>
      <ref id="r17"><mixed-citation>[17] R. Agarwal, A. Singh, L. M. Zhang, B. Bohnet, L. Rosias, S. Chan, B. Zhang, A. Anand, Z. Abbas, A. Nova, J. D. Co-Reyes, E. Chu, F. Behbahani, A. Faust, H. Larochelle, Many-shot in-context learning, 2024. URL: https://arxiv.org/abs/2404.11018. arXiv:2404.11018.</mixed-citation></ref>
      <ref id="r18"><mixed-citation>[18] Y. Kim, P. Petrov, P. Petrushkov, S. Khadivi, H. Ney, Pivot-based transfer learning for neural machine translation between non-English languages, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 866–876. URL: https://aclanthology.org/D19-1080. doi:10.18653/v1/D19-1080.</mixed-citation></ref>
      <ref id="r19"><mixed-citation>[19] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, Y. Zhang, An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2024. URL: https://arxiv.org/abs/2308.08747. arXiv:2308.08747.</mixed-citation></ref>
      <ref id="r20"><mixed-citation>[20] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, Y. Bengio, An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2015. URL: https://arxiv.org/abs/1312.6211. arXiv:1312.6211.</mixed-citation></ref>
      <ref id="r21"><mixed-citation>[21] N. Shazeer, M. Stern, Adafactor: Adaptive learning rates with sublinear memory cost, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 4596–4604. URL: https://proceedings.mlr.press/v80/shazeer18a.html.</mixed-citation></ref>
    </ref-list>
  </back>
</article>