
Arbuli sunnu: a Sicilian-Italian Parallel Treebank

Caterina Maria Cappello

Sabrina D'Alì

Mario Guglielmetti

Elisa Di Nuovo

Cristina Bosco

European Commission, Joint Research Centre (JRC), Via Enrico Fermi 2749, 21027 Ispra (VA), Italia
Università di Torino, Dipartimento di Informatica, Corso Svizzera 185, 10149 Torino, Italia

2025

The Natural Language Processing (NLP) community has recently begun to engage with endangered languages and dialects, which encode culturally different perspectives and local knowledge. Regardless of the usefulness and applicability of NLP tools for such languages, creating resources for dialects increases our knowledge of them, encourages the community to study them further, and supports the preservation of an important heritage. As part of this endeavour, we focus on Sicilian, a dialect spoken in Sicily, with a rich cultural history. Preserving Sicilian is crucial to maintaining Southern Italy's linguistic diversity. In this paper, we present the first release of a novel treebank called Sicilian3bank. On the one hand, to improve the usability of this resource and provide access to non-Sicilian speakers, all sentences are linked to their translation into Italian, resulting in a 1:1 parallel resource. On the other hand, by applying the Universal Dependencies format, a widely used standard for the annotation of treebanks, we pave the way for data-driven cross-linguistic research. We hope that this work can serve as a basis for further linguistic research and computational applications for the Sicilian dialect.

Keywords: Sicilian, treebank, parallel texts, Universal Dependencies, translation

1. Introduction

Recent developments in generative Artificial Intelligence (genAI) have increasingly highlighted the importance of taking greater account of the variety of languages spoken in the world. Developing tools and resources to deal with a language has meaningful effects, the most important of which is improved awareness of the underlying cultural heritage, an aspect that can be crucial for achieving better performance by Large Language Models (LLMs) in several tasks.

According to [1], the world's living languages can be categorised into 500 institutional languages and a further 6,500 local vernaculars, or oral languages. While institutional languages feature standardised orthographies and widespread literacy, the local languages include ancestral languages, with an unbroken history of oral transmission, and languages in danger of disappearing. Most Natural Language Processing (NLP) tools and resources developed until now target almost exclusively institutional languages, since only in the very last years has the NLP community begun to engage with local and endangered languages. Therefore, the challenges to address are still many.

In this paper, we focus on the first steps of developing a resource for one of the most spoken Italian dialects, which features in a long tradition of studies in linguistics, but has not been considered enough in NLP until now.1 The aim of this study goes beyond introducing a specific novel resource and consists of starting a discussion on the challenges that can be encountered when NLP meets a dialect or a language without a standardised orthography and reference grammar.2 Starting this discussion may be especially relevant in the context of the CLiC-it conference, since Italy is characterised by a one-of-a-kind linguistic diversity in the European landscape, where diatopic variation implicitly encodes local knowledge, cultural traditions, artistic expressions, and the history of its speakers [3]. Compared with high-resource languages, which have extensive amounts of digital data and resources available, Italian dialects are under-resourced, lacking sufficient digital representation and support.

The language observed in our study is Sicilian, as the title arbuli sunnu suggests, literally 'trees (they) are', showing a common predicate-initial structure. Sicilian is a vernacular language with local functions that include intergenerational knowledge transmission. The resource presented comprises diachronic and diatopic variants, enabling the analysis of linguistic changes in certain phenomena along these axes. In addition, it features orthographic variability due to the non-standardised transfer from oral to written form.

In order to make the resource accessible to a bigger audience, we provide the Italian translation in a 1:1 alignment setting. We decided to translate into Italian rather than English to underline the importance of mitigating the over-reliance on English [4].

It is beyond the scope of this article to cover all the challenges associated with developing a treebank3 for Sicilian; we focused mainly on the phenomena that have a major impact at the (morpho-)syntactic level. By showing some of the major challenges in the treebank annotation, we hope to pave the way for the future development of an expanded resource and for a discussion of the phenomena involved.

The paper is organised as follows: the next section (Sec. 2) presents an overview of related work, followed by Sec. 3, which describes the data collection and annotation process for the first release of the Sicilian3bank, including the translation of Sicilian sentences into Italian to create a parallel corpus. In Sec. 4 we show the parallel architecture of the treebank and the annotation methodology we followed. This section also highlights the challenges we faced in developing a treebank for Sicilian. Finally, the last section (Sec. 5) presents conclusions and future work.

1 This paper has been revised for English using the LLaMa 3.3 70B model through the GPT@JRC platform, an internal JRC testbed for LLMs [2]. All cited links were last accessed on the 12th of June 2025. Some of the reported examples have been shortened due to space constraints.

2 In linguistics, the distinction between a language and a dialect is not always clear-cut and is often influenced by political and sociocultural factors rather than purely linguistic ones. A dialect is typically considered a regional or social variety of a language, but varieties such as Sicilian, which may lack official status or standardisation, are often labelled as dialects despite possessing many characteristics of a distinct language. For this reason, we use the terms language and dialect interchangeably when referring to Sicilian, to reflect its complex sociopolitical status.

3 A treebank is a corpus enriched with (morpho-)syntactic annotations.

2. Related Work

This section provides a brief introduction to the Universal Dependencies (UD) formalism and existing parallel treebanks in UD, followed by a discussion on language variation in NLP, with a focus on dialects and, finally, on Sicilian.

2.1. Universal Dependencies and Parallel Treebanks

UD [5] is a framework for annotating morphology and syntax consistently across languages. Over the last decade, UD has become the de facto standard for treebanks. As its name suggests, UD represents syntax using dependency trees, instead of constituency trees. This is because dependency trees are perceived as better suited to represent free or flexible word order languages [6]. Furthermore, models using dependency representations have achieved promising results in many NLP tasks (e.g. in machine translation and information extraction) [6, p. 3].

UD comprises treebanks in more than 100 languages, including low-resource languages (see Sec. 2.2 for a definition), e.g. Irish, Faroese, Uyghur. Among the UD treebanks, there are also parallel treebanks, i.e. treebanks that have been translated into other languages and subsequently annotated. The biggest effort in this respect has been made for the PUD treebank [7], which consists of 1,000 sentences in 18 languages (the majority originally in English). Translators were asked to opt for the translation that is fluent while sharing the most grammatical features with the original. Another example of parallel treebanks in UD is ParTUT [8], which contains sentences from different domains in English, Italian and French. In ParTUT, the alignment is not 1:1 for all the sentences [9], though the texts coming from a more formal register, i.e. those from the JRC-Acquis corpus [10], are almost all aligned 1:1.

The 1:1 alignment has been considered especially helpful in learning contexts, and has therefore been applied in the case of the English Second Language (ESL) [11] and VALICO-UD [12] treebanks, resources which include learner texts in English and Italian, respectively. We decided to follow their example for Sicilian3bank, as it might be used for language learning.

2.2. Language Variation in NLP

It is possible to distinguish two main groups of languages based on the availability of resources: high-resource languages and low-resource languages [4]. The former are languages (excluding sign languages) that have a large collection of machine-readable texts or, at the very least, a solid foundation upon which to build corpora, treebanks, and similar linguistic resources [4]. These include English, Mandarin Chinese, Arabic, and French, as well as Portuguese, Italian, Dutch, Standard Arabic, and Czech to a somewhat lesser but still significant extent [4]. Many languages, particularly local varieties and dialects, are at risk of disappearing in a relatively short time due to the lack of attention and resources they receive.

In the European context, standard languages exhibit notable diatopic variation [3]. Failing to prioritise research on language variation in the field of NLP would mean losing not only the languages as systems of communication, but also the identities, social values, and heritage of the societies they represent. It is not only a matter of increasing efforts towards these languages, but of doing so with an appropriate approach [3]. A shared goal should be established, and knowledge must be made accessible to all and subsequently disseminated beyond the community itself through engagement initiatives and the promotion of active participation. In addressing low-resource and endangered languages, a novel approach should be applied, based on respect, cultural awareness, and sensitivity to the wishes of their speakers.

2.3. Dialects in NLP

Focussing now specifically on dialects, it is important to note that their marginalisation is not a phenomenon exclusive to the field of NLP. A negative connotation of dialects is often rooted in complex historical, social, and political dynamics. For example, in Italy, regional varieties, dialects, and other non-standard linguistic forms often coexist with the standard language in a situation known as dilalìa [13], where there is no rigid compartmentalisation of the languages, as happens in diglossìa, but Italian is still preferred in formal and high-prestige domains, and dialects in informal, everyday, or familial interactions. The significant linguistic loss experienced by Sicilian and other Italian dialects can also be attributed to the Fascist dictatorship, which aimed to achieve linguistic unification by suppressing regional language varieties and all that was perceived as foreign. Furthermore, the Italian language was instrumental in constructing national unity, serving as a symbol of collective identity at the expense of non-standard varieties, which were increasingly marginalised in both institutional and public domains [3].

One notable effort to address dialects and local languages is the MaiBaam project, a multi-dialectal Bavarian UD treebank [14]. It represents the first UD treebank for the Bavarian language, a West Germanic variety spoken in southern Germany, Austria, and northern Italy (South Tyrol). The major challenges encountered by the MaiBaam project authors, which are common issues within this field, are the difficulty of collecting texts and of finding native-speaking annotators. While we are facing the former challenge, we did not encounter the latter, as the majority of our team members are native speakers of Sicilian. Nevertheless, there remains the necessity for a strong linguistic knowledge of the dialect being worked on, a requirement that is rarely met, given that dialects are rarely studied actively but are instead acquired through everyday use. The solution adopted by the MaiBaam group to address this issue is making their work publicly available, which enables them to engage with the population and collect contributions from the community.4 The Bavarian dialect is also represented for tasks such as Named Entity Recognition (NER) and Dialect Identification (DID), thanks to BarNER, a medium-sized corpus collecting Wikipedia and tweet data [15]. The authors in [16] show how such resources can be effectively utilised in NLP.

A similar initiative is the COSER-UD treebank [17], the first syntactically annotated corpus of spoken peninsular rural Spanish distributed within the UD framework [18]. The treebank addresses features such as word-order flexibility, ellipses, disfluencies, and colloquial expressions, critical for accurately representing morphosyntactic variation in oral communication.5 By focussing on rural dialects beyond urban linguistic norms, COSER-UD enhances the diversity of linguistic data available to NLP and supports sociolinguistic preservation of under-represented varieties. The COSER-UD resource has supported the development of tasks such as Part-of-Speech (PoS) tagging, where models adapted to rural speech have been evaluated against a gold-standard dataset of over 13,000 sentences. Furthermore, the dataset has been used to test automatic speech recognition tools on dialectal Spanish audio [19].

Another noteworthy project is the East Cretan Treebank [20]. It was built from audio material of folkloric narratives collected from radio broadcasts, which were transcribed and annotated according to the UD framework. The treebank annotates dialect-specific features, such as euphonics and voicing phenomena, which are represented using dedicated tags and treated as distinct tokens in the annotated data. The East Cretan Treebank has been used for two main NLP tasks: PoS tagging and dependency parsing. Both tasks were addressed via fine-tuning of the Greek BERT model, using either exclusively the Eastern Cretan corpus data or in combination with data from the GUD, a treebank for Standard Modern Greek [21].

Focussing on Italian dialects, a treebank for Ligurian [22] is available in the UD repository; it is the first-ever digital corpus of that language, comprising 316 sentences and 6,928 tokens. Like Sicilian, Ligurian is a minority variety within the Italian linguistic landscape and faces many challenges due to its low-resourced status. The project shares similar goals with ours, aiming to promote research and NLP development for endangered dialects, with a focus on supporting language preservation. The study also addresses orthographic aspects of the Genoese variety of the Ligurian dialect. The treebank was used for parsing experiments, and although the performance of the parser is lower than that of parsers trained on high-resource languages, the results obtained are in line with or superior to those for other small-scale corpora, confirming annotation consistency.

4 Apart from sharing our resource, we also mitigated this during the annotation process by making it as objective as possible through the use of shared resources.

5 Additional information can be found at https://github.com/UniversalDependencies/UD_Spanish-COSER.

The UD repository also includes a small Neapolitan treebank that contains only 20 sentences, corresponding to 197 tokens and 199 syntactic words6.

As far as Sicilian is concerned, a particularly interesting project is the one carried out by Arba Sicula7 [23], which presents the first neural machine translator for the Sicilian dialect, based on a deep-learning transformer model fed with Sicilian sentences augmented using back-translation [24] to cope with the lack of resources. The results were evaluated using the BLEU score metric and yielded scores of 35.0 for English>Sicilian and 36.8 for Sicilian>English. The project was later expanded into a multilingual translation system by incorporating Italian, using techniques such as transfer learning.

6 The little information available about this resource can be found at https://github.com/UniversalDependencies/UD_Neapolitan-RB.
7 Arba Sicula is a non-profit international organisation that promotes the language and culture of Sicily https://arbasicula.org/.

2.4. Studying Sicilian

When approaching the creation of a treebank for a dialect, one must come to terms with the absence of an orthographic standard and norms to regulate its development. Sicilian, as well as other dialects, exhibits great variability, especially at the diachronic and diatopic levels. To deal with these critical issues, we adopted a combined approach, drawing on different grammars and dictionaries of Sicilian and comparing them. In general, the grammars proved to be very useful to explain several phenomena and guide their representation in Sicilian3bank. However, for a few especially challenging issues, those for which we found discordant opinions in the grammars, we provided solutions based on our intuition as native speakers and on consultation with a linguist expert on Sicilian. We carefully discussed them and kept track of our motivations in the annotation guidelines. For the purpose of lexical consultation and to handle different word forms, some online tools were used, such as Wikizziunariu8, Glosbe9, Napizia-Chiù dâ Palora10, Salviamo il siciliano11, plus several posts and blogs, demonstrating the importance of leveraging every available resource for dialectal language research and preservation. In addition, we consulted Nuovo vocabolario siciliano-italiano by Antonino Traina [25], selected for its breadth and accuracy, and various other dictionaries [26, 27, 28].

Several grammars from various time periods were also consulted [29, 30, 31, 32, 33, 34, 35, 36, 37], in order to gain a comprehensive understanding of the language, including its diachronic aspects. Consulting these works revealed significant variability in the treatment of linguistic phenomena. On the one hand, some grammars document some phenomena in detail, while in others they are completely absent. On the other hand, some phenomena are mentioned in all grammars but treated differently. It was therefore necessary to make a choice based on a critical comparison of the sources and data available to us. It should be noted that, as is common in the development of resources from scratch, some decisions were taken based on the limited set of examples currently included in the treebank. In future extensions of the resource, new comparisons with additional instances of the same or similar phenomena may prompt a revision of certain annotation choices.

8 Available here: https://scn.wiktionary.org/.
9 Available here: https://it.glosbe.com/.
10 Available here: https://www.napizia.com/cgi-bin/cchiu-da-palora.pl.
11 Available here: http://www.salviamoilsiciliano.com/come-si-dice/dizionario/.

3. Data Collection and Translation

In the development of a treebank, the first step to be addressed is the collection of texts to be later annotated. When the objective is a parallel treebank, texts must be made available in at least two languages. For the development of the first release of the Sicilian3bank12 we collected a group of open source texts available on the web (Sec. 3.1), and we applied to these texts a semi-automatic procedure to obtain their Italian version (Sec. 3.2).

3.1. Data Collection

The first of the challenges we encountered was finding suitable texts and sources for building the treebank. We constrained our search to literature, but we do not rule out including other genres in future extensions of the resource, e.g. the Sicilian pages of Wikipedia.13 We started our search based on the criterion of contemporaneity, that is, we sought modern texts reflecting language use consistent with present-day Sicilian. A useful source has been the Panzaredda website.14 From this source, we retrieved two of the three texts of our corpus: U cuntu di Purpu15 and Amara Sapi - Capìtulu Unu, U Zuccu16. These texts do not indicate the geographic origin of the authors or the dialectal variety, which prevents us from declaring with certainty the provenance of these texts. However, based on a lexical analysis of the terms used, it is likely that the first text comes from the Agrigento area and the second from the Catania area. The third text is a collection of 18 diatopic variants17 of the legend of Colapisci, a very well-known folktale in Sicily, narrating the story of a merman.

In the CoNLL-U file of Sicilian3bank, a comment line has been added at the beginning of each text, containing information regarding the text's diatopic variant and publication year. In the specific case of Colapisci, this information is provided at the beginning of each story.

12 We plan to release it in the next official UD treebank release.
13 Main page: https://scn.wikipedia.org/wiki/Pàggina_principali.
14 Available here: https://www.panzaredda.com/.
15 Available here: https://www.panzaredda.com/post/u-cuntu-di-purpu, written by Alesci Mistretta.
16 Available here: https://www.panzaredda.com/post/amara-sapi-capìtulu-unu-u-zuccu, by Goetia.
17 This paper focuses on 17 tales from the collection, excluding the 18th tale as it is entirely written in Italian. Some parts of the 17 collected tales contained Italian sentences, particularly in explanations of details or cross-references to similar versions. These sections were not included in the corpus.

3.2. Creating the Parallel Sicilian3bank

In this section, we present the challenges of LLMs in translating the selected texts from Sicilian into Italian, and the translation principles we applied for manually correcting the automatic translations.

3.2.1. GenAI for Automatic Sicilian>Italian Translation

To translate the Sicilian texts into Italian, we exploited LLMs to obtain a first version, which was then manually revised by Sicilian native speakers.18 We decided not to use machine translation-specific systems, because they usually do not cover dialects, and when they do, e.g. Google Translate, their performance is low, as verified in a first qualitative check on our texts. We preferred to use general-purpose LLMs, as this might be the start of a more systematic study of LLMs' abilities in translating low-resource languages. The machine-translated versions were produced in three different settings: giving the whole text in the prompt and asking for the translation, giving a sentence at a time with the whole text as context, and giving each sentence in isolation.19 These three versions have been produced for each of the three LLMs tested, i.e. Mistral 3 Small, LLaMA 3.3 70B and GPT-4o. These models were accessed using GPT@JRC, a tool that enables the use of genAI models in a safe and AI-Act compliant environment [2], and using standard settings (e.g. temperature 0.7). Although the three texts have different lengths (from less than 2k to more than 5k tokens), this did not influence the translation quality, though only qualitatively evaluated, especially in the setting asking for the translation of the whole text together, which is the one producing the best translations. This means that the degradation of performance reported in the literature about LLMs [38] (using automatic metrics such as BLEU) is not visible with our qualitative evaluation. In particular, reviewing the translations, it was observed that the best translations were generated by Mistral for the texts Amara Sapi and U Cuntu di Purpu, whereas for Colapisci, the most satisfactory version was the one produced by GPT-4o.20

We considered subjective qualitative evaluations of the overall quality of the translation, focussing on the relationship between fidelity to the original text and fluency of the translated text. Notably, despite not being specifically trained on dialect data, the LLMs demonstrated a remarkable ability to generate meaningful translations, producing a fluent and largely accurate output in both cases.21 However, some inaccuracies were observed: (i) Untranslated or roughly translated terms: nouns in particular are the most difficult to translate and required manual corrections and lexical consultations; (ii) Cultural and linguistic nuances not correctly identified and translated; (iii) Inconsistencies in subject-verb agreement, especially in translations produced by Mistral, and in the use of verb tense, which impaired temporal coherence; (iv) Omitted content: a few cases were observed where the models failed to translate parts of the text, producing incomplete results and requiring manual intervention.

18 The first authors of this paper.
19 We are aware that giving the whole text as a context per sentence is not efficient considering computation costs, but we tried this setting as we had only three texts.
20 It must be noted that safety filters were triggered in some cases, especially in the short story U Cuntu di Purpu, as it mentions a dead body. This hindered a full comparison of the models and settings.
21 Qualitatively better than translations obtained using Arba Sicula translator or Google Translate (Sicilian>English).

3.2.2. Translation Choices

We created fluent translations into Italian, opting for the variant that has the most grammatical features of the original, when possible, as in the PUD treebank [7]. Nevertheless, fully rendering the meaning of certain expressions in the translation has been challenging. We have indeed encountered words that did not have an equivalent in Italian, or had more than one meaning. For example, in U cuntu di Purpu, the nickname of the main character, 'Purpu'22, literally means 'octopus', but it is also commonly used as an offensive term for homosexual people. Nowadays, in the translation literature, it is commonly agreed that proper names are not translated, unless they carry a meaning or the target audience requires it. A thoroughly studied case is the translation of names in Harry Potter [39, 40], where localisation seems to be the most adopted technique. Since our primary aim is not translation, we decided to opt for a one-size-fits-all strategy instead of localisation, which involves an ad hoc solution for each different case: proper names were not translated, even when they carried meaning. However, in the document with the whole translated text, provided in the resource repository23, we added footnotes providing translation and further explanation where necessary. Other examples of proper names we met in the texts included in the Sicilian3bank, which are known in the translation literature as challenging since they are rich in social, geographical, or cultural references, are 'Liotru' (from U Cuntu di Purpu), literally translatable as 'elephant', but also bearing a reference to the city of Catania that any Sicilian reader would recognise; and 'Zuccarata' (from Amara Sapi), which is not only an affectionate epithet used to describe a person, but also the name of a traditional dessert typical of the region.

A different approach was taken with the toponyms that had a direct equivalent in Italian, which were indeed translated, e.g. Missina, Turri di Faru, and Napuli (from the text Colapisci), rendered respectively as Messina, Torre Faro, and Napoli. Finally, fictional toponyms, such as Cirasitu, found in the text Amara Sapi, were Italianised (here as Cirasito), although the Italian reader loses the reference to cherries.

22 See https://it.wiktionary.org/wiki/purpu for the translation of the term and this Quora thread https://it.quora.com/Perché-in-Sicilia-gli-omosessuali-vengono-chiamati-purpi for a discussion of its common use.
23 Available here: https://github.com/ElisaDiNuovo/Sicilian3bank.

4. Sicilian3bank in UD

In this section, we describe the annotation process and the challenges we faced in applying the UD format to our collection of texts described in Sec. 3. All the annotation choices are documented in the annotation guidelines, provided in the resource repository.

4.1. Parsing Sicilian in UD

There is no annotated resource or treebank in UD format for the Sicilian dialect. Based on the supposed similarity of Sicilian with Italian and the availability of UD treebanks for the latter, we decided to create a first draft of the Sicilian annotated data using the models for Italian, expecting to find a significant amount of errors in the output to be manually corrected. We selected the models trained on the ISDT [41] and POSTWITA [42] treebanks, which are the biggest resources for Italian available in the UD repository, and for which a performance evaluation on non-standard Italian texts is available [12]. A preliminary comparison of the outputs generated by UDPipe24 trained on them showed that the model based on ISDT outperforms that based on POSTWITA in dealing with Sicilian data. We therefore started the manual check and correction of the output of UDPipe trained on ISDT, feeding it with gold sentence segmentation.25

The three first authors, all native Sicilian speakers skilled in linguistics and computational linguistics, carried out this manual revision of the automatic annotation, leading to the first version of the Sicilian3bank. The tool used for the correction was Arborator [43].26 Each of the three texts was annotated by one annotator. The annotation was reviewed by a second annotator. Problematic phenomena were discussed by the three annotators together, and specific cases also with the rest of the authors.27 In Table 2 in Appendix A we report an example of the CoNLL-U file for a Sicilian sentence of the treebank, featuring a comment line with the Sicilian text, and the aligned Italian translation.

As for the parallel dataset composed of the translations into Italian of the Sicilian sentences (described in Sec. 3), the same parsing approach has been applied, thus creating the Sicilian-Italian parallel treebank. Nevertheless, considering that our main focus is on the Sicilian dialect, we decided to concentrate our current efforts on the creation of the parallel data (translation into Italian) and the manual correction of the annotation of the Sicilian data, carefully checking them both, and to leave the manual check of the annotation of the Italian parallel data of the Sicilian3bank as future work. This is further justified as automatic parsers for Italian are considered good enough, although some marginal phenomena are still consistently annotated wrongly [44, 12]. The next section is therefore focused on the analysis based on the Sicilian data only.

24 Available here: https://lindat.mf.cuni.cz/services/udpipe/.
25 For sentence segmentation we followed the VALICO-UD project, which does not split sentences on colons and treats direct speech as a single segment.

4.2. A Quantitative Analysis of the Sicilian Data

After the manual check and correction, the Sicilian resource annotated in CoNLL-U format consists of a total of 505 sentences and 11,709 tokens (Table 1). Each annotated sentence of each of the three texts presented in Sec. 3.1 includes a comment text line that reports the sentence in Sicilian dialect followed by a comment text line containing the translation into Italian. Following this, the UD annotation of the sentence is provided, organised in the ten columns typical of this format (Table 2).

Table 1: The distribution of sentences and tokens in the Sicilian data of the Sicilian3bank.

Text               Number of sentences   Number of tokens
Amara Sapi         246                   4723
Colapisci          179                   5092
U cuntu di Purpu   80                    1894
Total              505                   11709
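The file layout described in Sec. 4.2 (a comment line with the Sicilian sentence, a comment line with the Italian translation, then ten tab-separated columns per token) can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the comment keys `# text =` and `# translation =` are taken from the examples reported in the paper, while the head indices in the sample rows are an illustrative reading of example (1), not values copied from the released file. The counting rule (plain integer IDs only) is likewise an assumption and need not match the one used for Table 1.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str = ""          # Sicilian sentence ("# text = ...")
    translation: str = ""   # aligned Italian sentence ("# translation = ...")
    tokens: list = field(default_factory=list)  # one 10-column row per word


def read_conllu(lines):
    """Yield Sentence objects from an iterable of CoNLL-U lines."""
    sent = Sentence()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():                      # blank line closes a sentence
            if sent.tokens:
                yield sent
            sent = Sentence()
        elif line.startswith("# text ="):
            sent.text = line.split("=", 1)[1].strip()
        elif line.startswith("# translation ="):
            sent.translation = line.split("=", 1)[1].strip()
        elif line.startswith("#"):                # other comments (e.g. text metadata)
            continue
        else:
            cols = line.split("\t")
            if len(cols) != 10:
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            # Assumption: keep plain integer IDs only, skipping multiword-token
            # ranges such as "1-2" and empty nodes such as "1.1".
            if cols[0].isdigit():
                sent.tokens.append(cols)
    if sent.tokens:                               # file without trailing blank line
        yield sent


def corpus_size(sentences):
    """Return (number of sentences, number of word rows)."""
    return len(sentences), sum(len(s.tokens) for s in sentences)


# Illustrative rows for example (1); HEAD/DEPREL values are hypothetical.
ROWS = [
    ("1", "Stu", "chistu", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "Piscicola", "Piscicola", "PROPN", "_", "_", "4", "nsubj", "_", "_"),
    ("3", "era", "essiri", "AUX", "_", "_", "4", "cop", "_", "_"),
    ("4", "unu", "unu", "PRON", "_", "_", "0", "root", "_", "_"),
    ("5", "di", "di", "ADP", "_", "_", "7", "case", "_", "_"),
    ("6", "lu", "lu", "DET", "_", "_", "7", "det", "_", "_"),
    ("7", "Faru", "Faru", "PROPN", "_", "_", "4", "nmod", "_", "_"),
]
SAMPLE = "\n".join(
    ["# text = Stu Piscicola era unu di lu Faru",
     "# translation = Questo Piscicola era uno del Faro"]
    + ["\t".join(row) for row in ROWS]
) + "\n\n"

sentences = list(read_conllu(SAMPLE.splitlines()))
n_sentences, n_tokens = corpus_size(sentences)
```

Because every sentence carries both comment lines, the 1:1 alignment can be checked simply by verifying that `text` and `translation` are non-empty for each `Sentence`.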

26 We noticed that Arborator (https://arborator.ilpga.fr) allowed tokens to be split only into two, so in the case of verb + double clitic we had to further tokenise manually.
27 To further ensure annotation quality, an inter-annotator agreement score (Krippendorff's kappa) will be computed for future releases of the treebank.

areas [33].

(1) # text = Stu Piscicola era unu di lu Faru # translation = Questo Piscicola era uno del Faro

root det nsubj cop nmod case

det

Stu Piscicola era unu di lu Faru

DET PROPN AUX PRON ADP DET PROPN chistu Piscicola essiri unu di lu Faru this Piscicola was one of the Faro (2) # text = fòru ’mmarsamati propriamenti comu iddhi nisceru d’ ’u mari # translation = furono imbalsamate proprio quando uscirono dal mare advcl

A comparison of the annotation provided by UDPipe

with the manually corrected data enables us to evaluate the transfer domain abilities of the parsing models when applied on the Sicilian data. In Table 3 in Appendix A, we report the scores (precision, recall and F1 for UPOS, LAS and UAS) obtained by UDPipe models trained on ISDT and on PoSTWITA. These results confirm that the model based on ISDT outperforms the other one, but it must be observed that it may depend at least in part on the fact that the output of UDPipe trained on ISDT was the base for the manual correction. The table shows that the best performance based on ISDT can be referred to Colapisci (LAS F1 72.87) while the worst to Amara Sapi (LAS F1 59.80). An in-depth investigation of these results is beyond the scope of this paper, but will be addressed in our future work. However, we can qualitatively observe that the performance of the two models difers for some phenomena. For example, the model trained on PoSTWITA was more robust in annotating verbs containing double clitic pronouns.

4.3. Challenges in Dealing with the Sicilian Dialect

obj case The approach used for the generation of the annotated expl det det data, based on models available for Italian, has clearly brought out some characteristics and phenomena that PRsO’N asVcEiuRcBau D EiT lNàcOrUimNi AcDuP DlEaT NmOaUnuN diferentiate Sicilian from Italian. It is in dealing with si asciucari lu làcrima cu lu manu these phenomena that the parser has produced more oneself wiped the tears with the hand annotation errors, and it is on them that the work of Contracted articulated prepositions—graphically manual correction was mostly concentrated. marked by the circumflex accent [ 29, 30, 32]—were split

This section presents some choices we had to make to into two diferent tokens, as shown in Example 3. In deal with some features of the Sicilian texts considered. this way we show, for each articulated preposition, In particular, we focus on tokenisation (articulated prepo- the morphology attached to it, even in those cases in sitions), lemmatisation (orthographic variations of some which it is not apparently visible, as it is nevertheless pronouns reflecting suprasegmental traits), and syntac- part of its evolution and can be described by formal tic (focussing here on the reduplication phenomenon) rules. A diferent choice, such as not splitting it into two choices. tokens, would have highlighted the grammaticalisation of this particular phenomenon by not splitting it into 4.3.1. Tokenisation Issues two tokens. However, this choice might necessitate A particularly relevant phenomenon that emerged during the creation of a specific UPOS, which would hinder the annotation is that represented by articulated preposi- cross-language comparisons. tions, for which there has been, over time, a process of Similarly the forms nta and ntâ difer as the former is grammaticalisation that has determined their evolution. a simple preposition, equivalent to in of Italian, while the Generally, many prepositions that in Italian occur in a latter is the articulated preposition. Depending on the unified form have undergone a transformation in Sicilian, gender and number of the article, it can be rendered as ifrst passing through a disjunct form (Example 1) 28, until ntô (masculine singular), ntê (plural, both masculine and arriving at forms with elision (Example 2)29 [34, 31] and, feminine). in more recent times, with contraction (Example 3)30, al- It is worth noting in this regard that the Italian prepothough the disjunct form is still present, at least in some sition in can be rendered in Sicilian in various ways, such as in, ni, nni, nta [29]. 
The same is true for the Italian simple preposition da, which in Sicilian occurs in the forms di, ni and nni [29]. These diferent forms are relfected also in the corresponding articulated prepositions 28English translation: This Piscicola was one from Faro. 29English translation: [...] were embalmed just as they emerged from

the sea. 30English translation: He wiped away his tears with his hand. fòru ’mmarsamati propriamenti comu iddhi nisceru d’ ’u mari AUX VERB ADV SCONJ PRON VERB ADP DET NOUN essiri imbalsamari propriamenti comu iddi nesciri di lu mari were embalmed right as they came-out from the sea (3) # text = S’asciucau i làcrimi câ manu # translation = Si asciugò le lacrime colla mano root marknsubj obl obl case det (e.g. the Italian preposition nello, such as ntô, nô and nnô). shift or extension of meaning within the sentence. It is Please see Sec. 4.3.2, for our lemmatisation choices for a phenomenon still highly productive in contemporary these variants. Sicilian, as shown by Amenta through the analysis of

The complete scheme of the articulated prepositions a corpus from the Atlante Linguistico della Sicilia [46], system in Sicilian is presented in Table 4 in Appendix A. where these forms exhibit neither diachronic nor diastratic variation, thereby confirming the ongoing vitality 4.3.2. Lemmatisation Issues of this linguistic process. This phenomenon can involve the reduplication of a verb to form an adjective or a Concerning lemmatisation, as Sicilian does not have a noun; a noun to form an adjective or an adverb; and unified orthography—although recent eforts try to stan- other PoS [47]. This last pattern, the most frequent in dardise this [32]—in the texts considered there are dif- our texts, reveals several semantic implications, but freferent variants for the same forms, which try to render quently is used as a locational nominal modifier. In order diferent pronunciations. For example, in the considered to highlight the compound nature of this phenomenon texts there is no consistency in the transcription of the (in [45, p. 350], it is clearly stated that it is not possible Sicilian word meaning ‘no one’, nuddu, which is pro- to interpose any words between the two elements of the nounced reproducing a long voiced retroflex stop, but it reduplicated construct), we use the relation compound is transcribed sometimes as nuddu, other times as nu d. d.u, and the relation obl, in line with UD guidelines, as shown stressing the retroflex pronunciation. Other variants of in Example 631. In addition we added LOC=adv in the last the same word are nuddru, nuddhu. Since our aim is not column of the CoNLL-U file, as it is done in VALICO-UD, focused on phonetics, we lemmatised these occurrences to indicate that there is an adverbial locution. without any pronunciation marks, i.e. nuddu, and de- (6) # text = avìanu truvato campi campi cided not to uniform the orthographic rendering (i.e. the # translation = avevano trovato tra i campi form) of this word and similar cases, e.g. 
ci/cci and ni/nni, as shown in Examples 4a-4b and 5a-5b, respectively. aux root obl compound

(4a) # text = ci succidìu accussì LEMMA ci # translation = gli successe questo (this happened to him) avìanu truvatu campi campi (4b) # text = chi cci jemu a fari? LEMMA ci AUX VERB NOUN NOUN # translation = che ci andiamo a fare? (what are we going aviri truvari campu campu to do there?) had found ifelds ifelds (5a) # text = ni chiamavanu "l’Armali" LEMMA ni # translation = ci chiamavano "gli Animali" (they called us "the animals") 4.4. A Cross-Linguistic Analysis Example (5b) # text = Chi nni putìa sapiri iu? LEMMA ni In Sicilian, modal verbs—like the auxiliaries essiri (‘to # translation = Che ne potevo sapere io? (How could I be’) and aviri (‘to have’)—can serve two main functions: know about that?) they may appear independently with their own lexical

We applied the same principle to shortened oral vari- meaning, or they may function as support verbs, combinants of words, e.g. diri (‘to say’) or riri, both of which are ing with an infinitive (without a preposition) to convey abbreviated forms of diciri. All such variants have been specific modal values, such as: (i) ability/possibility → lemmatised using the extended lemma, such as diciri in putiri (‘can’); (ii) will/desire → vuliri (‘want’); (iii) obligaExample 8). tion/necessity → duviri (‘must’) or aviri a (‘have to’).

To summarise, the main aim of lemmatisation is to In modern Sicilian, particularly in spoken usage, the pereduce the sparseness of forms and their variants by riphrastic construction aviri a + infinitive is commonly reducing them to a common lemma, regardless of the employed to express modal meanings, especially obligacauses of this sparseness. Therefore, we have applied the tion, replacing the older verb duvìri found in Old Sicilsame strategy used in other resources where sparsity is ian [30] (see Example 7)32. Within this construction, the determined, for example, by the writing style of the users tense of aviri plays a central role in conveying modal (or by errors due to the writing device they use), as in values, whether epistemic or deontic. When aviri apPoSTWITA[42], to the lemmatisation of Sicilian3bank. pears in the past remote, its perfective aspect confers an epistemic meaning, indicating certainty about the event’s 4.3.3. Syntax Issues occurrence in the past. In contrast, when aviri is used in the present or imperfect—both imperfective tenses—the construction can express either an epistemic sense of probability or a deontic sense of obligation or necessity.

One of the cases in which we had to take a decision

about a syntactic phenomenon is reduplication, a typical and widespread phenomenon in the Sicilian dialect [45], which consists in the repetition of a word, resulting in a 31English translation: [...] they had found among the fields. 32English translation: I should listen to you much more often. In some cases, especially with the present indicative or imperfect subjunctive, an exhortative function may also emerge [48].

(7) # text = T’avissi a ’scutari cchiù assai # translation = ti dovrei ascoltare molto di più root expl

Annotating the resource in UD makes it possible to draw parallels with other languages, for example with the English have to construction, which is similarly used to express obligation and certainty [49, p. 210]. In the English UD treebanks, to is consistently annotated as a particle when used in this way (see Example 9a in Appendix A). We therefore decided to treat the element a, which is usually tagged as a preposition in our corpus, as a particle in this specific construction. In Italian, avere da can be used with the same meaning (see Example 9b in Appendix A), but da is not annotated as a particle. This might be due to historical reasons, to a different function of da in Italian compared with to in English, or to the intention of highlighting a less grammaticalised relation.
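The decision to tag a as a particle in the aviri a + infinitive construction can be sketched as a simple rule over annotated tokens. The snippet below is only an illustration under stated assumptions: the `Token` class and the `retag_modal_a` function are hypothetical, the lemmas and tags of the toy sentences are guesses rather than values taken from the treebank, and the rule requires strict adjacency, which real data may not respect.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    upos: str
    feats: str = "_"


def retag_modal_a(tokens):
    """Retag 'a' from ADP to PART when it sits between a form of aviri and an infinitive."""
    out = list(tokens)
    for i in range(1, len(out) - 1):
        prev, cur, nxt = out[i - 1], out[i], out[i + 1]
        is_infinitive = nxt.upos == "VERB" and "VerbForm=Inf" in nxt.feats.split("|")
        if prev.lemma == "aviri" and cur.form.lower() == "a" and cur.upos == "ADP" and is_infinitive:
            out[i] = replace(cur, upos="PART")
    return out


# Toy version of Example 7 (lemmas and tags are assumptions made for this sketch).
modal = [Token("T’", "ti", "PRON"), Token("avissi", "aviri", "AUX"),
         Token("a", "a", "ADP"), Token("’scutari", "ascutari", "VERB", "VerbForm=Inf"),
         Token("cchiù", "cchiù", "ADV"), Token("assai", "assai", "ADV")]
# Toy version of Example 4b, where 'a' does not follow aviri and is left unchanged.
other = [Token("chi", "chi", "PRON"), Token("cci", "ci", "PRON"),
         Token("jemu", "jiri", "VERB"), Token("a", "a", "ADP"),
         Token("fari", "fari", "VERB", "VerbForm=Inf"), Token("?", "?", "PUNCT")]

modal_tags = [t.upos for t in retag_modal_a(modal)]
other_tags = [t.upos for t in retag_modal_a(other)]
print(modal_tags)
print(other_tags)
```

A rule of this kind could serve as a consistency check after manual correction, flagging occurrences of a in this construction that are still tagged as prepositions.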

Another periphrastic construction found in the treebank texts is veniri + a + diciri (literally, in Italian, venire a dire), which can have the meaning of the Italian verb significare (‘to mean’). In such cases, we treated it in the same way as the previous one, as shown in Example 8 (footnote 33).

(8) # text = Chi veni a diri?
    # translation = Che significa?
    [Dependency tree; relation labels include: punct, root, obj]
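The tokenisation and lemmatisation conventions of Sec. 4.3 can be illustrated with a small sketch. It assumes that a contracted preposition is written as a CoNLL-U multiword-token range line followed by its two syntactic words, which is the standard CoNLL-U mechanism but is not confirmed here as the exact encoding used in the treebank. The tables are deliberately tiny: the entry for câ follows Example 3, the entry for ntô is an assumed split added for illustration, and the function names are hypothetical.

```python
import unicodedata


def nfc(text):
    """Normalise to NFC so that accented and dotted letters compare reliably."""
    return unicodedata.normalize("NFC", text)


# Contracted articulated prepositions and their parts.
# "câ" follows Example 3; the split of "ntô" is an assumption made for this sketch.
CONTRACTIONS = {nfc(k): v for k, v in {"câ": ("cu", "la"), "ntô": ("nta", "lu")}.items()}

# Orthographic and shortened variants mapped to a common lemma, as described in Sec. 4.3.2.
LEMMA_VARIANTS = {nfc(k): v for k, v in {
    "nuḍḍu": "nuddu", "nuddru": "nuddu", "nuddhu": "nuddu",
    "cci": "ci", "nni": "ni",
    "diri": "diciri", "riri": "diciri",
}.items()}


def tokenise(words):
    """Return (ID, FORM) rows, adding a range row before the parts of each contraction."""
    rows, next_id = [], 1
    for word in words:
        parts = CONTRACTIONS.get(nfc(word).lower())
        if parts:
            rows.append((f"{next_id}-{next_id + len(parts) - 1}", word))
            for part in parts:
                rows.append((str(next_id), part))
                next_id += 1
        else:
            rows.append((str(next_id), word))
            next_id += 1
    return rows


def normalise_lemma(form):
    """Map a known variant to its common lemma; other forms are only lower-cased."""
    key = nfc(form).lower()
    return LEMMA_VARIANTS.get(key, key)


rows = tokenise(["S’", "asciucau", "i", "làcrimi", "câ", "manu"])
lemmas = [normalise_lemma(w) for w in ["Nuḍḍu", "nuddhu", "cci", "nni", "riri", "mari"]]
for row in rows:
    print("\t".join(row))
print(lemmas)
```

The form column keeps the original spelling in both functions, in line with the choice of leaving orthographic variants untouched and reducing sparseness only at the lemma level.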

5. Conclusion and Future Work

We can create a world that sustains its languages [50]. Among the concrete actions we can take to achieve this goal is speaking and studying the original languages of our places.

This paper describes and discusses the issues involved in the development of the first release of the Sicilian3bank. We have encountered many challenges in dealing with a language that has never been treated before and that is, in addition, a dialect: it carries with it an uninterrupted history of oral transmission but has neither a standardised form of transcription nor a unified treatment of phenomena in grammars.

33 English translation: What does it mean?

The project we present here is therefore intended solely as a preliminary foundation and proposal, which requires substantial further work and numerous improvements. First, we plan to include more texts and to compute inter-annotator agreement, in order to verify the soundness of the guidelines. Second, we plan to enrich the corpus by introducing Italian glosses in the MISC column of the CoNLL-U file. In the current version, each sentence is accompanied by a fluent Italian translation in a comment line; we propose adding a literal word-for-word translation from Sicilian into Italian. Although this form of translation may result in grammatically incorrect or unnatural Italian, it would provide an almost word-by-word aligned parallel resource that mirrors the syntactic structure of the original Sicilian sentences and would facilitate studies of syntactic calques. Third, a future objective is to manually validate the automatic annotation generated with UDPipe for the aligned Italian resource as well. This step is needed to give the Italian parallel dataset the same quality we currently provide for the Sicilian annotated data. Fourth, another interesting enhancement would be to systematically include graphic accents on all verb lemmas, to make them easier to read, and to include the International Phonetic Alphabet transcription in the MISC column of the CoNLL-U file. This idea is motivated by the desire to turn the resource not only into a syntactic dataset but also into a tool to support language learning, scientific studies and the preservation of Sicilian. Finally, an aspect we would like to improve in the future concerns the translation of proper nouns. As already discussed, we encountered several challenges in translating these elements, which ultimately led us to the decision not to translate the proper nouns found in the texts at this stage. The focus of this work is the development of a Sicilian treebank, and although a deeper engagement with translation would certainly have added valuable insights, it would have diverted attention from the project’s primary objective. We therefore plan to revisit this aspect in a later phase of the project.
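The proposed enrichment of the MISC column can be sketched as follows. This is a minimal illustration, not the planned implementation: the function name `add_misc` is hypothetical, and the attribute name `Gloss` is an assumption, since no key has been fixed for the Italian glosses.

```python
def add_misc(token_line, key, value):
    """Add or replace a key=value attribute in the MISC column of a CoNLL-U token line."""
    cols = token_line.split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    pairs = [] if cols[9] == "_" else cols[9].split("|")
    pairs = [p for p in pairs if not p.startswith(key + "=")]  # drop an older value, if any
    pairs.append(f"{key}={value}")
    cols[9] = "|".join(pairs)
    return "\t".join(cols)


# Toy token lines based on Example 6 (head indices are assumptions made for this sketch).
plain = "\t".join(["2", "truvatu", "truvari", "VERB", "_", "_", "0", "root", "_", "_"])
with_loc = "\t".join(["4", "campi", "campu", "NOUN", "_", "_", "3", "compound", "_", "LOC=adv"])

glossed_plain = add_misc(plain, "Gloss", "trovato")
glossed_loc = add_misc(with_loc, "Gloss", "campi")
print(glossed_plain)
print(glossed_loc)
```

Existing attributes such as LOC=adv are preserved, so glosses could be added without touching the annotation already stored in the last column.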

Acknowledgment

We would like to express our gratitude to Giuseppe Domenico Muscianisi, PhD, from the University of Parma, for very kindly sharing with us his expertise, which was instrumental in resolving several of our questions and improving our knowledge of the literature on the Sicilian dialect.

Special thanks go to the JRC internal reviewers and to the CLiC-it 2025 anonymous reviewers for their valuable comments.

References

[1] S. Bird, D. Yibarbuk, Centering the Speech Community, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics - Volume 1: Long Papers, ACL, St. Julian’s, Malta, 2024, pp. 826–839. URL: https://aclanthology.org/2024.eacl-long.50/. doi:10.18653/v1/2024.eacl-long.50.
[2] B. De Longueville, I. Sanchez, S. Kazakova, S. Luoni, F. Zaro, K. Daskalaki, M. Inchingolo, The Proof is in the Eating: Lessons Learnt from One Year of Generative AI Adoption in a Science-for-Policy Organisation, AI 6 (2025) 128.
[3] A. Ramponi, Language Varieties of Italy: Technology Challenges and Opportunities, Transactions of the Association for Computational Linguistics 12 (2024) 19–38. doi:10.1162/tacl_a_00631.
[4] E. M. Bender, The #BenderRule: On Naming the Languages We Study and Why It Matters, The Gradient (2019).
[5] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational Linguistics 47 (2021) 255–308. URL: https://aclanthology.org/2021.cl-2.11/. doi:10.1162/coli_a_00402.
[6] H. Bunt, P. Merlo, J. Nivre (Eds.), Trends in Parsing Technology: Dependency Parsing, Domain Adaptation, and Deep Parsing, volume 43, Springer Science & Business Media, 2010.
[7] D. Zeman, M. Popel, M. Straka, J. Hajič, J. Nivre, F. Ginter, J. Luotolahti, S. Pyysalo, S. Petrov, M. Potthast, F. Tyers, E. Badmaeva, M. Gokirmak, A. Nedoluzhko, S. Cinkova, J. Hajic jr., J. Hlaváčová, V. Kettnerová, Z. Urešová, J. Kanerva, S. Ojala, A. Missilä, C. D. Manning, S. Schuster, S. Reddy, D. Taji, N. Habash, H. Leung, M.-C. de Marneffe, M. Sanguinetti, M. Simi, H. Kanayama, V. de Paiva, K. Droganova, H. Martínez Alonso, C. Çöltekin, U. Sulubacak, H. Uszkoreit, V. Macketanz, A. Burchardt, K. Harris, K. Marheinecke, G. Rehm, T. Kayadelen, M. Attia, A. Elkahky, Z. Yu, E. Pitler, S. Lertpradit, M. Mandl, J. Kirchner, H. F. Alcalde, J. Strnadová, E. Banerjee, R. Manurung, A. Stella, A. Shimada, S. Kwak, G. Mendonça, T. Lando, R. Nitisaroj, J. Li, CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, in: J. Hajič, D. Zeman (Eds.), Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 1–19.
[8] M. Sanguinetti, C. Bosco, Building the multilingual TUT parallel treebank, in: Proceedings of The Second Workshop on Annotation and Exploitation of Parallel Corpora, 2011, pp. 19–28.
[9] M. Sanguinetti, C. Bosco, PartTUT: The Turin University Parallel Treebank, in: R. Basili, C. Bosco, R. Delmonte, A. Moschitti, M. Simi (Eds.), Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Springer, 2015, pp. 51–69.
[10] R. Steinberger, M. Ebrahim, A. Poulis, M. Carrasco-Benitez, P. Schlüter, M. Przybyszewski, S. Gilbro, An overview of the European Union’s highly multilingual parallel corpora, Language Resources and Evaluation 48 (2014) 679–707.
[11] Y. Berzak, J. Kenney, C. Spadine, J. X. Wang, L. Lam, K. S. Mori, S. Garza, B. Katz, Universal Dependencies for Learner English, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2016.
[12] E. Di Nuovo, Introducing Valico-UD: A Parallel, Learner Italian Treebank for Language Learning Research, Pàtron, 2023.
[13] G. Berruto, Lingua, dialetto, diglossia, dilalia, in: G. Holtus, J. Kramer (Eds.), Romania et Slavia Adriatica. Festschrift für Zarko Muljačić, Buske, Hamburg, 1987, pp. 57–81.
[14] V. Blaschke, B. Kovačić, S. Peng, H. Schütze, B. Plank, MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 10921–10938. URL: https://aclanthology.org/2024.lrec-main.953/.
[15] S. Peng, Z. Sun, H. Shan, M. Kolm, V. Blaschke, E. Artemova, B. Plank, Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 14478–14493. URL: https://aclanthology.org/2024.lrec-main.1262/.
[16] X. M. Krückl, V. Blaschke, B. Plank, Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study, in: Y. Scherrer, T. Jauhiainen, N. Ljubešić, P. Nakov, J. Tiedemann, M. Zampieri (Eds.), Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Association for Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 128–146. URL: https://aclanthology.org/2025.vardial-1.10/.
[17] J. E. Bonilla, Spoken Spanish PoS tagging: gold standard dataset, Language Resources and Evaluation 59 (2025) 983–1012. doi:10.1007/s10579-024-09751-x.
[18] J. E. Bonilla, Development of the first spoken spanish treebank within the universal dependencies framework: A multi-regional approach, submitted.
[19] C. Adsuar Ávila, Automatic Speech Recognition in Dialectal Data (COSER), 2024. URL: https://audias.ii.uam.es/2024/10/30/automatic-speech-recognition-in-dialectal-data-coser/, Presentation at the AUDIAS-UAM Seminar, October 30, 2024.
[20] S. Vakirtzian, V. Stamou, Y. Kazos, S. Markantonatou, Dialectal treebanks and their relation with the standard variety: The case of East Cretan and Standard Modern Greek, in: R. Johansson, S. Stymne (Eds.), Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), University of Tartu Library, Tallinn, Estonia, 2025, pp. 776–784. URL: https://aclanthology.org/2025.nodalida-1.77/.
[21] P. Prokopidis, H. Papageorgiou, Experiments for Dependency Parsing of Greek, in: Y. Goldberg, Y. Marton, I. Rehbein, Y. Versley, Ö. Çetinoğlu, J. Tetreault (Eds.), Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, Dublin City University, Dublin, Ireland, 2014, pp. 90–96. URL: https://aclanthology.org/W14-6109/.
[22] S. Lusito, J. Maillard, A Universal Dependencies corpus for Ligurian, in: M. de Lhoneux, R. Tsarfaty (Eds.), Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), Association for Computational Linguistics, Sofia, Bulgaria, 2021, pp. 121–128. URL: https://aclanthology.org/2021.udw-1.10/.
[23] E. Wdowiak, Sicilian Translator: A Recipe for Low-Resource NMT, 2021. URL: https://arxiv.org/abs/2110.01938. arXiv:2110.01938.
[24] R. Sennrich, B. Haddow, A. Birch, Improving Neural Machine Translation Models with Monolingual Data, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 86–96. URL: https://aclanthology.org/P16-1009/. doi:10.18653/v1/P16-1009.
[25] A. Traina, Nuovo vocabolario siciliano-italiano, Palermo, Lauriel, 1868.
[26] G. Biundi, Vocabolario manuale completo siciliano-italiano seguito da un’appendice e da un elenco di nomi proprj siciliani: coll’aggiunta di un dizionario geografico in cui sono particolarmente descritti i nomi di città, fiumi, villaggi ed altri luoghi rimarchevoli della Sicilia: e corredato di una breve grammatica per gl’Italiani, Palermo, Carini, 1851.
[27] V. Mortillaro, Nuovo dizionario siciliano-italiano. Volume unico, Palermo, Stabilimento tipografico Lao, 1876.
[28] R. Rocca, Dizionario Siciliano-Italiano compilato su quello del Pasqualino con aggiunte e correzioni. Volume unico, Catania, Pietro Giunti Editore, 1839.
[29] A. Fortuna, Grammatica siciliana: Principali regole grammaticali, fonetiche e grafiche (comparate tra i vari dialetti siciliani), Caltanissetta, Terzo Millennio Editore, 2002.
[30] F. Giacalone, Prammatica siciliana. Storia della nostra lingua, proverbi, curiosità, modi di dire, consigli pratici per una corretta scrittura, Trapani, Edizioni Colorgrafica, 2009.
[31] A. Messina, Grammatica sistematica della lingua siciliana. Dall’ortoepia all’ortografia. Dall’analisi grammaticale all’analisi logica e del periodo. Con antologia esemplificativa dei poeti. Seconda edizione riveduta e ampliata con 30 chine sui mestieri d’una volta eseguite da Francesco Nania e poesie, Assessorato alle politiche scolastiche di Siracusa, 2007.
[32] S. Baiamonte, Documento per l’ortografia del siciliano. Documentu pi l’ortugrafìa dû sicilianu. II edizione, Cademia Siciliana, 2024.
[33] Lingua siciliana. Come scrivere in siciliano, n.d. URL: https://linguasiciliana.com/come-scrivere-in-siciliano/.
[34] M. Gorini, Ortografia Siculo-Calabra, 2017. URL: https://michelegorini.blogspot.com/2017/08/ortografia-siculo-calabra.html.
[35] G. Gerbino, N. Barone, Cenni di ortografia siciliana, Trapani, Jò A.L.A.S.D., 2011.
[36] V. Lumia, La Nostra Grammatica Siciliana, Trapani, Jò A.L.A.S.D., 2010.
[37] N. Russo, Corso di grammatica siciliana, Forum Lingua siciliana, 2003.
[38] L. Wang, Z. Du, W. Jiao, C. Lyu, J. Pang, L. Cui, K. Song, D. Wong, S. Shi, Z. Tu, Benchmarking and Improving Long-Text Translation with Large Language Models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7175–7187. URL: https://aclanthology.org/2024.findings-acl.428/. doi:10.18653/v1/2024.findings-acl.428.
[39] K. Brøndsted, C. Dollerup, The names in Harry Potter, Perspectives: Studies in Translatology 12 (2004) 56–72. doi:10.1080/0907676X.2004.9961490.
[40] C. Mastrangelo, Harry Potter in Translation: Comparison of Nine Romance Languages in the Translation of Proper Names in Harry Potter and the Philosopher’s Stone, Transletters. International Journal of Translation and Interpreting (2024) 1–28.
[41] C. Bosco, S. Montemagni, M. Simi, Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank, in: A. Pareja-Lora, M. Liakata, S. Dipper (Eds.), Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 61–69. URL: https://aclanthology.org/W13-2308/.
[42] M. Sanguinetti, C. Bosco, A. Lavelli, A. Mazzei, O. Antonelli, F. Tamburini, PoSTWITA-UD: an Italian Twitter Treebank in Universal Dependencies, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018, pp. 1768–1775. URL: https://aclanthology.org/L18-1279/.
[43] G. Guibon, M. Courtin, K. Gerdes, B. Guillaume, When Collaborative Treebank Curation Meets Graph Grammars, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 5291–5300.
[44] E. Di Nuovo, M. Sanguinetti, A. Mazzei, E. Corino, C. Bosco, VALICO-UD: Treebanking an Italian Learner Corpus in Universal Dependencies, IJCoL. Italian Journal of Computational Linguistics 8 (2022).
[45] L. Amenta, La reduplicazione sintattica in siciliano, Bollettino del Centro di studi filologici e linguistici siciliani 22 (2010) 345–358.
[46] G. Ruffino, Linee di discussione a ipotesi di lavoro per l’Atlante Linguistico della Sicilia, in: Actas do XIX Congreso Internacional de Lingüística e Filoloxia Románicas (1989), volume VIII, A Coruña, 1996, pp. 649–682.
[47] G. Todaro, F. Villoing, P. Gréa, Internal Localisation NN ADV Reduplication in Sicilian, in: Colloque International de Morphology, volume 22, Bordeaux, France, 2012.
[48] L. Amenta, Perifrasi verbali in siciliano, in: J. Garzonio (Ed.), Studi sui dialetti della Sicilia, Unipress, Padova, 2010, pp. 1–20.
[49] M. Swan, Practical English Usage 3rd edition, Oxford University Press, 2005.
[50] S. Bird, Beyond Technological Solutions: How we Create a World that Sustains its Languages, Linguapax Review 9 (2022) 167–173.

A. Appendix

# sent_id = 35
# text = Nuḍḍu di nuiautri sapìa soccu fari.
# translation = Nessuno di noi sapeva cosa fare.

ID  FORM      LEMMA     UPOS   XPOS  FEATS                                 HEAD  DEPREL  DEPS  MISC
1   Nuḍḍu     nuddu     PRON   PI    Gender=Masc|Number=Sing|PronType=Ind  4     nsubj   _     _
2   di        di        ADP    E     [...]                                 3     case    _     _
3   nuiautri  nuiautri  PRON   PE    [...]                                 1     nmod    _     _
4   sapìa     sapiri    VERB   V     [...]                                 0     root    _     _
5   soccu     soccu     PRON   PQ    Number=Sing|PronType=Int              6     obj     _     _
6   fari      fari      VERB   V     VerbForm=Inf                          4     ccomp   _     SpaceAfter=No
7   .         .         PUNCT  FS    [...]                                 4     punct   _     SpacesAfter=\r\n

(9a) [From EWT treebank]
     # sent_id = weblog-blogspot.com_alaindewitt_20060827093500_ENG_20060827_093500-0017
     # text = The wedding had to be postponed as family members fled the outbreak of the war, she said.

     FORM    The  wedding  had   [...]
     UPOS    DET  NOUN     VERB  [...]
     LEMMA   the  wedding  have  [...]
     [Dependency tree; relation labels include: root, det, nsubj, xcomp, mark, aux:pass]

(9b) [From ISDT treebank]
     # sent_id = isst_tanl-1497
     # text = ho da dire anche molte cose che avrei da dire contro me stesso

     FORM    ho     da   dire  [...]  molte  cose    [...]
     UPOS    VERB   ADP  VERB  [...]  DET    NOUN    [...]
     LEMMA   avere  da   dire  [...]  molto  cosa    [...]
     GLOSS   have   to   say   [...]  many   things  [...]
     [Dependency tree; relation labels include: root, xcomp, mark, advmod, det, obj, acl:relcl]

Declaration on Generative AI

During the preparation of this work, the author(s) used ChatGPT (OpenAI), Grammarly, Other, and GPT@JRC (an internal JRC testbed for LLMs. The model used there is an on-premises installation of LLaMa 3.3 70B) in order to: Paraphrase and reword, Improve writing style, Grammar and spelling check, and Citation management. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.