<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Arbuli sunnu: a Sicilian-Italian Parallel Treebank</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Caterina Maria Cappello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabrina D'Alì</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Guglielmetti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisa Di Nuovo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Bosco</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Commission, Joint Research Centre (JRC)</institution>
          ,
          <addr-line>Via Enrico Fermi, 2749, Ispra (VA), 21027</addr-line>
          ,
          <country country="IT">Italia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Torino</institution>
          ,
          <addr-line>Dipartimento di Informatica, Corso Svizzera 185, Torino, 10149</addr-line>
          ,
          <country country="IT">Italia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The Natural Language Processing (NLP) community has recently begun to engage with endangered languages and dialects which encode culturally different perspectives and local knowledge. Regardless of the usefulness and applicability of NLP tools for such languages, creating resources for dialects increases our knowledge of them, encourages the community to study them further, and supports the preservation of an important heritage. As part of this endeavour, we are focussing on Sicilian, a dialect spoken in Sicily, with a rich cultural history. Sicilian preservation is crucial to maintaining Southern Italy's linguistic diversity. In this paper, we present the first release of a novel treebank called Sicilian3bank. On the one hand, to improve the usability of this resource and provide access to non-Sicilian speakers, all sentences are linked to their translation into Italian, resulting in a 1:1 parallel resource. On the other hand, by applying the Universal Dependencies format, a widely used standard for the annotation of treebanks, we pave the way for data-driven cross-linguistic research. We hope that this work can serve as a basis for further linguistic research and computational applications for the Sicilian dialect.</p>
      </abstract>
      <kwd-group>
        <kwd>Sicilian</kwd>
        <kwd>treebank</kwd>
        <kwd>parallel texts</kwd>
        <kwd>Universal Dependencies</kwd>
        <kwd>translation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recent developments in generative Artificial Intelligence (genAI) have increasingly highlighted the importance of taking a larger variety of the world's languages into account. Developing tools and resources for a language has meaningful effects, the most important of which is a greater awareness of the underlying cultural heritage, an aspect that can be crucial for Large Language Models (LLMs) to achieve better performance in several tasks.</p>
      <p>According to [1], the world’s living languages can be categorised into 500 institutional languages and a further 6,500 local vernaculars, or oral languages. While institutional languages feature standardised orthographies and widespread literacy, the local languages include ancestral languages, with an unbroken history of oral transmission, and languages in danger of disappearing. Most Natural Language Processing (NLP) tools and resources developed so far target almost exclusively institutional languages, since the NLP community has begun to engage with local and endangered languages only in the last few years. Many challenges therefore remain to be addressed.</p>
      <p>In this paper, we focus on the first steps of developing a resource for one of the most widely spoken Italian dialects, which has a long tradition of studies in linguistics but has not been considered enough in NLP until now.<sup>1</sup> The aim of this study goes beyond introducing a specific novel resource and consists of starting a discussion on the challenges that can be encountered when NLP meets a dialect or a language without a standardised orthography and reference grammar.<sup>2</sup> Starting this discussion may be especially relevant in the context of the CLiC-it conference, since Italy is characterised by a linguistic diversity that is one of a kind in the European landscape, where diatopic variation implicitly encodes local knowledge, cultural traditions, artistic expressions, and the history of its speakers [3]. Compared with high-resource languages, which have extensive amounts of digital data and resources available, Italian dialects are under-resourced, lacking sufficient digital representation and support.</p>
      <p>The language observed in our study is Sicilian, as the title arbuli sunnu suggests, literally ‘trees (they) are’, showing a common predicate-initial structure. Sicilian is a vernacular language with local functions that include intergenerational knowledge transmission. The resource presented comprises diachronic and diatopic variants, enabling the analysis of linguistic changes in certain phenomena along these axes. In addition, it features orthographic variability due to the non-standardised transfer from oral to written form.</p>
      <p>In order to make the resource accessible to a wider audience, we provide the Italian translation in a 1:1 alignment setting. We decided to translate into Italian rather than English to underline the importance of mitigating the over-reliance on English [4].</p>
      <p>It is beyond the scope of this article to cover all the challenges associated with developing a treebank<sup>3</sup> for Sicilian; we focused mainly on the phenomena that have a major impact at the (morpho-)syntactic level. By showing some of the major challenges in the treebank annotation, we hope to pave the way for the future development of an expanded resource and for a discussion of the phenomena involved.</p>
      <p>The paper is organised as follows: the next section (Sec. 2) presents an overview of related work, followed by Sec. 3, which describes the data collection and annotation process for the first release of the Sicilian3bank, including the translation of Sicilian sentences into Italian to create a parallel corpus. In Sec. 4 we show the parallel architecture of the treebank and the annotation methodology we followed. This section also highlights the challenges we faced in developing a treebank for Sicilian. Finally, the last section (Sec. 5) presents conclusions and future work.</p>
      <p><sup>1</sup> This paper has been revised for English using the LLaMa 3.3 70B model through the GPT@JRC platform, an internal JRC testbed for LLMs [2]. All cited links were last accessed on the 12th of June 2025. Some of the reported examples have been shortened due to space constraints.</p>
      <p><sup>2</sup> In linguistics, the distinction between a language and a dialect is not always clear-cut and is often influenced by political and sociocultural factors rather than purely linguistic ones. A dialect is typically considered a regional or social variety of a language, but varieties such as Sicilian, which may lack official status or standardisation, are often labelled as dialects despite possessing many characteristics of a distinct language. For this reason, we use the terms language and dialect interchangeably when referring to Sicilian, to reflect its complex sociopolitical status.</p>
      <p><sup>3</sup> A treebank is a corpus enriched with (morpho-)syntactic annotations.</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>This section provides a brief introduction to the Universal Dependencies (UD) formalism and existing parallel treebanks in UD, followed by a discussion on language variation in NLP, with a focus on dialects and, eventually, on Sicilian.</p>
        <sec>
          <title>2.1. Universal Dependencies and Parallel Treebanks</title>
          <p>UD [5] is a framework for annotating morphology and syntax consistently across languages. Over the last decade, UD has become the de facto standard for treebanks. As its name suggests, UD represents syntax using dependency trees instead of constituency trees. This is because dependency trees are perceived as better suited to represent languages with free or flexible word order [6]. Furthermore, models using dependency representations have achieved promising results in many NLP tasks (e.g. in machine translation and information extraction) [6, p. 3].</p>
          <p>UD comprises treebanks in more than 100 languages, including low-resource languages (see Sec. 2.2 for a definition), e.g. Irish, Faroese, Uyghur. Among the UD treebanks, there are also parallel treebanks, i.e. treebanks that have been translated into other languages and subsequently annotated. The biggest effort in this respect has been made for the PUD treebank [7], which consists of 1,000 sentences in 18 languages (the majority originally in English). Translators were asked to opt for the translation that is fluent but also shares the most grammatical features with the original. Another example of parallel treebanks in UD is ParTUT [8], which contains sentences from different domains in English, Italian and French. In ParTUT, the alignment is not 1:1 for all the sentences [9], though the texts coming from a more formal register, i.e. those from the JRC-Acquis corpus [10], are almost all aligned 1:1.</p>
          <p>The 1:1 alignment has been considered especially helpful in learning contexts, and has therefore been applied in the English Second Language (ESL) [11] and VALICO-UD [12] treebanks, resources which include learner texts in English and Italian, respectively. We decided to follow their example for Sicilian3bank, as it might be used for language learning.</p>
        </sec>
        <sec>
          <title>2.2. Language Variation in NLP</title>
          <p>It is possible to distinguish two main groups of languages based on the availability of resources: high-resource languages and low-resource languages [4]. The former are languages (excluding sign languages) that have a large collection of machine-readable texts or, at the very least, a solid foundation upon which to build corpora, treebanks, and similar linguistic resources [4]. These include English, Mandarin Chinese, Arabic, and French, as well as Portuguese, Italian, Dutch, Standard Arabic, and Czech to a somewhat lesser but still significant extent [4]. Many languages, particularly local varieties and dialects, are at risk of disappearing in a relatively short time due to the lack of attention and resources they receive.</p>
          <p>In the European context, standard languages exhibit notable diatopic variation [3]. Failing to prioritise research on language variation in the field of NLP would mean losing not only the languages as systems of communication, but also the identities, social values, and heritage of the societies they represent. It is not only a matter of increasing efforts towards these languages, but of doing so with an appropriate approach [3]. A shared goal should be established, and knowledge must be made accessible to all and subsequently disseminated beyond the community itself through engagement initiatives and the promotion of active participation. Low-resource and endangered languages should be addressed with a novel approach based on respect, cultural awareness, and sensitivity to the wishes of their speakers.</p>
        </sec>
        <sec>
          <title>2.3. Dialects in NLP</title>
          <p>Focussing now specifically on dialects, it is important to note that their marginalisation is not a phenomenon exclusive to the field of NLP. A negative connotation of dialects is often rooted in complex historical, social, and political dynamics. For example, in Italy, regional varieties, dialects, and other non-standard linguistic forms often coexist with the standard language in a situation known as dilalìa [13]: there is no rigid compartmentalisation of the languages, as happens in diglossìa, but Italian is still preferred in formal and high-prestige domains, and dialects in informal, everyday, or familial interactions. The significant linguistic loss experienced by Sicilian and other Italian dialects can also be attributed to the Fascist dictatorship, which aimed to achieve linguistic unification by suppressing regional language varieties and all that was perceived as foreign. Furthermore, the Italian language was instrumental in constructing national unity, serving as a symbol of collective identity at the expense of non-standard varieties, which were increasingly marginalised in both institutional and public domains [3].</p>
          <p>One notable effort to address dialects and local languages is the MaiBaam project, a multi-dialectal Bavarian UD treebank [14]. It represents the first UD treebank for the Bavarian language, a West Germanic dialect spoken in southern Germany, Austria, and northern Italy (South Tyrol). The major challenges encountered by the MaiBaam project authors, which are common issues within this field, are the difficulty of collecting texts and of finding native-speaking annotators. While we are facing the former challenge, we did not encounter the latter, as the majority of our team members are native speakers of Sicilian. Nevertheless, strong linguistic knowledge of the dialect being worked on remains necessary, a requirement that is rarely met, given that dialects are seldom studied actively and are instead acquired through everyday use. The solution adopted by the MaiBaam group to address this issue is making their work publicly available, which enables them to engage with the population and collect contributions from the community.<sup>4</sup> The Bavarian dialect is also represented in tasks such as Named Entity Recognition (NER) and Dialect Identification (DID), thanks to BarNER, a medium-sized corpus collecting Wikipedia and tweet data [15]. The authors in [16] show how such resources can be effectively utilised in NLP.</p>
          <p>A similar initiative is the COSER-UD treebank [17], the first syntactically annotated corpus of spoken peninsular rural Spanish distributed within the UD framework [18]. The treebank addresses features such as word-order flexibility, ellipses, disfluencies, and colloquial expressions, which are critical for accurately representing morphosyntactic variation in oral communication.<sup>5</sup> By focussing on rural dialects beyond urban linguistic norms, COSER-UD enhances the diversity of linguistic data available to NLP and supports the sociolinguistic preservation of under-represented varieties. The COSER-UD resource has supported the development of tasks such as Part-of-Speech (PoS) tagging, where models adapted to rural speech have been evaluated against a gold-standard dataset of over 13,000 sentences. Furthermore, the dataset has been used to test automatic speech recognition tools on dialectal Spanish audio [19].</p>
          <p>Another noteworthy project is the East Cretan Treebank [20]. It was built from audio material of folkloric narratives collected from radio broadcasts, which were transcribed and annotated according to the UD framework. The treebank annotates dialect-specific features, such as euphonics and voicing phenomena, which are represented using dedicated tags and treated as distinct tokens in the annotated data. The East Cretan Treebank has been used for two main NLP tasks: PoS tagging and dependency parsing. Both tasks were addressed via fine-tuning of the Greek BERT model, using either exclusively the Eastern Cretan corpus data or a combination of it with data from the GUD, a treebank for Standard Modern Greek [21].</p>
          <p>Focussing on Italian dialects, a treebank for Ligurian [22] is available in the UD repository; it is the first-ever digital corpus of that language, comprising 316 sentences and 6,928 tokens. Like Sicilian, Ligurian is a minority variety within the Italian linguistic landscape and faces many challenges due to its low-resourced status. The project shares similar goals with ours, aiming to promote research and NLP development for endangered dialects, with a focus on supporting language preservation. The study also addresses orthographic aspects of the Genoese variety of the Ligurian dialect. The treebank was used for parsing experiments: although the performance of the parser is lower than that of parsers trained on high-resourced languages, the results obtained are in line with or superior to those of other small-scale corpora, confirming annotation consistency.</p>
        </sec>
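        <p>Treebank sizes such as those reported above are usually counted directly from the CoNLL-U files. The following is a minimal sketch of such a count, not the procedure used by the authors of any of the cited treebanks: the function name conllu_counts, the helper row, and the Italian sample sentence are illustrative assumptions. It follows the CoNLL-U convention that a multiword token (an ID range such as 2-3) is one surface token covering several syntactic words, which is why token and syntactic-word counts can differ.</p>
        <preformat>
```python
def conllu_counts(text):
    """Return (sentences, tokens, syntactic_words) for a CoNLL-U string."""
    sentences = tokens = words = 0
    in_sentence = False
    covered_until = 0  # last word ID covered by the current multiword token
    for line in text.splitlines():
        line = line.rstrip()
        if not line:  # a blank line ends the current sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):  # comment line (sentence text, metadata)
            continue
        if not in_sentence:
            in_sentence = True
            sentences += 1
        token_id = line.split("\t")[0]
        if "-" in token_id:  # multiword token range, e.g. 2-3
            tokens += 1
            covered_until = int(token_id.split("-")[1])
        elif "." in token_id:  # empty node, counted neither as token nor word
            continue
        else:  # ordinary syntactic word
            words += 1
            if int(token_id) > covered_until:
                tokens += 1  # not part of a multiword token
    return sentences, tokens, words


def row(token_id, form):
    """Build a 10-column CoNLL-U row with placeholder annotation."""
    return "\t".join([token_id, form] + ["_"] * 8)


# Illustrative Italian sentence: "al" is a multiword token (a + il).
SAMPLE = "\n".join([
    "# text = Vado al mare.",
    row("1", "Vado"),
    row("2-3", "al"),
    row("2", "a"),
    row("3", "il"),
    row("4", "mare"),
    row("5", "."),
]) + "\n\n"

print(conllu_counts(SAMPLE))
```
        </preformat>
        <p>For the sample sentence the sketch reports one sentence, four tokens and five syntactic words, since the contraction counts once as a token and twice as a word.</p>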
      </sec>
      <sec id="sec-1-2">
        <p><sup>4</sup> Apart from sharing our resource, we also mitigated this issue during the annotation process, making it as objective as possible by relying on shared resources.</p>
      </sec>
      <sec id="sec-1-3">
        <title>5Additional information can be found at https://github.com/</title>
        <p>UniversalDependencies/UD_Spanish-COSER.</p>
        <p>The UD repository also includes a small Neapolitan significant variability in the treatment of linguistic
phetreebank that contains only 20 sentences, corresponding nomena. On the one hand, some grammars document
to 197 tokens and 199 syntactic words6. some phenomena in detail, while in others they are
com</p>
        <p>As far as Sicilian is concerned, a particularly interest- pletely absent. On the other hand, some phenomena are
ing project is the one carried out by Arba Sicula7 [23], mentioned in all grammars but treated diferently. It was
which presents the first neural machine translator for therefore necessary to make a choice based on a critical
the Sicilian dialect based on a deep-learning transformer comparison of the sources and data available to us. It
model fed with Sicilian sentences augmented using back- should be noted that, as it is common in the development
translation [24] to cope with the lack of resources. The of resources from scratch, some decisions were taken
results were evaluated using the BLEU score metric and based on the limited set of examples currently included
yielded scores of 35.0 for English&gt;Sicilian and 36.8 for in the treebank. In future extensions of the resource,
Sicilian&gt;English. The project was later expanded into a new comparisons with additional instances of the same
multilingual translation system by incorporating Italian, or similar phenomena may prompt a revision of certain
using techniques such as transfer learning. annotation choices.</p>
        <sec id="sec-1-3-1">
          <title>2.4. Studying Sicilian</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Data Collection and Translation</title>
      <sec id="sec-2-1">
        <title>When approaching the creation of a treebank for a dialect,</title>
        <p>one must come to terms with the absence of an ortho- In the development of a treebank, the first step to be
graphic standard and norms to regulate its development. addressed is the collection of texts to be later annotated.
Sicilian, as well as other dialects, exhibits great variabil- When the objective is a parallel treebank, texts must be
ity, especially at the diachronic and diatopic levels. To made available in at least two languages. For the
developdeal with these critical issues, we adopted a combined ap- ment of the first release of the Sicilian3bank12 we
colproach, drawing on diferent grammars and dictionaries lected a group of open source texts available on the web
of Sicilian and comparing them. In general, the grammars (sec. 3.1), and we applied to these texts a semi-automatic
proved to be very useful to explain several phenomena procedure to obtain their Italian version (sec. 3.2).
and guide their representation in Sicilian3bank.
However, for a few especially challenging issues, those for 3.1. Data Collection
which we found a discordance of opinions reported in the
grammars, we provided solutions based on our intuition The first of the challenges we encountered was finding
of native speakers and consulting a linguist expert on suitable texts and sources for building the treebank. We
Sicilian. We carefully discussed them and kept track of constrained our search to literature, but we do not
exouFromrothtievaptuiornpsosine othf eleaxnicnaoltcaotinosnugltuaitdioenlinaensd. to handle rcelusoduertcoei,nec.glu. dinecoltuhdeirnggetnhreeSsiicnilfiuantupreageenslaorfgWemikeinpteodfiath.1e3
diferent word forms, some online tools were used, such We started our search based on the criterion of
contemas Wikizziunariu8, Glosbe9, Napizia-Chiù dâ Palora 10, poraneity, that is, we sought texts modern and reflecting
sStarlavtiianmgot hilesiicmilpiaonrota11n,cpeluosf sleovceiarlapgoinsgtseavnedryblaovgasi,ldabemleorne-- luasnegfuulasgoeuurcsee hcaosnbseisetnenPtanwziathredpdreaswenetb-sdiatey.1S4iFcriloiamn.thAis
source for dialectal language research and preservation. source, we retrieved two of the three texts of our
coriItnalaiadndoitiboynA,wnteoncoionTsuraltienda N[2u5o]v,osevleocctaebdoflaorrioitssibcrieliaadntoh- pUuZs:uUccucu16n.tuThdiesPeurtpexut1s5 adnodnAomtianrdaicSaatpei t-hCeagpeìtouglruaUphniuc,
and accuracy, and various other dictionaries [26, 27, 28]. origin of the authors or the dialectal variety, which
pre</p>
        <p>Several grammars from various time periods were also vents us from declaring with certainty the provenance
consulted [29, 30, 31, 32, 33, 34, 35, 36, 37], in order to of these texts. However, based on a lexical analysis of
gain a comprehensive understanding of the language also the terms used, it is likely that the first text comes from
on diachronic aspect. Consulting these works revealed TthheeAthgririgdetnetxot aisreaacaonldletchtieosnecoofn1d8 fdrioamtotphice vCaartiaannitasa17reoaf.
6The few information about this resource can be found at https:
//github.com/UniversalDependencies/UD_Neapolitan-RB. 12We plan to release it in the next oficial UD treebank release.
7Arba Sicula is a non-profit international organisation that promotes 13Main page: https://scn.wikipedia.org/wiki/PÃăggina_principali.
the language and culture of Sicily https://arbasicula.org/. 14Available here: https://www.panzaredda.com/.
8Available here: https://scn.wiktionary.org/. 15Available here: https://www.panzaredda.com/post/
9Available here: https://it.glosbe.com/. u-cuntu-di-purpu, written by Alesci Mistretta.
10Available here: https://www.napizia.com/cgi-bin/cchiu-da-palora. 16Available here: https://www.panzaredda.com/post/
pl. amara-sapi-capÃňtulu-unu-u-zuccu, by Goetia.
11Available here: http://www.salviamoilsiciliano.com/come-si-dice/ 17This paper focuses on 17 tales from the collection, excluding the
dizionario/. 18th tale as it is entirely written in Italian. Some parts of the 17
the legend of Colapisci, a very well-known folktale in whereas for Colapisci, the most satisfactory version was
Sicily, narrating the story of a merman. the one produced by GPT-4o.20</p>
        <p>In the CoNLL-U file of Sicilian3bank, a comment line We considered subjective qualitative evaluations of the
has been added at the beginning of each text, containing overall quality of the translation, focussing on the
relainformation regarding the text’s diatopic variant and tionship between fidelity to the original text and fluency
publication year. In the specific case of Colapisci, this of the translated text. Notably, despite not being
specifiinformation is provided at the beginning of each story. cally trained on dialect data, the LLMs demonstrated a
remarkable ability to generate meaningful translations,
3.2. Creating the Parallel Sicilian3bank producing a fluent and largely accurate output in both
cases.21 However, some inaccuracies regarded: (i)
UnIn this section, we present the challenges of LLMs in translated or roughly translated terms—nouns in
particutranslating the selected texts from Sicilian into Italian, lar are the most dificult to translate and required manual
and the translation principles we applied for manually corrections and lexical consultations; (ii) Cultural and
correcting the automatic translations. linguistic nuances not correctly identified and translated;
(iii) Inconsistencies in subject-verb agreement, especially
3.2.1. GenAI for Automatic Sicilian&gt;Italian in translations produced by Mistral, and the use of verb
Translation tense, which impaired temporal coherence; (iv) Omitted
content—a few cases were observed where the models
failed to translate parts of the text, producing incomplete
results and requiring manual intervention.</p>
      </sec>
      <sec id="sec-2-2">
        <title>To translate the Sicilian texts into Italian, we exploited</title>
        <p>LLMs to obtain a first version, which was then
manually revised by Sicilian native speakers.18 We decided
not to use machine translation-specific systems, because
they usually do not cover dialects, and when they do, e.g. 3.2.2. Translation Choices
Google translate, their performance is low, as verified We created fluent translations into Italian, opting for
at a first qualitative check on our texts. We preferred to the variant that has the most grammatical features of
use general-purpose LLMs, as this might be the start of a the original, when possible, as in the PUD treebank [7].
more systematic study on LLMs abilities with translation Nevertheless, fully rendering the meaning of certain
exof low-resource languages. The machine translated ver- pressions in the translation has been challenging. We
sions were produced in three diferent settings, giving the have indeed encountered words that did not have an
whole text in the prompt and asking for the translation, equivalent in Italian, or had one or more meanings. For
giving a sentence at a time with the whole text as con- example, in U cuntu di Purpu, the nickname of the main
text and giving each sentence in isolation.19 These three character, ‘Purpu’22, literally means ‘octopus’, but it is
versions have been produced for each of the three LLMs commonly used also to ofensively indicate
homosextested, i.e. Mistral 3 Small, LLaMA 3.3 70B and GPT-4o ual people. Nowadays, in the translation literature, it is
models. These models were accessed using GPT@JRC, a commonly agreed that proper names are not translated,
tool that enables the use of genAI models in a safe and unless they carry a meaning or the target audience
reAI-Act compliant environment [2], and using standard quires it. A thoroughly studied case is the translation of
settings (e.g. temperature 0.7). Despite the three texts names in Harry Potter [39, 40], where localisation seems
having diferent lengths (from less than 2k to more than to be the most adopted technique. Since our primary aim
5k tokens), this did not influence the translation quality, is not translation, we decided to opt for a one-size-fits-all
though only qualitatively evaluated, especially in the set- strategy instead of localisation, which involves an ad hoc
ting asking for the translation of the whole text together, solution for each diferent case: proper names were not
which is the one producing the best translations. This translated, even when they carried meaning. However,
means that the degradation of performance reported in in the document with the whole translated text, provided
the literature about LLMs [38] (using automatic metrics in the resource repository23, we added footnotes
providsuch as BLEU) is not visible with our qualitative
evaluation. In particular, reviewing the translations, it was
observed that the best translations were generated by
Mistral for the texts Amara Sapi and U Cuntu di Purpu,
20It must be noted that safety filters were triggered in some cases,
especially in the short story U Cuntu di Purpu, as it is mentioned a
dead body. This hindered the possibility for a full comparisons of
the models and settings.
collected tales contained Italian sentences, particularly in expla- 21Qualitatively better than translations obtained using Arba Sicula
nations of details or cross-references to similar versions. These translator or Google Translate (Sicilian&gt;English).
sections were not included in the corpus. 22See https://it.wiktionary.org/wiki/purpu for the translation
18The first authors of this paper. of the term and this Quora thread https://it.quora.com/
19We are aware that giving the whole text as a context per sentence Perché-in-Sicilia-gli-omosessuali-vengono-chiamati-purpi for a
is not eficient considering computation costs, but we tried this discussion of its common use.</p>
        <p>setting as we had only three texts. 23Available here: https://github.com/ElisaDiNuovo/Sicilian3bank.
ing translation and further explanation where necessary. leading to the first version of the Sicilian3bank. The
Other examples of proper names we met in the texts in- tool used for the correction was Arborator [43].26 Each
cluded in the Sicilian3bank—which are known in the of the three texts was annotated by one annotator. The
translation literature as challenging since rich in social, annotation was reviewed by a second annotator.
Problemgeographical, or cultural references—are ‘Liotru’ (from U atic phenomena were discussed by the three annotators
Cuntu di Purpu), literally translatable as ‘elephant’, but also bearing a reference to the city of Catania that any Sicilian reader would recognise; ‘Zuccarata’ (from Amara Sapi), which is not only an affectionate epithet used to describe a person, but also the name of a traditional dessert typical of the region.</p>
        <p>A different approach was taken with the toponyms that have a direct equivalent in Italian, which were translated, e.g. Missina, Turri di Faru, and Napuli (from the text Colapisci), rendered respectively as Messina, Torre Faro, and Napoli. Finally, fictional toponyms such as Cirasitu, found in the text Amara Sapi, were Italianised (Cirasito), although the Italian reader loses the reference to cherries.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4. Sicilian3bank in UD</title>
        <p>In this section, we describe the annotation process and the challenges we faced in applying the UD format to our collection of texts described in Sec. 3. All the annotation choices are documented in the annotation guidelines, provided in the resource repository.</p>
        <p>together, and in specific cases also with the rest of the authors.27 In Table 2 in Appendix A we report an example of the CoNLL-U file for a Sicilian sentence of the treebank, featuring a comment line with the Sicilian text and the aligned Italian translation.</p>
        <p>When it comes to the parallel dataset composed of the translations into Italian of the Sicilian sentences (described in Sec. 3), the same parsing approach has been applied, thus creating the Sicilian-Italian parallel treebank. Nevertheless, considering that our main focus is on the Sicilian dialect, we decided to concentrate our current efforts on the creation of the parallel data (translation into Italian) and the manual correction of the annotation of the Sicilian data, carefully checking them both, and to leave the manual check of the annotation of the Italian parallel data of the Sicilian3bank as future work. This is further justified by the fact that automatic parsers for Italian are considered good enough, although some marginal phenomena are still consistently annotated incorrectly [44, 12]. The next section is therefore focused on the analysis based on the Sicilian data only.</p>
        <sec id="sec-2-3-1">
          <title>4.2. A Quantitative Analysis of the Sicilian Data</title>
        </sec>
        <sec id="sec-2-3-3">
          <title>4.1. Parsing Sicilian in UD</title>
          <p>There is no annotated resource or treebank in UD format for the Sicilian dialect. Based on the supposed similarity of Sicilian with Italian and the availability of UD treebanks for the latter, we decided to create a first draft of the Sicilian annotated data using the models for Italian, expecting to find a significant number of errors in the output to be manually corrected. We selected the models trained on the ISDT [41] and PoSTWITA [42] treebanks, which are the biggest resources for Italian available in the UD repository and for which a performance evaluation on non-standard Italian texts is available (i.e. [12]). A preliminary comparison of the outputs generated by UDPipe24 trained on them showed that the model based on ISDT outperforms that based on PoSTWITA in dealing with Sicilian data. We therefore started the manual check and correction of the output of UDPipe trained on ISDT, feeding it with gold sentence segmentation.25</p>
          <p>After the manual check and correction, the Sicilian resource annotated in CoNLL-U format consists of a total of 505 sentences and 11,709 tokens (Table 1). Each annotated sentence of each of the three texts presented in Sec. 3.1 includes a comment line that reports the sentence in Sicilian dialect, followed by a comment line containing the translation into Italian. Following this, the UD annotation of the sentence is provided, organised in the ten columns typical of this format (Table 2).</p>
          <table-wrap>
            <label>Table 1</label>
            <caption><p>The distribution of sentences and tokens in the Sicilian data of the Sicilian3bank.</p></caption>
            <table>
              <thead>
                <tr><th>Text</th><th>Number of sentences</th><th>Number of tokens</th></tr>
              </thead>
              <tbody>
                <tr><td>Amara Sapi</td><td>246</td><td>4723</td></tr>
                <tr><td>Colapisci</td><td>179</td><td>5092</td></tr>
                <tr><td>U cuntu di Purpu</td><td>80</td><td>1894</td></tr>
                <tr><td>Total</td><td>505</td><td>11709</td></tr>
              </tbody>
            </table>
          </table-wrap>
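          <p>The file layout described above (two comment lines followed by ten tab-separated columns) can be read with a few lines of code. The following is only a minimal sketch, not the tooling used for the treebank: the function name parse_sentence is hypothetical, the sample is sentence 35 as reported in Table 2 (Appendix A), and the spelling of its first form (Nuḍḍu) is our reading of the printed example.</p>
          <preformat>
```python
# Column names follow the CoNLL-U specification (ten tab-separated fields).
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

# Sentence 35 as reported in Table 2 (Appendix A); first form is our reading.
ROWS = [
    ("1", "Nuḍḍu", "nuddu", "PRON", "PI",
     "Gender=Masc|Number=Sing|PronType=Ind", "4", "nsubj", "_", "_"),
    ("2", "di", "di", "ADP", "E", "_", "3", "case", "_", "_"),
    ("3", "nuiautri", "nuiautri", "PRON", "PE",
     "Number=Plur|Person=1|PronType=Prs", "1", "nmod", "_", "_"),
    ("4", "sapìa", "sapiri", "VERB", "V",
     "Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin", "0", "root", "_", "_"),
    ("5", "soccu", "soccu", "PRON", "PQ",
     "Number=Sing|PronType=Int", "6", "obj", "_", "_"),
    ("6", "fari", "fari", "VERB", "V", "VerbForm=Inf", "4", "ccomp", "_", "SpaceAfter=No"),
    ("7", ".", ".", "PUNCT", "FS", "_", "4", "punct", "_", "SpacesAfter=\\r\\n"),
]
SAMPLE = "\n".join(
    ["# sent_id = 35",
     "# text = Nuḍḍu di nuiautri sapìa soccu fari.",
     "# translation = Nessuno di noi sapeva cosa fare."]
    + ["\t".join(row) for row in ROWS]
)


def parse_sentence(block):
    """Split one CoNLL-U sentence block into comment metadata and token rows."""
    meta, tokens = {}, []
    for line in block.splitlines():
        if line.startswith("#"):
            # Comment lines have the shape "# key = value".
            key, _, value = line[1:].partition("=")
            meta[key.strip()] = value.strip()
        elif line.strip():
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                raise ValueError("expected 10 tab-separated columns")
            tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    return meta, tokens


meta, tokens = parse_sentence(SAMPLE)
print(meta["text"])
print(meta["translation"])
print([(t["form"], t["upos"], t["head"], t["deprel"]) for t in tokens])
```
</preformat>
          <p>Keeping the Sicilian text and its Italian translation as comment metadata means that any standard CoNLL-U reader can ignore them, while a parallel-corpus tool can pair the two strings by sentence.</p>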
          <p>The first three authors, all native Sicilian speakers skilled in linguistics and computational linguistics, carried out this manual revision of the automatic annotation</p>
          <p>24 Available here: https://lindat.mf.cuni.cz/services/udpipe/.</p>
          <p>25 For sentence segmentation we followed the VALICO-UD project, which does not split sentences on colons and treats direct speech as a single segment.</p>
          <p>26 We noticed that Arborator (https://arborator.ilpga.fr) allowed tokens to be split only into two, so in the case of verb + double clitic we had to tokenise further manually.</p>
          <p>27 To further ensure annotation quality, an inter-annotator agreement score (Krippendorff’s kappa) will be computed for future releases of the treebank.
areas [33].</p>
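          <p>(1) # text = Stu Piscicola era unu di lu Faru
# translation = Questo Piscicola era uno del Faro</p>
          <table-wrap>
            <table>
              <tbody>
                <tr><th>Form</th><td>Stu</td><td>Piscicola</td><td>era</td><td>unu</td><td>di</td><td>lu</td><td>Faru</td></tr>
                <tr><th>UPOS</th><td>DET</td><td>PROPN</td><td>AUX</td><td>PRON</td><td>ADP</td><td>DET</td><td>PROPN</td></tr>
                <tr><th>Lemma</th><td>chistu</td><td>Piscicola</td><td>essiri</td><td>unu</td><td>di</td><td>lu</td><td>Faru</td></tr>
                <tr><th>Deprel</th><td>det</td><td>nsubj</td><td>cop</td><td>root</td><td>case</td><td>det</td><td>nmod</td></tr>
                <tr><th>Gloss</th><td>this</td><td>Piscicola</td><td>was</td><td>one</td><td>of</td><td>the</td><td>Faro</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <p>(2) # text = fòru ’mmarsamati propriamenti comu iddhi nisceru d’ ’u mari
# translation = furono imbalsamate proprio quando uscirono dal mare</p>
        <p>[Dependency tree of Example 2; relation labels include advcl.]</p>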
      </sec>
      <sec id="sec-2-5">
        <p>A comparison of the annotation provided by UDPipe with the manually corrected data enables us to evaluate the domain-transfer abilities of the parsing models when applied to the Sicilian data. In Table 3 in Appendix A, we report the scores (precision, recall and F1 for UPOS, LAS and UAS) obtained by the UDPipe models trained on ISDT and on PoSTWITA. These results confirm that the model based on ISDT outperforms the other one, although this may depend at least in part on the fact that the output of UDPipe trained on ISDT was the basis for the manual correction. The table shows that the ISDT-based model performs best on Colapisci (LAS F1 72.87) and worst on Amara Sapi (LAS F1 59.80). An in-depth investigation of these results is beyond the scope of this paper, but will be addressed in our future work. However, we can qualitatively observe that the performance of the two models differs for some phenomena. For example, the model trained on PoSTWITA was more robust in annotating verbs containing double clitic pronouns.</p>
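        <p>To make the three metrics concrete, the sketch below computes UPOS accuracy, UAS and LAS for a single sentence. It is not the evaluation script behind Table 3: it assumes identical tokenisation in the gold and predicted annotation (in which case precision, recall and F1 coincide), the function name attachment_scores is hypothetical, the gold rows are those of sentence 35 in Table 2, and the predicted rows are invented for illustration.</p>
        <preformat>
```python
def attachment_scores(gold, pred):
    """Return UPOS accuracy, UAS and LAS (in percent) for aligned token lists.

    Each token is a (upos, head, deprel) triple; both lists must have the
    same tokenisation, which is an assumption of this sketch.
    """
    if len(gold) != len(pred):
        raise ValueError("sketch assumes identical tokenisation")
    n = len(gold)
    upos_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # UAS: the head alone must match.
    uas_ok = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # LAS: head and dependency relation must both match.
    las_ok = sum(g[1:] == p[1:] for g, p in zip(gold, pred))
    return {"UPOS": 100 * upos_ok / n,
            "UAS": 100 * uas_ok / n,
            "LAS": 100 * las_ok / n}


# Gold annotation of sentence 35 (Table 2, Appendix A).
GOLD = [("PRON", 4, "nsubj"), ("ADP", 3, "case"), ("PRON", 1, "nmod"),
        ("VERB", 0, "root"), ("PRON", 6, "obj"), ("VERB", 4, "ccomp"),
        ("PUNCT", 4, "punct")]

# Hypothetical parser output: one tagging error, one wrong head, one wrong label.
PRED = [("NOUN", 4, "nsubj"), ("ADP", 3, "case"), ("PRON", 1, "nmod"),
        ("VERB", 0, "root"), ("PRON", 4, "obj"), ("VERB", 4, "xcomp"),
        ("PUNCT", 4, "punct")]

scores = attachment_scores(GOLD, PRED)
for name, value in scores.items():
    print(name, round(value, 2))
```
</preformat>
        <p>Because LAS requires both the head and the label to be correct, it can never exceed UAS, which is why the LAS figures quoted above are the most demanding of the reported scores.</p>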
        <sec id="sec-2-5-2">
          <title>4.3. Challenges in Dealing with the Sicilian Dialect</title>
          <p>The approach used to generate the annotated data, based on models available for Italian, has clearly brought out some characteristics and phenomena that differentiate Sicilian from Italian. It is in dealing with these phenomena that the parser produced the most annotation errors, and it is on them that the manual correction work was mostly concentrated.</p>
          <p>This section presents some choices we had to make to deal with some features of the Sicilian texts considered. In particular, we focus on tokenisation (articulated prepositions), lemmatisation (orthographic variations of some pronouns reflecting suprasegmental traits), and syntactic choices (focussing here on the reduplication phenomenon).</p>
          <sec>
            <title>4.3.1. Tokenisation Issues</title>
            <p>A particularly relevant phenomenon that emerged during the annotation is that of articulated prepositions, which have evolved over time through a process of grammaticalisation. Generally, many prepositions that in Italian occur in a unified form have undergone a transformation in Sicilian, first passing through a disjunct form (Example 1)28, then arriving at forms with elision (Example 2)29 [34, 31] and, in more recent times, with contraction (Example 3)30, although the disjunct form is still present, at least in some areas [33].</p>
            <table-wrap>
              <label>(2)</label>
              <table>
                <tbody>
                  <tr><th>Form</th><td>fòru</td><td>’mmarsamati</td><td>propriamenti</td><td>comu</td><td>iddhi</td><td>nisceru</td><td>d’</td><td>’u</td><td>mari</td></tr>
                  <tr><th>UPOS</th><td>AUX</td><td>VERB</td><td>ADV</td><td>SCONJ</td><td>PRON</td><td>VERB</td><td>ADP</td><td>DET</td><td>NOUN</td></tr>
                  <tr><th>Lemma</th><td>essiri</td><td>imbalsamari</td><td>propriamenti</td><td>comu</td><td>iddi</td><td>nesciri</td><td>di</td><td>lu</td><td>mari</td></tr>
                  <tr><th>Gloss</th><td>were</td><td>embalmed</td><td>right</td><td>as</td><td>they</td><td>came-out</td><td>from</td><td>the</td><td>sea</td></tr>
                </tbody>
              </table>
            </table-wrap>
            <p>(3) # text = S’asciucau i làcrimi câ manu
# translation = Si asciugò le lacrime colla mano</p>
            <table-wrap>
              <label>(3)</label>
              <table>
                <tbody>
                  <tr><th>Form</th><td>s’</td><td>asciucau</td><td>i</td><td>làcrimi</td><td>cu</td><td>la</td><td>manu</td></tr>
                  <tr><th>UPOS</th><td>PRON</td><td>VERB</td><td>DET</td><td>NOUN</td><td>ADP</td><td>DET</td><td>NOUN</td></tr>
                  <tr><th>Lemma</th><td>si</td><td>asciucari</td><td>lu</td><td>làcrima</td><td>cu</td><td>lu</td><td>manu</td></tr>
                  <tr><th>Deprel</th><td>expl</td><td>root</td><td>det</td><td>obj</td><td>case</td><td>det</td><td>obl</td></tr>
                  <tr><th>Gloss</th><td>oneself</td><td>wiped</td><td>the</td><td>tears</td><td>with</td><td>the</td><td>hand</td></tr>
                </tbody>
              </table>
            </table-wrap>
            <p>Contracted articulated prepositions, graphically marked by the circumflex accent [29, 30, 32], were split into two different tokens, as shown in Example 3. In this way we show, for each articulated preposition, the morphology attached to it, even in those cases in which it is not apparently visible, as it is nevertheless part of its evolution and can be described by formal rules. A different choice, namely not splitting it into two tokens, would have highlighted the grammaticalisation of this particular phenomenon. However, this choice might necessitate the creation of a specific UPOS, which would hinder cross-language comparisons.</p>
            <p>Similarly, the forms nta and ntâ differ in that the former is a simple preposition, equivalent to Italian in, while the latter is the articulated preposition. Depending on the gender and number of the article, it can also be rendered as ntô (masculine singular) or ntê (plural, both masculine and feminine).</p>
            <p>It is worth noting in this regard that the Italian preposition in can be rendered in Sicilian in various ways, such as in, ni, nni, nta [29]. The same is true for the Italian simple preposition da, which in Sicilian occurs in the forms di, ni and nni [29]. These different forms are also reflected in the corresponding articulated prepositions (e.g. the Italian preposition nello can be rendered as ntô, nô and nnô). Please see Sec. 4.3.2 for our lemmatisation choices for these variants.</p>
            <p>The complete scheme of the articulated preposition system in Sicilian is presented in Table 4 in Appendix A.</p>
            <p>28 English translation: This Piscicola was one from Faro.</p>
            <p>29 English translation: [...] were embalmed just as they emerged from the sea.</p>
            <p>30 English translation: He wiped away his tears with his hand.</p>
          </sec>
          <sec>
            <title>4.3.2. Lemmatisation Issues</title>
            <p>Concerning lemmatisation, as Sicilian does not have a unified orthography (although recent efforts try to standardise it [32]), the texts considered contain different variants of the same forms, which try to render different pronunciations. For example, there is no consistency in the transcription of the Sicilian word meaning ‘no one’, nuddu, which is pronounced with a long voiced retroflex stop: it is transcribed sometimes as nuddu and other times as nuḍḍu, stressing the retroflex pronunciation. Other variants of the same word are nuddru and nuddhu. Since our aim is not focused on phonetics, we lemmatised these occurrences without any pronunciation marks, i.e. nuddu, and decided not to make uniform the orthographic rendering (i.e. the form) of this word and of similar cases, e.g. ci/cci and ni/nni, as shown in Examples 4a-4b and 5a-5b, respectively.</p>
            <p>(4a) # text = ci succidìu accussì (LEMMA ci)
# translation = gli successe questo (this happened to him)</p>
            <p>(4b) # text = chi cci jemu a fari? (LEMMA ci)
# translation = che ci andiamo a fare? (what are we going to do there?)</p>
            <p>(5a) # text = ni chiamavanu "l’Armali" (LEMMA ni)
# translation = ci chiamavano "gli Animali" (they called us "the animals")</p>
            <p>(5b) # text = Chi nni putìa sapiri iu? (LEMMA ni)
# translation = Che ne potevo sapere io? (How could I know about that?)</p>
            <p>We applied the same principle to shortened oral variants of words, e.g. diri (‘to say’) or riri, both of which are abbreviated forms of diciri. All such variants have been lemmatised using the extended lemma, such as diciri in Example 8.</p>
            <p>To summarise, the main aim of lemmatisation is to reduce the sparseness of forms and their variants by reducing them to a common lemma, regardless of the causes of this sparseness. Therefore, in lemmatising the Sicilian3bank we applied the same strategy used in other resources where sparsity is determined, for example, by the writing style of the users (or by errors due to the writing device they use), as in PoSTWITA [42].</p>
          </sec>
          <sec>
            <title>4.3.3. Syntax Issues</title>
            <p>One of the cases in which we had to take a decision about a syntactic phenomenon is reduplication, a typical and widespread phenomenon in the Sicilian dialect [45], which consists in the repetition of a word, resulting in a shift or extension of meaning within the sentence. It is a phenomenon still highly productive in contemporary Sicilian, as shown by Amenta through the analysis of a corpus from the Atlante Linguistico della Sicilia [46], where these forms exhibit neither diachronic nor diastratic variation, thereby confirming the ongoing vitality of this linguistic process. This phenomenon can involve the reduplication of a verb to form an adjective or a noun; of a noun to form an adjective or an adverb; and of other PoS [47]. This last pattern, the most frequent in our texts, has several semantic implications, but is frequently used as a locational nominal modifier. In order to highlight the compound nature of this phenomenon (in [45, p. 350], it is clearly stated that it is not possible to interpose any words between the two elements of the reduplicated construct), we use the relation compound and the relation obl, in line with UD guidelines, as shown in Example 6.31 In addition, we added LOC=adv in the last column of the CoNLL-U file, as is done in VALICO-UD, to indicate that there is an adverbial locution.</p>
            <p>(6) # text = avìanu truvato campi campi
# translation = avevano trovato tra i campi</p>
            <table-wrap>
              <label>(6)</label>
              <table>
                <tbody>
                  <tr><th>Form</th><td>avìanu</td><td>truvatu</td><td>campi</td><td>campi</td></tr>
                  <tr><th>UPOS</th><td>AUX</td><td>VERB</td><td>NOUN</td><td>NOUN</td></tr>
                  <tr><th>Lemma</th><td>aviri</td><td>truvari</td><td>campu</td><td>campu</td></tr>
                  <tr><th>Deprel</th><td>aux</td><td>root</td><td>obl</td><td>compound</td></tr>
                  <tr><th>Gloss</th><td>had</td><td>found</td><td>fields</td><td>fields</td></tr>
                </tbody>
              </table>
            </table-wrap>
            <p>31 English translation: [...] they had found among the fields.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>4.4. A Cross-Linguistic Analysis Example</title>
        <p>In Sicilian, modal verbs, like the auxiliaries essiri (‘to be’) and aviri (‘to have’), can serve two main functions: they may appear independently with their own lexical meaning, or they may function as support verbs, combining with an infinitive (without a preposition) to convey specific modal values, such as: (i) ability/possibility → putiri (‘can’); (ii) will/desire → vuliri (‘want’); (iii) obligation/necessity → duviri (‘must’) or aviri a (‘have to’).</p>
        <p>In modern Sicilian, particularly in spoken usage, the periphrastic construction aviri a + infinitive is commonly employed to express modal meanings, especially obligation, replacing the older verb duvìri found in Old Sicilian [30] (see Example 7)32. Within this construction, the tense of aviri plays a central role in conveying modal values, whether epistemic or deontic. When aviri appears in the past remote, its perfective aspect confers an epistemic meaning, indicating certainty about the event’s occurrence in the past. In contrast, when aviri is used in the present or imperfect (both imperfective tenses), the construction can express either an epistemic sense of probability or a deontic sense of obligation or necessity. In some cases, especially with the present indicative or imperfect subjunctive, an exhortative function may also emerge [48].</p>
        <p>32 English translation: I should listen to you much more often.</p>
        <p>(7) # text = T’avissi a ’scutari cchiù assai
# translation = ti dovrei ascoltare molto di più
root
expl</p>
      </sec>
      <sec id="sec-2-7">
        <p>The annotation of such a resource in UD allows us to draw a parallel with other languages, for example with the English have to construction, which is similarly used to express obligation and certainty [49, p. 210]. In the English UD treebanks, to is consistently annotated as a particle when used in this way (see Example 9a in Appendix A). We therefore decided to treat the element a, which is usually tagged as a preposition in our corpus, as a particle in this specific construction. In Italian, avere da can be used with the same meaning (see Example 9b in Appendix A), but there da is not annotated as a particle. This might be due to historical reasons, to a different function of da in Italian than of to in English, or to the intent to highlight a less grammaticalised relation.</p>
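        <p>The retagging decision described above can be expressed as a simple rule over lemmas and tags. The sketch below is only an illustration, not the procedure followed during the manual correction: the function name retag_particle, the set PERIPHRASTIC_HEADS and the feature string of the infinitive are assumptions, and the second sample is an invented control phrase with hypothetical lemmas.</p>
        <preformat>
```python
# Lemmas of the verbs that govern the element "a" in the periphrastic
# constructions discussed in the text (aviri a + infinitive, veniri a diciri).
PERIPHRASTIC_HEADS = {"aviri", "veniri"}


def retag_particle(tokens):
    """Return a copy of the tokens where "a" is PART inside the periphrasis.

    The rule fires only when "a" is currently tagged ADP, directly follows
    one of the governing verbs and directly precedes an infinitive.
    """
    out = [dict(t) for t in tokens]
    for i in range(1, len(out) - 1):
        prev_tok, tok, next_tok = out[i - 1], out[i], out[i + 1]
        if (tok["lemma"] == "a" and tok["upos"] == "ADP"
                and prev_tok["lemma"] in PERIPHRASTIC_HEADS
                and next_tok["upos"] == "VERB"
                and "VerbForm=Inf" in next_tok["feats"]):
            tok["upos"] = "PART"
    return out


# Example 8 of the paper: "Chi veni a diri?" (feature values are assumed).
EXAMPLE_8 = [
    {"form": "Chi", "lemma": "chi", "upos": "PRON", "feats": "_"},
    {"form": "veni", "lemma": "veniri", "upos": "VERB", "feats": "VerbForm=Fin"},
    {"form": "a", "lemma": "a", "upos": "ADP", "feats": "_"},
    {"form": "diri", "lemma": "diciri", "upos": "VERB", "feats": "VerbForm=Inf"},
    {"form": "?", "lemma": "?", "upos": "PUNCT", "feats": "_"},
]

# Invented control phrase: "a" introduces a noun, so it must stay ADP.
CONTROL = [
    {"form": "jemu", "lemma": "iri", "upos": "VERB", "feats": "VerbForm=Fin"},
    {"form": "a", "lemma": "a", "upos": "ADP", "feats": "_"},
    {"form": "casa", "lemma": "casa", "upos": "NOUN", "feats": "_"},
]

print([t["upos"] for t in retag_particle(EXAMPLE_8)])
print([t["upos"] for t in retag_particle(CONTROL)])
```
</preformat>
        <p>Restricting the rule to a closed set of governing lemmas keeps ordinary prepositional uses of a untouched, which matters for cross-treebank comparison with the Italian and English annotations mentioned above.</p>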
        <p>Another periphrastic construction found in the treebank texts is veniri + a + diciri (literally, in Italian, venire a dire), which can have the meaning of the Italian verb significare (‘to mean’). In such cases, we treated it in the same way as the previous one, as shown in Example 8.33</p>
        <p>(8) # text = Chi veni a diri?
# translation = Che significa ?</p>
        <p>punct
root
obj</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and Future Work</title>
      <sec id="sec-3-1">
        <p>We can create a world that sustains its languages [50]. Among the concrete actions we can take to achieve this goal is speaking and studying the original languages of our places.</p>
        <p>This paper describes and discusses the issues involved in the development of the first release of the Sicilian3bank. We encountered many challenges in dealing with a language that has never been treated before and that is, in addition, a dialect, carrying with it an uninterrupted history of oral transmission but lacking a standardised form of transcription and a unified treatment of phenomena in grammars.</p>
        <p>33 English translation: What does it mean?</p>
        <p>The project we present here is therefore intended solely as a preliminary foundation and proposal, which still requires substantial further work and numerous improvements. First, we plan to include more texts and to compute inter-annotator agreement, in order to verify the soundness of the guidelines. Second, we plan to enrich the corpus by introducing Italian glosses in the MISC column of the CoNLL-U file. In the current version, each sentence is accompanied by a fluent Italian translation in a comment line; we propose adding a literal word-for-word translation from Sicilian into Italian. Although this form of translation may result in grammatically incorrect or unnatural Italian, it would provide an almost word-by-word aligned parallel resource that mirrors the syntactic structure of the original Sicilian sentences and would facilitate the study of syntactic calques. Third, a future objective is to manually validate the automatic annotation generated with UDPipe for the aligned Italian resource as well. This step is needed to give the Italian parallel dataset the same quality we currently provide for the Sicilian annotated data. Fourth, another interesting enhancement might be to systematically include graphic accents on all verb lemmas, to make them easier to read, and to include the International Phonetic Alphabet transcription in the MISC column of the CoNLL-U file. This idea is motivated by the desire to turn the resource not only into a syntactic dataset but also into a tool to support language learning, scientific studies and the preservation of Sicilian. Finally, an aspect we would like to improve in the future concerns the translation of proper nouns. As already discussed, we encountered several challenges in translating these elements, which ultimately led us to the decision not to translate the proper nouns found in the texts at this stage. The focus of this work is the development of a Sicilian treebank, and although a deeper engagement with translation would certainly have added valuable insights, it would have diverted attention from the project’s primary objective. We therefore plan to revisit this aspect in a later phase of the project.</p>
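        <p>As an illustration of the second point, word-level glosses could be stored as key=value pairs in the MISC column, next to attributes that are already there. The sketch below is only one possible encoding: the key name Gloss, the function name add_misc, the Italian glosses and the choice of the token carrying LOC=adv are all assumptions made for the example, which reuses the tokens of Example 6.</p>
        <preformat>
```python
def add_misc(misc, key, value):
    """Add or replace one key=value pair in a CoNLL-U MISC field.

    An empty MISC field is written as "_"; pairs are separated by "|".
    """
    pairs = [] if misc == "_" else misc.split("|")
    # Drop a previous value for the same key so the pair is not duplicated.
    pairs = [p for p in pairs if not p.startswith(key + "=")]
    pairs.append(key + "=" + value)
    return "|".join(pairs)


# (form, MISC) pairs for Example 6; which token carries LOC=adv is assumed.
TOKENS = [("avìanu", "_"), ("truvatu", "_"), ("campi", "_"), ("campi", "LOC=adv")]
# Hypothetical word-for-word Italian glosses.
GLOSSES = ["avevano", "trovato", "campi", "campi"]

enriched = [(form, add_misc(misc, "Gloss", gloss))
            for (form, misc), gloss in zip(TOKENS, GLOSSES)]
for form, misc in enriched:
    print(form, misc)
```
</preformat>
        <p>The same helper could hold an IPA transcription under a second key, so that both proposed enrichments stay within the ten-column format.</p>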
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <sec id="sec-4-1">
        <p>We would like to express our gratitude to Giuseppe Domenico Muscianisi, PhD, from the University of Parma, for very kindly sharing his expertise with us, which was instrumental in resolving several of our questions and improving our knowledge of the literature on the Sicilian dialect.</p>
        <p>Special thanks go to the JRC internal reviewers and to the CLiC-it 2025 anonymous reviewers for their valuable comments.
</p>
        <p>[1] S. Bird, D. Yibarbuk, Centering the Speech Community, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics - Volume 1: Long Papers, ACL, St. Julian’s, Malta, 2024, pp. 826–839. URL: https://aclanthology.org/2024.eacl-long.50/. doi:10.18653/v1/2024.eacl-long.50.</p>
        <p>[2] B. De Longueville, I. Sanchez, S. Kazakova, S. Luoni, F. Zaro, K. Daskalaki, M. Inchingolo, The Proof is in the Eating: Lessons Learnt from One Year of Generative AI Adoption in a Science-for-Policy Organisation, AI 6 (2025) 128.</p>
        <p>[3] A. Ramponi, Language Varieties of Italy: Technology Challenges and Opportunities, Transactions of the Association for Computational Linguistics 12 (2024) 19–38. doi:10.1162/tacl_a_00631.</p>
        <p>[4] E. M. Bender, The #BenderRule: On Naming the Languages We Study and Why It Matters, The Gradient (2019).</p>
        <p>[5] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational Linguistics 47 (2021) 255–308. URL: https://aclanthology.org/2021.cl-2.11/. doi:10.1162/coli_a_00402.</p>
        <p>[6] H. Bunt, P. Merlo, J. Nivre (Eds.), Trends in Parsing Technology: Dependency Parsing, Domain Adaptation, and Deep Parsing, volume 43, Springer Science &amp; Business Media, 2010.</p>
        <p>[7] D. Zeman, M. Popel, M. Straka, J. Hajič, J. Nivre, F. Ginter, J. Luotolahti, S. Pyysalo, S. Petrov, M. Potthast, F. Tyers, E. Badmaeva, M. Gokirmak, A. Nedoluzhko, S. Cinkova, J. Hajic jr., J. Hlaváčová, V. Kettnerová, Z. Urešová, J. Kanerva, S. Ojala, A. Missilä, C. D. Manning, S. Schuster, S. Reddy, D. Taji, N. Habash, H. Leung, M.-C. de Marneffe, M. Sanguinetti, M. Simi, H. Kanayama, V. dePaiva, K. Droganova, H. Martínez Alonso, C. Çöltekin, U. Sulubacak, H. Uszkoreit, V. Macketanz, A. Burchardt, K. Harris, K. Marheinecke, G. Rehm, T. Kayadelen, M. Attia, A. Elkahky, Z. Yu, E. Pitler, S. Lertpradit, M. Mandl, J. Kirchner, H. F. Alcalde, J. Strnadová, E. Banerjee, R. Manurung, A. Stella, A. Shimada, S. Kwak, G. Mendonça, T. Lando, R. Nitisaroj, J. Li, CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, in: J. Hajič, D. Zeman (Eds.), Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 1–19.</p>
        <p>[8] M. Sanguinetti, C. Bosco, Building the multilingual TUT parallel treebank, in: Proceedings of The Second Workshop on Annotation and Exploitation of Parallel Corpora, 2011, pp. 19–28.</p>
        <p>[9] M. Sanguinetti, C. Bosco, PartTUT: The Turin University Parallel Treebank, in: R. Basili, C. Bosco, R. Delmonte, A. Moschitti, M. Simi (Eds.), Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Springer, 2015, pp. 51–69.</p>
        <p>[10] R. Steinberger, M. Ebrahim, A. Poulis, M. Carrasco-Benitez, P. Schlüter, M. Przybyszewski, S. Gilbro, An overview of the European Union’s highly multilingual parallel corpora, Language Resources and Evaluation 48 (2014) 679–707.</p>
        <p>[11] Y. Berzak, J. Kenney, C. Spadine, J. X. Wang, L. Lam, K. S. Mori, S. Garza, B. Katz, Universal Dependencies for Learner English, in: E. Katrin, A. S. Noah (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2016.</p>
        <p>[12] E. Di Nuovo, Introducing Valico-UD: A Parallel, Learner Italian Treebank for Language Learning Research, Pàtron, 2023.</p>
        <p>[13] G. Berruto, Lingua, dialetto, diglossia, dilalia, in: G. Holtus, J. Kramer (Eds.), Romania et Slavia Adriatica. Festschrift für Zarko Muljačić, Buske, Hamburg, 1987, pp. 57–81.</p>
        <p>[14] V. Blaschke, B. Kovačić, S. Peng, H. Schütze, B. Plank, MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 10921–10938. URL: https://aclanthology.org/2024.lrec-main.953/.</p>
        <p>[15] S. Peng, Z. Sun, H. Shan, M. Kolm, V. Blaschke, E. Artemova, B. Plank, Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 14478–14493. URL: https://aclanthology.org/2024.lrec-main.1262/.</p>
        <p>[16] X. M. Krückl, V. Blaschke, B. Plank, Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study, in: Y. Scherrer, T. Jauhiainen, N. Ljubešić, P. Nakov, J. Tiedemann, M. Zampieri (Eds.), Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, Association for Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 128–146. URL: https://aclanthology.org/2025.vardial-1.10/.</p>
        <p>[17] J. E. Bonilla, Spoken Spanish PoS tagging: gold standard dataset, Language Resources and Evaluation 59 (2025) 983–1012. doi:10.1007/s10579-024-09751-x.</p>
        <p>[18] J. E. Bonilla, Development of the first spoken spanish treebank within the universal dependencies framework: A multi-regional approach, submitted.</p>
        <p>[19] C. Adsuar Ávila, Automatic Speech Recognition in Dialectal Data (COSER), 2024. URL: https://audias.ii.uam.es/2024/10/30/automatic-speech-recognition-in-dialectal-data-coser/, Presentation at the AUDIAS-UAM Seminar, October 30, 2024.</p>
        <p>[20] S. Vakirtzian, V. Stamou, Y. Kazos, S. Markantonatou, Dialectal treebanks and their relation with the standard variety: The case of East Cretan and Standard Modern Greek, in: R. Johansson, S. Stymne (Eds.), Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), University of Tartu Library, Tallinn, Estonia, 2025, pp. 776–784. URL: https://aclanthology.org/2025.nodalida-1.77/.</p>
        <p>[21] P. Prokopidis, H. Papageorgiou, Experiments for Dependency Parsing of Greek, in: Y. Goldberg, Y. Marton, I. Rehbein, Y. Versley, Ö. Çetinoğlu, J. Tetreault (Eds.), Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, Dublin City University, Dublin, Ireland, 2014, pp. 90–96. URL: https://aclanthology.org/W14-6109/.</p>
        <p>[22] S. Lusito, J. Maillard, A Universal Dependencies corpus for Ligurian, in: M. de Lhoneux, R. Tsarfaty (Eds.), Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), Association for Computational Linguistics, Sofia, Bulgaria, 2021, pp. 121–128. URL: https://aclanthology.org/2021.udw-1.10/.</p>
        <p>[23] E. Wdowiak, Sicilian Translator: A Recipe for Low-Resource NMT, 2021. URL: https://arxiv.org/abs/2110.01938. arXiv:2110.01938.</p>
        <p>[24] R. Sennrich, B. Haddow, A. Birch, Improving Neural Machine Translation Models with Monolingual Data, in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 86–96. URL: https://aclanthology.org/P16-1009/. doi:10.18653/v1/P16-1009.</p>
        <p>[25] A. Traina, Nuovo vocabolario siciliano-italiano, Palermo, Lauriel, 1868.</p>
        <p>[26] G. Biundi, Vocabolario manuale completo siciliano-italiano seguito da un’appendice e da un elenco di nomi proprj siciliani: coll’aggiunta di un dizionario geografico in cui sono particolarmente descritti i nomi di città, fiumi, villaggi ed altri luoghi rimarchevoli della Sicilia: e corredato di una breve grammatica per gl’Italiani, Palermo, Carini, 1851.</p>
        <p>[27] V. Mortillaro, Nuovo dizionario siciliano-italiano. Volume unico, Palermo, Stabilimento tipografico Lao, 1876.</p>
        <p>[28] R. Rocca, Dizionario Siciliano-Italiano compilato su quello del Pasqualino con aggiunte e correzioni. Volume unico, Catania, Pietro Giunti Editore, 1839.</p>
        <p>[29] A. Fortuna, Grammatica siciliana: Principali regole grammaticali, fonetiche e grafiche (comparate tra i vari dialetti siciliani), Caltanissetta, Terzo Millennio Editore, 2002.</p>
        <p>[30] F. Giacalone, Prammatica siciliana. Storia della nostra lingua, proverbi, curiosità, modi di dire, consigli pratici per una corretta scrittura, Trapani, Edizioni Colorgrafica, 2009.</p>
        <p>[31] A. Messina, Grammatica sistematica della lingua siciliana. Dall’ortoepia all’ortografia. Dall’analisi grammaticale all’analisi logica e del periodo. Con antologia esemplificativa dei poeti. Seconda edizione riveduta e ampliata con 30 chine sui mestieri d’una volta eseguite da Francesco Nania e poesie, Assessorato alle politiche scolastiche di Siracusa, 2007.</p>
        <p>[32] S. Baiamonte, Documento per l’ortografia del siciliano. Documentu pi l’ortugrafìa dû sicilianu. II edizione, Cademia Siciliana, 2024.</p>
        <p>[33] Lingua siciliana. Come scrivere in siciliano, n.d. URL: https://linguasiciliana.com/come-scrivere-in-siciliano/.</p>
        <p>[34] M. Gorini, Ortografia Siculo-Calabra, 2017. URL: https://michelegorini.blogspot.com/2017/08/ortografia-siculo-calabra.html.</p>
        <p>[35] G. Gerbino, N. Barone, Cenni di ortografia siciliana, Trapani, Jò A.L.A.S.D., 2011.</p>
        <p>[36] V. Lumia, La Nostra Grammatica Siciliana, Trapani, Jò A.L.A.S.D., 2010.</p>
        <p>[37] N. Russo, Corso di grammatica siciliana, Forum Lingua siciliana 2003.</p>
        <p>[38] L. Wang, Z. Du, W. Jiao, C. Lyu, J. Pang, L. Cui, K. Song, D. Wong, S. Shi, Z. Tu, Benchmarking and Improving Long-Text Translation with Large Language Models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7175–7187. URL: https://aclanthology.org/2024.findings-acl.428/. doi:10.18653/v1/2024.findings-acl.428.</p>
        <p>[39] K. Brøndsted, C. Dollerup, The names in Harry Potter, Perspectives: Studies in Translatology 12 (2004) 56–72. doi:10.1080/0907676X.2004.9961490.</p>
        <p>[40] C. Mastrangelo, Harry Potter in Translation: Comparison of Nine Romance Languages in the Translation of Proper Names in Harry Potter and the Philosopher’s Stone, Transletters. International Journal of Translation and Interpreting (2024) 1–28.</p>
        <p>[41] C. Bosco, S. Montemagni, M. Simi, Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank, in: A. Pareja-Lora, M. Liakata, S. Dipper (Eds.), Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 61–69. URL: https://aclanthology.org/W13-2308/.</p>
        <p>[42] M. Sanguinetti, C. Bosco, A. Lavelli, A. Mazzei, O. Antonelli, F. Tamburini, PoSTWITA-UD: an Italian Twitter Treebank in Universal Dependencies, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018, pp. 1768–1775. URL: https://aclanthology.org/L18-1279/.</p>
        <p>[43] G. Guibon, M. Courtin, K. Gerdes, B. Guillaume, When Collaborative Treebank Curation Meets Graph Grammars, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 5291–5300.</p>
        <p>[44] E. Di Nuovo, M. Sanguinetti, A. Mazzei, E. Corino, C. Bosco, VALICO-UD: Treebanking an Italian Learner Corpus in Universal Dependencies, IJCoL. Italian Journal of Computational Linguistics 8 (2022).</p>
        <p>[45] L. Amenta, La reduplicazione sintattica in siciliano, Bollettino del Centro di studi filologici e linguistici siciliani 22 (2010) 345–358.</p>
        <p>[46] G. Ruffino, Linee di discussione a ipotesi di lavoro per l’Atlante Linguistico della Sicilia, in: Actas do XIX Congreso Internacional de Lingüística e Filoloxia Románicas (1989), volume VIII, A Coruña, 1996, pp. 649–682.</p>
        <p>[47] G. Todaro, F. Villoing, P. Gréa, INTERNAL LOCALISATION NN ADV REDUPLICATION IN SICILIAN, in: Colloque International de Morphology, volume 22, Bordeaux, France, 2012.</p>
        <p>[48] L. Amenta, Perifrasi verbali in siciliano, in: J. Garzonio (Ed.), Studi sui dialetti della Sicilia, Unipress, Padova, 2010, pp. 1–20.</p>
        <p>[49] M. Swan, Practical English Usage 3rd edition, Oxford University Press, 2005.</p>
        <p>[50] S. Bird, Beyond Technological Solutions: How we Create a World that Sustains its Languages, Linguapax Review 9 (2022) 167–173.</p>
        <p>
A. Appendix</p>
        <preformat># sent_id = 35
# text = Nuḍḍu di nuiautri sapìa soccu fari.
# translation = Nessuno di noi sapeva cosa fare.
1	Nuḍḍu	nuddu	PRON	PI	Gender=Masc|Number=Sing|PronType=Ind	4	nsubj	_	_
2	di	di	ADP	E	_	3	case	_	_
3	nuiautri	nuiautri	PRON	PE	Number=Plur|Person=1|PronType=Prs	1	nmod	_	_
4	sapìa	sapiri	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin	0	root	_	_
5	soccu	soccu	PRON	PQ	Number=Sing|PronType=Int	6	obj	_	_
6	fari	fari	VERB	V	VerbForm=Inf	4	ccomp	_	SpaceAfter=No
7	.	.	PUNCT	FS	_	4	punct	_	SpacesAfter=\r\n</preformat>
        <p>
(9a) [From EWT treebank]
# sent_id = weblog-blogspot.com_alaindewitt_20060827093500_ENG_20060827_093500-0017
# text = The wedding had to be postponed as family members fled the outbreak of the war, she said.</p>
        <p>root
The
DET
the
det</p>
        <p>nsubj
wedding
NOUN
wedding
xcomp
mark</p>
        <p>aux:pass
had
VERB
have
(9b) [From ISDT treebank]
# sent_id = isst_tanl-1497
# text = ho da dire anche molte cose che avrei da dire contro me stesso</p>
        <p>ho
VERB
avere
have
xcomp</p>
        <p>mark
da
ADP
da
to
dire
VERB
dire
say</p>
        <p>det
advmod
molte
DET
molto
many
root
cose
NOUN
cosa
things
acl:relcl</p>
        <p>obj</p>
        <p>Declaration on Generative AI</p>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI), Grammarly, and GPT@JRC (an internal JRC testbed for LLMs; the model used there is an on-premises installation of LLaMa 3.3 70B) in order to: paraphrase and reword, improve writing style, check grammar and spelling, and manage citations. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>