
Ciallabacialla! Modeling and Linking a Regional Lexical Resource to Include Sicilian in the Semantic Web

Rachele Sprugnoli

Giovanni Moretti

Domenico Giuseppe Muscianisi

Eleonora Litta

Università Cattolica del Sacro Cuore, Largo Gemelli, 1, 20123 Milano, Italy; Università di Parma, Via D'Azeglio, 85, 43125 Parma, Italy

2025

This paper describes the inclusion of Sicilian in the Semantic Web through the development of new resources aligned with Linguistic Linked Open Data principles. More specifically, we model and publish the first Sicilian Lemma Bank and a bilingual Sicilian-Italian glossary extracted from the Sicilian Wiktionary (Wikizziunariu). These resources are formalized using the OntoLex-Lemon and LiLa (Linking Latin) ontologies with the aim of enabling cross-lingual interoperability. The glossary is also linked to the LiITA (Linking Italian) knowledge base. In addition, two preliminary experiments are reported: the first evaluates the translation capabilities of commercial Large Language Models (LLMs) from Sicilian into Italian; the second investigates bilingual lexicon induction through cross-lingual embedding alignment, with results indicating the challenges posed by low-resource dialects. This work aims to demonstrate the feasibility and importance of integrating under-resourced languages into broader Computational Linguistics and Semantic Web infrastructures.

Keywords: Sicilian, Linguistic Linked Open Data, Semantic Web, lexical resources, dialectology

CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy
* Corresponding author.
† This paper is the result of the collaboration between the authors. For the specific concerns of the Italian academic attribution system, Rachele Sprugnoli is responsible for sections 3, 5, 6; Eleonora Litta for section 1; Domenico Giuseppe Muscianisi for sections 2 and 4 (the latter with the aid of Rachele Sprugnoli). Section 4 was collaboratively written by Rachele Sprugnoli and Giuseppe Muscianisi. Giovanni Moretti is responsible for the technical implementation of the Sicilian Lemma Bank and of the modeling of the bilingual lexicon.
Email: rachele.sprugnoli@unicatt.it (R. Sprugnoli); giovanni.moretti@unicatt.it (G. Moretti); domenicogiuseppe.muscianisi@unipr.it (D. G. Muscianisi); eleonoramaria.litta@unicatt.it (E. Litta)
ORCID: 0000-0001-6861-5595 (R. Sprugnoli); 0000-0001-7188-8172 (G. Moretti); 0000-0002-2964-856X (D. G. Muscianisi); 0000-0002-0499-997X (E. Litta)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The LiITA (Linking Italian) project is dedicated to developing an interoperable Knowledge Base (KB) for Italian linguistic resources. Its primary goal is to construct a network that interconnects diverse Italian language datasets (such as dictionaries, lexicons, and textual corpora) by publishing them as Linked Open Data (LOD). At the core of LiITA is the Lemma Bank (LB), a continually expanding repository of canonical citation forms (lemmas) for Italian words [1]. The LB functions as a central hub, enabling interlinking and interoperability across various linguistic datasets. By aligning lexical entries and word occurrences from distributed resources with their corresponding lemmas, LiITA supports federated search capabilities and facilitates advanced linguistic analyses.

LiITA adopts the OntoLex-Lemon [2] model as its foundational standard for the representation of lexical resources. This ensures that data is structured according to widely accepted Semantic Web principles, thereby promoting interoperability and reusability. OntoLex-Lemon provides a framework for linking lexical entries to their meanings and to related linguistic properties. LiITA utilizes this framework to establish connections between lemmas in the LB, their occurrences in texts, and their corresponding entries in lexicons and dictionaries.

Although the LiITA Knowledge Base primarily focuses on resources related to the Italian language, it is important to acknowledge that Italy is home to a rich array of local languages. Many of these are endangered, predominantly oral, and often lack standardized orthographies. A recent paper [3] offers a critical examination of Italy’s linguistic landscape, challenging mainstream Natural Language Processing (NLP) approaches. The study highlights the fragmented and underdeveloped state of NLP research for many Italian language varieties. Given that language inherently encodes local knowledge, cultural traditions, and historical memory, the loss of these varieties entails a significant erosion of cultural heritage. Despite this, the language varieties of Italy are increasingly represented in multilingual NLP initiatives. These include participation in shared tasks on morphological inflection and on language identification (see for example [4]). Additional contributions include cross-lingual word embeddings for low-resource settings and the inclusion of Italian varieties like Lombard, Piedmontese, and Sicilian in multilingual pretrained language models, such as mBERT [5].1 However, these varieties remain under-represented in terms of training data volume and quality. Moreover, multilingual NLP still tends to treat language varieties "monolithically", without adequate consideration for their distinct orthographic conventions, sociolinguistic contexts, or community-specific needs. In this light, the integration of bilingual dictionaries and other lexical resources for Italy’s minority languages into the LiITA LOD framework would represent a concrete step toward supporting these under-resourced languages. Such inclusion would enhance their digital visibility, promote accessibility, and contribute to the broader goal of preservation and exchange of information on linguistic diversity.

1 See [3] for other bibliographical details about these recent efforts.

The first bilingual glossary to be included in the LiITA KB was the one published in the Vocabolario della lingua parmigiana [6]: data in RDF and CSV format together with a set of SPARQL queries are available online [7].2 This paper, instead, concerns the modeling and linking of the Sicilian Wiktionary (Wikizziunariu). More specifically, this paper provides the following three contributions:

1. the modeling of the first Lemma Bank for Sicilian and of a Sicilian-Italian glossary extracted from the Wikizziunariu according to the Linguistic Linked Open Data principles;3
2. the linking of the glossary to the KB of the LiITA project;4
3. the results of two preliminary NLP experiments using the aforementioned bilingual glossary.

2 https://github.com/LiITA-LOD/LocalVarieties/tree/main/Parmigiano
3 The data in both Turtle RDF and TSV format are available on a dedicated GitHub together with a set of ready-to-use queries: https://github.com/LiITA-LOD/LocalVarieties/tree/main/Siciliano
4 https://www.liita.it

2. The Sicilian Dialect

Dialects constitute an essential component of Italy’s linguistic heritage. In this study, it is important to clarify the intended meaning of the term dialect, which corresponds to the Italian dialetto, i.e., a regional or areal language that is genealogically a sister language to so-called Standard Italian, as defined in the Vocabolario Treccani5 (see also [8]). The dialects of Italy are, in fact, independent Romance languages that, over the centuries, have become minoritized local varieties. This shift is primarily attributable to the prestige and diffusion of the volgare fiorentino following the works of Dante, Petrarch, and Boccaccio, whose literary influence from the 14th century onward played a central role in shaping the literary language of the Italian Peninsula and, eventually, the codification of present-day Standard Italian. Today, Standard Italian functions as the roof-language for the various Italo-Romance dialects spoken across the country [9]. However, the medieval Sicilian volgare was among the first Romance varieties to be used as a literary language, particularly at the court of Emperor Frederick II of Swabia, who established his principal seat in Palermo between 1220 and 1250. During this period, poetry and the arts flourished, giving rise to the Scuola poetica siciliana, which Dante, in his De vulgari eloquentia, regarded as the earliest manifestation of an "Italian" literary tradition.

5 http://www.treccani.it/vocabolario/dialetto_res-545debd7-0018-11de-9d89-0016357eee51/

According to the Carta by Giovan Battista Pellegrini, the Sicilian dialect is placed as group III (siciliano) among the Extreme Southern dialects of Italy (henceforth abbreviated as ESI, or Meridionale Estremo in Italian), with seven varieties based on the presence of umlaut (metafonesi), namely Western Sicilian, Central umlauted area, South-Eastern umlauted area, Original non-umlauted area, Messinese, Aeolian and Pantesco. This classification, along with others that have been proposed (see, for example, [10]), highlights the structural and sociolinguistic complexity of the Sicilian dialect. Moreover, due to its geographical location at the crossroads of the Mediterranean, Sicily has historically been (and continues to be) a site of intense cultural, communal, and linguistic contact [11]. Although the Sicilian dialect retains its core Italo-Romance structural features, it has undergone significant stratification due to successive waves of linguistic contact from Late Antiquity through the Middle Ages. Early layers include influences from (Byzantine) Greek, particularly in eastern Sicily, and from Sicilian Arabic in the west.

Subsequent periods of contact include the Norman era (10th–12th centuries) and the reign of Frederick II (ending in 1266), followed by the Angevin rule and the Sicilian Vespers (1282), which introduced Gallo-Romance elements. Later, during the Aragonese and Spanish periods (14th–17th centuries), further Ibero-Romance influences were integrated into the language. Following the medieval period, Sicilian dialects began to evolve into their modern forms. In addition, various linguistic minority communities have historically settled in Sicily. The oldest still active is that of Piana degli Albanesi, the largest Arbëreshë (Italo-Albanian) settlement on the island, established at the end of the 16th century. Another notable case is the Gallo-Italic of Sicily, comprising approximately 15 isolated communities in central and eastern Sicily, whose origins trace back to the Norman period. A third group is the Sicilian Greek community in Messina, officially recognized as a linguistic minority in 2012, which descends from settlers who migrated from the Peloponnese in the mid-16th century. Today, the varieties of Sicilian spoken in these areas exhibit significant influence from these non-Italo-Romance minority languages. The long and complex sociolinguistic history of the Sicilian dialect, together with its internal variation and multilingual contact layers, renders it a particularly rich and compelling subject for investigation through computational methods.

2.1. Dictionaries and Grammars of Sicilian

With such a history, studies on the dialects of Sicily, covering both language and culture, have a long-standing tradition dating back to the Middle Ages. However, for a comprehensive understanding of the present-day language, the most informative period for the study of Sicilian begins with Italian Romanticism, specifically in the mid to late 19th century. Shortly after the Unification of Italy, Antonino Traina published the Nuovo vocabolario siciliano–italiano, a dictionary lemmatized according to Sicilian entries, which provided Italian translations as well as phraseological examples drawn from idiomatic expressions and literary sources, encompassing both cultivated and popular registers [12]. As was typical of the period, Traina’s underlying objective was to promote the Tuscan-based national language, thereby contributing to the broader project of fostering social and linguistic unification among the newly formed Italian citizenry. In the same period, the most influential scholar of the Sicilian language and cultural traditions was Giuseppe Pitrè, author of the monumental Fiabe, novelle e racconti popolari siciliani [13] and Grammatica Siciliana [14]. In his linguistic work, Pitrè approached Sicilian as a Romance language in its own right, analyzing its phonology diachronically from Latin without reference to Tuscan (i.e., Italian), which he explicitly treated as a separate variety rather than a standard of comparison. Both Traina and Pitrè promoted a spelling standardization rooted in Latin orthographic principles. This approach had a dual effect: on the one hand, it contributed to the definition of a kind of Sicilian koine (common language); on the other hand, it introduced a bias towards the Latinization of Sicilian [15]. This process of standardization continues to play a fundamental role today. In 2024, the Cademia Siciliana (Sicilian Academy) published the Documento per l’ortografia del siciliano (Document for the spelling of Sicilian), which aims to be accessible to anyone who wants to write in Sicilian. On the scientific and academic side, the most important linguistic and ethnographic research on Sicilian consists of the pioneering investigation by Franco Fanciullo on the Aeolian Islands [16].

Besides the Dictionary by Traina, two other fundamental lexicographic resources for the Sicilian dialect are the Vocabolario storico-etimologico del siciliano and the Vocabolario siciliano, both published in print by the Centro di studi filologici e linguistici siciliani. As far as digital dictionaries are concerned, there is the Vocabolario del siciliano medievale6 of the University of Catania, which collects lemmas of the volgare siciliano from the mid 13th to the mid 16th century and provides a Web interface [17]. Within this context of rich historical and linguistic tradition, Wikizziunariu emerges as a collaborative resource that is easily accessible, machine-readable, and free from copyright restrictions.

6 http://artesia.unict.it/vocabolario

3. Workflow

This work was carried out in two main phases. The first involved parsing a dump of the Sicilian Wiktionary (Wikizziunariu) to extract information relevant to our objectives. The second phase focused on modeling and creating resources in RDF format. This latter step includes the construction of a Sicilian LB, the transformation of Wiktionary data into RDF triples, and the linking of Italian translations to the LiITA LB developed within the LiITA project.

3.1. Data Extraction

The dump of the Sicilian Wiktionary, downloaded from the Academic Computer Club archive in Umeå,7 was parsed using a custom script designed to extract relevant data. Figure 1 illustrates the structure of an entry from which the following elements were retrieved: the page title (abbentu), the grammatical category (Sustantivu, i.e., common noun), number and gender (singulari maschili), alternative forms (puru scrittu abbientu), and the Italian translation(s) (i.e., values following the label talianu in the Traduzzioni section, such as riposo, quiete, pace).

7 https://hammurabi.ftp.acc.umu.se/mirror/wikimedia.org/dumps/backup-index.html

The main challenge in the extraction process stemmed from the variability in how information is structured across entries. For example, number and gender may be represented using initials (e.g., s for sostantivo, noun, m for masculine, and f for feminine). Furthermore, while alternative forms are always enclosed in parentheses, they are not always preceded by the phrase puru scrittu, and the number of translations varies. In some cases, these translations are accompanied by information about the grammatical gender of the Italian equivalents (e.g., maschili and f, as shown in the figure).

A total of 14,464 entries were extracted through this process, distributed across 20 distinct classes. Twelve of these correspond to traditional grammatical categories: adjectives, adverbs, articles, coordinating conjunctions, interjections, common nouns, proper nouns, numerals, prepositions, pronouns, subordinating conjunctions, and verbs. In addition, the entries included acronyms, confixes, prefixes, suffixes, nominal phrases, multiword expressions, proverbs, and conjugated verb forms. These latter entries were not included in the subsequent stages of the work, as they cannot be directly mapped to a LB. Table 1 presents the final number of entries considered for each grammatical category and provides examples for each category; the original categories have been converted into UPOS (Universal Dependencies Part of Speech) tags [18]. The low number of determiners (DET) is due to the fact that, in the original classification, this category includes only articles, while other types of determiners are assigned to different classes; for example, possessive determiners are categorized as adjectives or pronouns.

Table 1. Examples for each UPOS category (entry counts not recoverable).

UPOS    Examples
NOUN    puntaperi (kick), ràrica (root)
VERB    aççiari (to find), studiari (to study)
ADJ     nastenti (stubborn), sicilianu (Sicilian)
ADV     nsièmmula (together), viatu (soon)
ADP     cu (with), nt’a (in the)
PRON    iddi (them), nui (we)
NUM     cincu (five), sìrici (sixteen)
PROPN   Cìfaru (Lucifer), Aropa (Europe)
INTJ    olè, osara
DET     nu (a/an), lu (the)
SCONJ   mentri (while), pirchistu (therefore)
CCONJ   anchi (also), nì (neither)
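As a rough illustration of the extraction step, the following Python sketch parses a simplified plain-text rendering of an entry like the one described for Figure 1. The entry layout, the patterns matched by the regular expressions, and the function name parse_entry are assumptions made for this example only; the actual dump uses wiki markup, and the authors' custom script is not reproduced here.

```python
import re

# Hypothetical, simplified rendering of a Wikizziunariu entry (not the real wiki markup).
SAMPLE_ENTRY = """\
abbentu
Sustantivu
abbentu singulari maschili (puru scrittu abbientu)
Traduzzioni
talianu: riposo maschili, quiete f, pace f
"""

# Gender markers that may follow an Italian translation, in full or as initials.
GENDER_MARKERS = {"maschili", "m", "f"}


def parse_entry(text):
    """Extract title, category, alternative forms and Italian translations."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entry = {"title": lines[0], "category": lines[1], "variants": [], "translations": []}
    for line in lines[2:]:
        # Alternative forms are enclosed in parentheses; "puru scrittu" is optional.
        for group in re.findall(r"\(([^)]*)\)", line):
            group = re.sub(r"^puru scrittu\s+", "", group)
            entry["variants"] += [v.strip() for v in group.split(",") if v.strip()]
        # Translations follow the label "talianu"; their number varies.
        match = re.match(r"talianu\s*:\s*(.+)", line)
        if match:
            for chunk in match.group(1).split(","):
                words = [w for w in chunk.split() if w not in GENDER_MARKERS]
                if words:
                    entry["translations"].append(" ".join(words))
    return entry


entry = parse_entry(SAMPLE_ENTRY)
print(entry)
```

Treating the gender markers as an explicit set keeps the variability mentioned above (full words versus initials) in one place, so further markers can be added without touching the parsing logic.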

3.2. Data Modeling and Linking

The Sicilian entries were used to build the Sicilian LB. Lemmas are described with the OntoLex model in conjunction with the LiLa ontology. The latter provides a structured representation of the linguistic features of each lemma, including part-of-speech classification, via the lila:hasPOS property, and grammatical gender, via the lila:hasGender property. The total number of lemmas in the Sicilian LB is 10,232. The discrepancy with respect to the number of entries in the Wikizziunariu (see Table 1) is primarily due to the fact that some of them are written representations, rather than distinct standalone lemmas. The following RDF triples, expressed in Turtle syntax, represent the Sicilian lemma middeu,8 classified as a masculine noun. It includes multiple written representations (amiddeu, amoddei, middeu, muddeu, muddìu), each annotated with the language ISO tag @scn. These forms are considered orthographic or graphical variants of the same lemma and do not affect its morphological interpretation; all share the same grammatical gender (masculine). Additionally, the lemma is related to a lemma variant identified by a URI9 corresponding to the lemma muddìa.10 In our example, middeu and muddìa can be used alternatively but they differ in gender, the second being a feminine noun.

    <http://liita.it/data/id/DialettoSiciliano/lemma/753>
        a lila:Lemma ;
        lila:hasGender lila:masculine ;
        lila:hasPOS lila:noun ;
        lila:lemmaVariant <http://liita.it/data/id/DialettoSiciliano/lemma/1010> ;
        dcterms:isPartOf <http://liita.it/data/id/DialettoSiciliano/lemma/LemmaBank> ;
        rdfs:label "middeu" ;
        ontolex:writtenRep "amiddeu"@scn, "amoddei"@scn, "middeu"@scn, "muddeu"@scn, "muddìu"@scn .

8 With URI: http://liita.it/data/id/DialettoSiciliano/lemma/753
9 http://liita.it/data/id/DialettoSiciliano/lemma/1010
10 The property lila:lemmaVariant relates two lemmas that are semantically related to one another but differ in some linguistic feature, such as gender or number.

Subsequently, the bilingual glossary was modeled. The Sicilian lexical entries were linked to the corresponding lemmas in the Sicilian LB via the ontolex:canonicalForm property. The Italian translations were connected to the Italian LB developed within the LiITA project using the same property. Furthermore, the lexical entries of the two languages were directly related through the vartrans:translatableAs property, which establishes a correspondence between translations. The following RDF triples define a lexical entry in Italian for the word frassino (ash), associated with a canonical form which represents the corresponding lemma in the LiITA LB. Furthermore, this entry is linked to its corresponding Sicilian lexical entry (middeu), establishing a cross-lingual correspondence between the Italian and Sicilian lexical resources.

    <http://liita.it/data/LexicalResources/DialettoSiciliano/id/LexicalEntry/italian/328>
        a ontolex:LexicalEntry ;
        rdfs:label "Lexical entry of Italian: frassino" ;
        ontolex:canonicalForm <http://liita.it/data/id/lemma/993692> ;
        vartrans:translatableAs <http://liita.it/data/LexicalResources/DialettoSiciliano/id/LexicalEntry/siciliano/753> .

Figure 2 displays the lemma frassino (ash) as it appears in the LiITA LB, together with information regarding its grammatical gender (masculine) and part of speech (common noun). The node is linked to the lexical entries in the linked lexical resources through the property ontolex:canonicalForm. In particular, there are six entries connected via the vartrans:translatableAs property related to the Sicilian dictionary and one related to the dialect of Parma. The visualization also shows the lemmaVariant relation between middeu and muddìa.

The linking process with the Italian LB was conducted in two distinct phases. In the initial phase, an automatic alignment was performed between the string of each translation of a Sicilian glossary entry and those recorded in the Italian LB, taking the part of speech into account. This procedure successfully accounted for 55% of the entries. An additional 19% of entries were identified as ambiguous, i.e., a single Italian entry corresponded to multiple lemmas within the LB, thus requiring manual disambiguation. For instance, the entry caglio, whose Sicilian translation is quagghialatti, could be linked either to the lemma identified by the URI http://liita.it/data/id/lemma/972573, corresponding to the meaning “rennet”, or to http://liita.it/data/id/lemma/972574, which refers to a type of herb or artichoke. To resolve such ambiguities, additional information was consulted from Wikizziunariu or other Sicilian-language dictionaries.

Currently, 26% of the entries lack a corresponding link to the Italian LB. These terms include, among others, feminine or plural forms absent from the LB, as well as culturally specific terms unique to the Sicilian context, such as spènsiri translated as largo mantello utilizzato dai contadini (a wide cloak worn by peasants) or carpita translated as coperta rustica tessuta con ritagli di stoffa (a rustic blanket woven from fabric scraps).
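The automatic phase of the linking can be pictured as a lookup keyed on the translation string and its part of speech. The sketch below is a minimal illustration under that assumption. The lemma URIs for frassino and caglio are the ones quoted in this section, while the index layout, the function name link_translation, and the UPOS tags assigned to the three glossary items are hypothetical and do not reproduce the authors' implementation.

```python
from collections import Counter, defaultdict

# Toy excerpt of the Italian Lemma Bank: (written form, UPOS) -> lemma URIs.
# The URIs come from the text; the index structure is an assumption.
ITALIAN_LB = defaultdict(list)
for form, upos, uri in [
    ("frassino", "NOUN", "http://liita.it/data/id/lemma/993692"),
    ("caglio", "NOUN", "http://liita.it/data/id/lemma/972573"),
    ("caglio", "NOUN", "http://liita.it/data/id/lemma/972574"),
]:
    ITALIAN_LB[(form, upos)].append(uri)


def link_translation(translation, upos):
    """Return (status, candidate URIs) for one Italian translation."""
    candidates = ITALIAN_LB.get((translation.lower(), upos), [])
    if len(candidates) == 1:
        return "linked", candidates
    if len(candidates) > 1:
        return "ambiguous", candidates  # left for manual disambiguation
    return "unlinked", []


# (Sicilian entry, Italian translation, UPOS); the UPOS values are assumed here.
glossary = [
    ("middeu", "frassino", "NOUN"),
    ("quagghialatti", "caglio", "NOUN"),
    ("carpita", "coperta rustica tessuta con ritagli di stoffa", "NOUN"),
]
statuses = Counter(link_translation(it, upos)[0] for _, it, upos in glossary)
print(dict(statuses))
```

Keeping the three outcomes separate mirrors the proportions reported above: exact single matches are linked automatically, multiple matches are queued for manual review, and descriptive multiword translations typically remain unlinked.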

4. Case Studies

Using SPARQL queries, it is possible to extract linguistically meaningful information from multiple perspectives.11

11 Queries can be found in the GitHub repository: https://github.com/LiITA-LOD/LocalVarieties/tree/main/Siciliano. They can be tested on the following endpoint: https://liita.it/sparql.

For instance, one can retrieve Sicilian lemmas having written representations beginning with a d and an r; the complementary distribution [d] ∼ [r] is especially attested in the western variant from Palermo when those sounds appear in intervocalic or initial position. Among such cases is the lemma dicembri (December) (< Latin *DECEMBRE(M) ∼ *DECEMBRU(M)) that witnesses several written representations, namely (a) dicièmmuru, (b) dicèmmiru, (c) dicembru, (d) dicèmmuru, (e) dicièmmiru, (f) ricièmmiru and (g) ricièmmuru. The lemmas (d) and (f) indeed show the aforementioned allophony [d] ∼ [r] but there are also other interesting phenomena. The lemmas (a) and (e) show the metafonesi (umlaut) in tonic syllables, i.e. a process of vowel assimilation; the lemmas (a), (b), (d), (e), (f) and (g) witness the lag assimilation of Latin *MB > Sicilian MM [19]. Finally, the lemmas (a), (d) and (g) attest a u-vowel, while the lemmas (b), (e) and (f) an e-vowel: these are epentheses, thus random insertions of one or more sounds to favor the pronunciation.

It is also possible to search for lemmas having written representations that include ed or ied, an alternation which graphically renders the umlaut of vowels in tonic syllables. This is a significant linguistic phenomenon in Sicilian, serving as a marker for distinguishing dialectal variants. It is generally attested in central and western regions of the island, while it is absent in the north-eastern areas. For example, in the Sicilian word (a) aceddu (bird) (< Latin *AU(I)CELLU(M)), the actual pronunciation of dd is retroflexed as ḍḍ [ɖː] but it is here not represented [14]. This feature is contained in all the following written representations, that is (b) acieddu, (c) ancieddu and (d) oceddu. The tonic syllable is the middle one and witnesses either (1) no changes in lemmas (a) and (d) deriving from Latin *-CE- or (2) umlauted vowels in lemmas (b) and (c), both bisyllabic [ˈɪ.e]. The same phenomenon occurs with, among others, (ab)bruciareddu ∼ (ab)bruciarieddu (ripe ear), beddu ∼ bieddu (beautiful), ciuceddu ∼ ciucieddu (soup, broth, delicacy), frateddu ∼ fratieddu (brother), marzamareddu and mazzamareddu ∼ marzama(u)rieddu, mazzumaurieddu and mazzamarieddu (whirlwind, whirlpool, demon), munzeddu ∼ munzieddu (stack, pile), pisciteddu ∼ piscitieddu (small fish).

As for morphology, we can search for nouns ending with -ìa (< Greek -ía), an abstract suffix which is one of the most common and attested. We can thus notice that the Sicilian suffix is variously represented in Italian:

1. -ia (the same Greek -ía suffix forming abstract nominals), as in ancarìa ∼ angheria (vexation), and magarìa ∼ stregoneria (witchcraft);
2. -ità (< Latin *-ITÁ(TEM)), as in avracìa ∼ altezzosità (haughtiness) and liccum(ar)ìa ∼ golosità (delicacy);
3. -ezza (< Latin *-ITIA(M) ∼ *-ITIES), as in laccanìa ∼ debolezza (weakness);
4. various other abstractivizing suffixes, such as -eccio, -io, -enza, -ita (with the accent on the antepenultimate syllable).

5. Experiments

Beyond the specific linguistic analyses enabled by interoperability, such as those presented in Section 4, the data we provide can support a variety of experimental applications. A couple of examples are given in the following subsections.

5.1. How much Sicilian do LLMs know?

The bilingual glossary may be used to assess the ability of Large Language Models (LLMs) to translate from Sicilian into Italian. Specifically, we randomly selected 20 nouns, 20 adjectives, 20 verbs, and 20 adverbs, and prompted the main commercially available LLMs to translate each word into Italian. We chose to focus on commercial systems (namely, ChatGPT, Gemini, and Claude) because they are the most widely used by non-experts due to their user-friendly interfaces. A simple zero-shot prompt was employed uniformly across all models: Traduci ogni parola dal siciliano all’italiano (Translate each word from Sicilian to Italian). The responses were compared against the translations provided in the glossary and were also evaluated by one of the authors, a linguist and native speaker of Sicilian. This additional human evaluation was intended to determine whether certain translations, even if not identical to those recorded in the resource, could nonetheless be considered acceptable. For example, while the adjective bacioccu is officially translated only as sempliciotto (nitwit), the alternatives sciocco (foolish) (proposed by GPT-4o) and tonto (dumb) (provided by Claude Sonnet 4) were considered equally valid. Table 2 presents the results of this evaluation in terms of (synonym-aware) accuracy.

Table 2 shows that accuracy remains modest even with synonym tolerance. Gemini 2.5 Flash tops the list at 67% accuracy, about 6 points ahead of GPT o3 and roughly 15 points above Claude Sonnet 4 (51%) and GPT-4o (52%). Even the best-performing model thus mistranslates roughly one word out of three, underscoring how low-resource dialects remain challenging for general-purpose systems. An interesting case is that of GPT o3, which, during the reasoning process, retrieves information from the Web. For certain translations, it explicitly cites its sources, including the Wikizziunariu, the vocabulary published on the TerraLab blog,12 and the lexicon curated by the group Salviamo il siciliano.13 This approach leads to better accuracy than the GPT-4o model but still lower than that of Gemini 2.5 Flash.

12 https://www.terralab.it
13 http://www.salviamoilsiciliano.com

Two noteworthy observations can be drawn from Figure 3, a bar chart of the accuracy calculated for each part of speech. First, verbs consistently emerge as the most challenging grammatical category to translate across all four models. Second, GPT o3 and Gemini 2.5 Flash exhibit relatively stable performance across categories, whereas Claude Sonnet 4 and GPT-4o show greater variability. However, given the limited sample size of only 80 items, the results are subject to a high sampling error, and the observed differences are not statistically significant. Future work should expand the benchmark and incorporate a broader range of dialectal variants to enable more robust evaluation.

Error analysis shows that 18 words were incorrectly translated by all the systems. More generally, all models exhibit a systematic tendency to infer translations on the basis of superficial orthographic similarity between the Sicilian lemma and a resembling Italian word, which is then selected as the output. For example, mbròcculi is rendered as broccoli, although its actual meaning is moina (flattery), and pisuliddu is rendered as pisellino (little pea), whereas the intended sense is permaloso (touchy).
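The synonym-aware accuracy used in this evaluation can be read as the share of items whose model output matches either a glossary translation or an alternative accepted by the human evaluator. The following sketch assumes that reading. The function name and data layout are illustrative, and the predictions are toy values built from the examples quoted in the text rather than a reproduction of any actual model run.

```python
def synonym_aware_accuracy(predictions, gold, accepted=None):
    """Fraction of items whose prediction is a gold or human-accepted translation."""
    accepted = accepted or {}
    correct = 0
    for word, output in predictions.items():
        valid = {t.lower() for t in gold[word]}
        valid |= {t.lower() for t in accepted.get(word, ())}
        if output.strip().lower() in valid:
            correct += 1
    return correct / len(predictions)


# Glossary translations (gold) and synonyms accepted by the native-speaker evaluator.
gold = {
    "bacioccu": ["sempliciotto"],
    "mbròcculi": ["moina"],
    "pisuliddu": ["permaloso"],
}
accepted = {"bacioccu": ["sciocco", "tonto"]}

# Toy outputs for illustration only.
predictions = {"bacioccu": "sciocco", "mbròcculi": "broccoli", "pisuliddu": "permaloso"}

strict = synonym_aware_accuracy(predictions, gold)
tolerant = synonym_aware_accuracy(predictions, gold, accepted)
print(f"strict={strict:.2f} tolerant={tolerant:.2f}")
```

Computing both the strict and the tolerant score makes it easy to see how much of the reported accuracy depends on the human judgment of acceptable synonyms.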

5.2. Evaluating Bilingual Lexicon Induction

A second experiment used the bilingual glossary to build cross-lingual word embeddings and to evaluate the resulting mapped vectors on the Bilingual Lexicon Induction (BLI) task. Irvine and Callison-Burch [20] define BLI as “the task of inducing word translations from monolingual corpora in two languages.” Although recent work has introduced solutions based on LLMs [21, 22], one of the most widely adopted methods is still to align embeddings trained separately on monolingual corpora into a shared vector space. We therefore applied vecmap14 [23] in its supervised mode to map Sicilian and Italian fastText embeddings.15 The glossary was partitioned into training and test sets using a 90:10 ratio after removing homographs and Sicilian lemmas whose Italian equivalents were multi-token expressions, yielding 9,698 Sicilian–Italian pairs for training and 1,079 pairs for testing. Evaluation with nearest-neighbor retrieval (with k=10) resulted in an accuracy of 19.8% (coverage=50.6%). With Cross-domain Similarity Local Scaling (CSLS) retrieval, a cosine-similarity variant that attenuates the hubness problem, namely the tendency of a small subset of vectors to appear disproportionately often as nearest neighbors of other points [24], the result is even lower, i.e., 14.68%. These low scores suggest that, although a seed dictionary of more than 9.6K pairs is sizeable for a low-resource variety such as Sicilian, many words remain out-of-vocabulary in the pretrained embeddings, as the coverage of 50.6% also indicates.
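To make the two retrieval criteria concrete, the sketch below implements nearest-neighbor and CSLS retrieval over already-mapped vectors in plain Python. It follows the usual CSLS definition, CSLS(x, y) = 2·cos(x, y) − r_T(x) − r_S(y), where r_T(x) and r_S(y) are the mean cosines to the k nearest neighbors in the other language. The two-dimensional vectors, the word pairs, the function names, and the reading of coverage as the share of test pairs whose words both have a vector are all assumptions made for illustration; this is not the vecmap code.

```python
import math


def cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_top_k(values, k):
    """Mean of the k largest values (or of all values if fewer than k)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def retrieve(src, trg, k=10, csls=False):
    """Map each source word to its best target word by cosine or CSLS score."""
    sims = {s: {t: cos(sv, tv) for t, tv in trg.items()} for s, sv in src.items()}
    if csls:
        r_src = {s: mean_top_k(row.values(), k) for s, row in sims.items()}
        r_trg = {t: mean_top_k([sims[s][t] for s in src], k) for t in trg}
        sims = {s: {t: 2 * c - r_src[s] - r_trg[t] for t, c in row.items()}
                for s, row in sims.items()}
    return {s: max(row, key=row.get) for s, row in sims.items()}


def evaluate(pred, test_pairs, trg_vocab):
    """Return (coverage, accuracy); accuracy is computed on covered pairs only."""
    covered = [(s, t) for s, t in test_pairs if s in pred and t in trg_vocab]
    hits = sum(pred[s] == t for s, t in covered)
    accuracy = hits / len(covered) if covered else 0.0
    return len(covered) / len(test_pairs), accuracy


# Toy mapped embeddings (invented values); "frateddu" has no vector, i.e. it is OOV.
src = {"aceddu": (1.0, 0.1), "beddu": (0.2, 1.0)}
trg = {"uccello": (0.9, 0.3), "bello": (0.1, 0.9)}
test_pairs = [("aceddu", "uccello"), ("beddu", "bello"), ("frateddu", "fratello")]

nn_pred = retrieve(src, trg)
csls_pred = retrieve(src, trg, csls=True)
coverage, accuracy = evaluate(nn_pred, test_pairs, trg)
print(nn_pred, csls_pred, round(coverage, 2), accuracy)
```

Reporting coverage next to accuracy, as in the experiment above, separates errors caused by missing vectors from errors caused by a poor mapping.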

14 https://github.com/artetxem/vecmap
15 https://fasttext.cc/docs/en/crawl-vectors.html

6. Conclusions

This work represents a step toward the integration of the Sicilian dialect into the ecosystem of Linguistic Linked Open Data [25]. By modeling and publishing a bilingual Sicilian–Italian glossary extracted from Wikizziunariu, and by aligning it with the LiITA LB through established ontologies such as OntoLex-Lemon and LiLa, we provide a reusable, interoperable lexical resource that promotes the visibility and accessibility of Sicilian in digital environments. The two preliminary NLP experiments, evaluating LLMs’ translation capabilities and testing BLI, highlight both the potential and the current limitations of applying computational methods to under-resourced varieties.

Future work will proceed along multiple directions. First, we plan to model and integrate additional Sicilian resources, with particular attention to Antonino Traina’s Nuovo vocabolario siciliano–italiano, which is already available in digital format. Second, we aim to broaden the scope of the LiITA KB by incorporating resources from other dialects. An expanded multilingual dataset will enhance interoperability and enable richer cross-lingual analyses. Third, we intend to link textual resources to the LB. However, this will require reliable lemmatization procedures, a non-trivial task for dialects with non-standardized orthographies and scarce annotated corpora. Finally, we plan to extend the range and depth of NLP experiments to evaluate downstream tasks with the goal of advancing computational support for Italy’s linguistic diversity.

Acknowledgments

This contribution is partly funded by the European Union - Next Generation EU, Mission 4 Component 1 CUP J53D2301727OOO1. The PRIN 2022 PNRR project “LiITA: Interlinking Linguistic Resources for Italian via Linked Data” is carried out jointly by the Università Cattolica del Sacro Cuore, Milano and the Università di Torino.

Declaration on Generative AI

During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Text translation. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.

References

[22] [...] Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 743-753. URL: https://aclanthology.org/2024.acl-short.67/. doi:10.18653/v1/2024.acl-short.67.
[23] M. Artetxe, G. Labaka, E. Agirre, Learning bilingual word embeddings with (almost) no bilingual data, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 451-462.
[24] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, H. Jégou, Word translation without parallel data, arXiv preprint arXiv:1710.04087 (2017).
[25] P. Cimiano, C. Chiarcos, J. P. McCrae, J. Gracia,