<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ciallabacialla! Modeling and Linking a Regional Lexical Resource to Include Sicilian in the Semantic Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rachele Sprugnoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Moretti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Giuseppe Muscianisi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleonora Litta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università Cattolica del Sacro Cuore</institution>
          ,
          <addr-line>Largo Gemelli, 1, 20123 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Parma</institution>
          ,
          <addr-line>Via D'Azeglio, 85, 43125 Parma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper describes the inclusion of Sicilian in the Semantic Web through the development of new resources aligned with Linguistic Linked Open Data principles. More specifically, we model and publish the first Sicilian Lemma Bank and a bilingual Sicilian-Italian glossary extracted from the Sicilian Wiktionary (Wikizziunariu). These resources are formalized using the OntoLex-Lemon and LiLa (Linking Latin) ontologies with the aim of enabling cross-lingual interoperability. The glossary is also linked to the LiITA (Linking Italian) knowledge base. In addition, two preliminary experiments are reported: the first evaluates the translation capabilities of commercial Large Language Models (LLMs) from Sicilian into Italian; the second investigates bilingual lexicon induction through cross-lingual embedding alignment, with results indicating the challenges posed by low-resource dialects. This work aims to demonstrate the feasibility and importance of integrating under-resourced languages into broader Computational Linguistics and Semantic Web infrastructures.</p>
      </abstract>
      <kwd-group>
        <kwd>Sicilian</kwd>
        <kwd>Linguistic Linked Open Data</kwd>
        <kwd>Semantic Web</kwd>
        <kwd>lexical resources</kwd>
        <kwd>dialectology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>The LiITA (Linking Italian) project is dedicated to developing an interoperable Knowledge Base (KB) for Italian linguistic resources. Its primary goal is to construct a network that interconnects diverse Italian language datasets (such as dictionaries, lexicons, and textual corpora) by publishing them as Linked Open Data (LOD). At the core of LiITA is the Lemma Bank (LB), a continually expanding repository of canonical citation forms (lemmas) for Italian words [<xref ref-type="bibr" rid="ref7">1</xref>]. The LB functions as a central hub, enabling interlinking and interoperability across various linguistic datasets. By aligning lexical entries and word occurrences from distributed resources with their corresponding lemmas, LiITA supports federated search capabilities and facilitates advanced linguistic analyses.</p>
        <p>LiITA adopts the OntoLex-Lemon [2] model as its foundational standard for the representation of lexical resources. This ensures that data is structured according to widely accepted Semantic Web principles, thereby promoting interoperability and reusability. OntoLex-Lemon provides a framework for linking lexical entries to their meanings and to related linguistic properties. LiITA utilizes this framework to establish connections between lemmas in the LB, their occurrences in texts, and their corresponding entries in lexicons and dictionaries.</p>
        <p>Although the LiITA Knowledge Base primarily focuses on resources related to the Italian language, it is important to acknowledge that Italy is home to a rich array of local languages. Many of these are endangered, predominantly oral, and often lack standardized orthographies. A recent paper [3] offers a critical examination of Italy’s linguistic landscape, challenging mainstream Natural Language Processing (NLP) approaches. The study highlights the fragmented and underdeveloped state of NLP research for many Italian language varieties. Given that language inherently encodes local knowledge, cultural traditions, and historical memory, the loss of these varieties entails a significant erosion of cultural heritage. Despite this, the language varieties of Italy are increasingly represented in multilingual NLP initiatives. These include participation in shared tasks on morphological inflection and on language identification (see for example [4]). Additional contributions include cross-lingual word embeddings for low-resource settings and the inclusion of Italian varieties like Lombard, Piedmontese, and Sicilian in multilingual pretrained language models, such as mBERT [5].<sup>1</sup> However, these varieties remain under-represented in terms of training data volume and quality. Moreover, multilingual NLP still tends to treat language varieties "monolithically", without adequate consideration for their distinct orthographic conventions, sociolinguistic contexts, or community-specific needs.</p>
        <p>In this light, the integration of bilingual dictionaries and other lexical resources for Italy’s minority languages into the LiITA LOD framework would represent a concrete step toward supporting these under-resourced languages. Such inclusion would enhance their digital visibility, promote accessibility, and contribute to the broader goal of preservation and exchange of information on linguistic diversity. The first bilingual glossary to be included in the LiITA KB was the one published in the Vocabolario della lingua parmigiana [6]: data in RDF and CSV format together with a set of SPARQL queries are available online [7].<sup>2</sup> This paper, instead, concerns the modeling and linking of the Sicilian Wiktionary (Wikizziunariu). More specifically, this paper provides the following three contributions:</p>
        <list list-type="order">
          <list-item><p>the modeling of the first Lemma Bank for Sicilian and of a Sicilian-Italian glossary extracted from the Wikizziunariu according to the Linguistic Linked Open Data principles;<sup>3</sup></p></list-item>
          <list-item><p>the linking of the glossary to the KB of the LiITA project;<sup>4</sup></p></list-item>
          <list-item><p>the results of two preliminary NLP experiments using the aforementioned bilingual glossary.</p></list-item>
        </list>
        <p><sup>1</sup> See [3] for other bibliographical details about these recent efforts.</p>
        <p><sup>2</sup> https://github.com/LiITA-LOD/LocalVarieties/tree/main/Parmigiano</p>
        <p><sup>3</sup> The data in both Turtle RDF and TSV format are available on a dedicated GitHub together with a set of ready-to-use queries: https://github.com/LiITA-LOD/LocalVarieties/tree/main/Siciliano</p>
        <p><sup>4</sup> http://www.liita.it</p>
        <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. * Corresponding author. † This paper is the result of the collaboration between the authors. For the specific concerns of the Italian academic attribution system, Rachele Sprugnoli is responsible for sections 3, 5, 6; Eleonora Litta for section 1; Domenico Giuseppe Muscianisi for sections 2 and 4 (the latter with the aid of Rachele Sprugnoli). Section 4 was collaboratively written by Rachele Sprugnoli and Giuseppe Muscianisi. Giovanni Moretti is responsible for the technical implementation of the Sicilian Lemma Bank and of the modeling of the bilingual lexicon.</p>
        <p>Email: rachele.sprugnoli@unicatt.it (R. Sprugnoli); giovanni.moretti@unicatt.it (G. Moretti); domenicogiuseppe.muscianisi@unipr.it (D. G. Muscianisi); eleonoramaria.litta@unicatt.it (E. Litta). ORCID: 0000-0001-6861-5595 (R. Sprugnoli); 0000-0001-7188-8172 (G. Moretti); 0000-0002-2964-856X (D. G. Muscianisi); 0000-0002-0499-997X (E. Litta).</p>
        <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
        <sec>
          <title>2. The Sicilian Dialect</title>
          <p>Dialects constitute an essential component of Italy’s linguistic heritage. In this study, it is important to clarify the intended meaning of the term dialect, which corresponds to the Italian dialetto, i.e., a regional or areal language that is genealogically a sister language to so-called Standard Italian, as defined in the Vocabolario Treccani<sup>5</sup> (see also [8]). The dialects of Italy are, in fact, independent Romance languages that, over the centuries, have become minoritized local varieties. This shift is primarily attributable to the prestige and diffusion of the volgare fiorentino following the works of Dante, Petrarch, and Boccaccio, whose literary influence from the 14th century onward played a central role in shaping the literary language of the Italian Peninsula and, eventually, the codification of present-day Standard Italian. Today, Standard Italian functions as the roof-language for the various Italo-Romance dialects spoken across the country [9]. However, the medieval Sicilian volgare was among the first Romance varieties to be used as a literary language, particularly at the court of Emperor Frederick II of Swabia, who established his principal seat in Palermo between 1220 and 1250. During this period, poetry and the arts flourished, giving rise to the Scuola poetica siciliana, which Dante, in his De vulgari eloquentia, regarded as the earliest manifestation of an "Italian" literary tradition.</p>
          <p>According to the Carta by Giovan Battista Pellegrini, the Sicilian dialect is placed as group III (siciliano) among the Extreme Southern dialects of Italy (henceforth abbreviated as ESI, or Meridionale Estremo in Italian), with seven varieties based on the presence of umlaut (metafonesi), namely Western Sicilian, Central umlauted area, South-Eastern umlauted area, Original non-umlauted area, Messinese, Aeolian and Pantesco. This classification, along with others that have been proposed (see, for example, [10]), highlights the structural and sociolinguistic complexity of the Sicilian dialect. Moreover, due to its geographical location at the crossroads of the Mediterranean, Sicily has historically been (and continues to be) a site of intense cultural, communal, and linguistic contact [11]. Although the Sicilian dialect retains its core Italo-Romance structural features, it has undergone significant stratification due to successive waves of linguistic contact from Late Antiquity through the Middle Ages. Early layers include influences from (Byzantine) Greek, particularly in eastern Sicily, and from Sicilian Arabic in the west. Subsequent periods of contact include the Norman era (10th–12th centuries) and the reign of Frederick II (ending in 1266), followed by the Angevin rule and the Sicilian Vespers (1282), which introduced Gallo-Romance elements. Later, during the Aragonese and Spanish periods (14th–17th centuries), further Ibero-Romance influences were integrated into the language. Following the medieval period, Sicilian dialects began to evolve into their modern forms.</p>
          <p>In addition, various linguistic minority communities have historically settled in Sicily. The oldest still active is that of Piana degli Albanesi, the largest Arbëreshë (Italo-Albanian) settlement on the island, established at the end of the 16th century. Another notable case is the Gallo-Italic of Sicily, comprising approximately 15 isolated communities in central and eastern Sicily, whose origins trace back to the Norman period. A third group is the Sicilian Greek community in Messina, officially recognized as a linguistic minority in 2012, which descends from settlers who migrated from the Peloponnese in the mid-16th century. Today, the varieties of Sicilian spoken in these areas exhibit significant influence from these non-Italo-Romance minority languages. The long and complex sociolinguistic history of the Sicilian dialect, together with its internal variation and multilingual contact layers, renders it a particularly rich and compelling subject for investigation through computational methods.</p>
          <p><sup>5</sup> https://www.treccani.it/vocabolario/dialetto_res-545debd7-0018-11de-9d89-0016357eee51/</p>
        </sec>
        <sec id="sec-1-1-1">
          <title>2.1. Dictionaries and Grammars of Sicilian</title>
          <p>Given this history, studies of the dialects of Sicily, covering both language and culture, have a long tradition dating back to the Middle Ages. However, for a comprehensive understanding of the present-day language, the most informative period for the study of Sicilian begins with Italian Romanticism, specifically in the mid to late 19th century. Shortly after the Unification of Italy, Antonino Traina published the Nuovo vocabolario siciliano–italiano, a dictionary lemmatized according to Sicilian entries, which provided Italian translations as well as phraseological examples drawn from idiomatic expressions and literary sources, encompassing both cultivated and popular registers [12]. As was typical of the period, Traina’s underlying objective was to promote the Tuscan-based national language, thereby contributing to the broader project of fostering social and linguistic unification among the newly formed Italian citizenry. In the same period, the most influential scholar of the Sicilian language and cultural traditions was Giuseppe Pitrè, author of the monumental Fiabe, novelle e racconti popolari siciliani [13] and Grammatica Siciliana [14]. In his linguistic work, Pitrè approached Sicilian as a Romance language in its own right, analyzing its phonology diachronically from Latin without reference to Tuscan (i.e., Italian), which he explicitly treated as a separate variety rather than a standard of comparison. Both Traina and Pitrè promoted a spelling standardization rooted in Latin orthographic principles. This approach had a dual effect: on the one hand, it contributed to the definition of a kind of Sicilian koine (common language); on the other hand, it introduced a bias towards the Latinization of Sicilian [15]. This process of standardization continues to play a fundamental role today. In 2024, the Cademia Siciliana (Sicilian Academy) published the Documento per l’ortografia del siciliano (Document for the spelling of Sicilian), which aims to be accessible to those who want to write in Sicilian. On the scientific and academic side, the most important linguistic and ethnographic research on Sicilian consists of the pioneering investigation by Franco Fanciullo on the Aeolian Islands [16].</p>
          <p>Besides the Dictionary by Traina, two other fundamental lexicographic resources for the Sicilian dialect are the Vocabolario storico-etimologico del siciliano and the Vocabolario siciliano, both published on paper by Centro di studi filologici e linguistici siciliani. As far as digital dictionaries are concerned, there is the Vocabolario del siciliano medievale<sup>6</sup> of the University of Catania, which collects lemmas of the volgare siciliano from the mid 13th to the mid 16th century and provides a Web interface [17]. Within this context of rich historical and linguistic tradition, Wikizziunariu emerges as a collaborative resource that is easily accessible, machine-readable, and free from copyright restrictions.</p>
          <p><sup>6</sup> http://artesia.unict.it/vocabolario</p>
        </sec>
        <sec>
          <title>3. Workflow</title>
          <p>This work was carried out in two main phases. The first involved parsing a dump of the Sicilian Wiktionary (Wikizziunariu) to extract information relevant to our objectives. The second phase focused on modeling and creating resources in RDF format. This latter step includes the construction of a Sicilian LB, the transformation of Wiktionary data into RDF triples, and the linking of Italian translations to the LiITA LB developed within the LiITA project.</p>
        </sec>
        <sec>
          <title>3.1. Data Extraction</title>
          <p>The dump of the Sicilian Wiktionary, downloaded from the Academic Computer Club archive in Umeå,<sup>7</sup> was parsed using a custom script designed to extract relevant data. Figure 1 illustrates the structure of an entry from which the following elements were retrieved: the page title (abbentu), the grammatical category (Sustantivu, i.e., common noun), number and gender (singulari maschili), alternative forms (puru scrittu abbientu), and the Italian translation(s) (i.e., values following the label talianu in the Traduzzioni section, such as riposo, quiete, pace).</p>
          <p>The main challenge in the extraction process stemmed from the variability in how information is structured across entries. For example, number and gender may be represented using initials (e.g., s for sostantivo, noun, m for masculine, and f for feminine). Furthermore, while alternative forms are always enclosed in parentheses, they are not always preceded by the phrase puru scrittu, and the number of translations varies. In some cases, these translations are accompanied by information about the grammatical gender of the Italian equivalents (e.g., maschili and f, as shown in the figure).</p>
          <p>A total of 14,464 entries were extracted through this process, distributed across 20 distinct classes. Twelve of these correspond to traditional grammatical categories: adjectives, adverbs, articles, coordinating conjunctions, interjections, common nouns, proper nouns, numerals, prepositions, pronouns, subordinating conjunctions, and verbs. In addition, the entries included acronyms, confixes, prefixes, suffixes, nominal phrases, multiword expressions, proverbs, and conjugated verb forms. These latter entries were not included in the subsequent stages of the work, as they cannot be directly mapped to a LB. Table 1 presents the final number of entries considered for each grammatical category and provides examples for each category; the original categories have been converted into UPOS (Universal Dependencies Part of Speech) tags [18]. The low number of determiners (DET) is due to the fact that, in the original classification, this category includes only articles, while other types of determiners are assigned to different classes; for example, possessive determiners are categorized as adjectives or pronouns.</p>
          <table-wrap>
            <label>Table 1</label>
            <caption><p>Examples for each grammatical category (UPOS tag).</p></caption>
            <table>
              <thead>
                <tr><th>UPOS</th><th>Examples</th></tr>
              </thead>
              <tbody>
                <tr><td>NOUN</td><td>puntaperi (kick), ràrica (root)</td></tr>
                <tr><td>VERB</td><td>aççiari (to find), studiari (to study)</td></tr>
                <tr><td>ADJ</td><td>nastenti (stubborn), sicilianu (Sicilian)</td></tr>
                <tr><td>ADV</td><td>nsièmmula (together), viatu (soon)</td></tr>
                <tr><td>ADP</td><td>cu (with), nt’a (in the)</td></tr>
                <tr><td>PRON</td><td>iddi (them), nui (we)</td></tr>
                <tr><td>NUM</td><td>cincu (five), sìrici (sixteen)</td></tr>
                <tr><td>PROPN</td><td>Cìfaru (Lucifer), Aropa (Europe)</td></tr>
                <tr><td>INTJ</td><td>olè, osara</td></tr>
                <tr><td>DET</td><td>nu (a/an), lu (the)</td></tr>
                <tr><td>SCONJ</td><td>mentri (while), pirchistu (therefore)</td></tr>
                <tr><td>CCONJ</td><td>anchi (also), nì (neither)</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p><sup>7</sup> https://hammurabi.ftp.acc.umu.se/mirror/wikimedia.org/dumps/backup-index.html</p>
        </sec>
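        <p>The variability in entry structure described for the data extraction step can be illustrated with a minimal Python sketch. This is not the authors' custom script: the function names, the regular expression, and the label table are hypothetical, and they cover only the conventions mentioned in the text (alternative forms in parentheses, optionally introduced by puru scrittu; maschili, m, and f as gender labels). The comma-separated handling of several variants is an additional assumption.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical lookup table: only the labels mentioned in the text are covered.
GENDER_LABELS = {
    "maschili": "masculine",
    "m": "masculine",
    "f": "feminine",
}

# Alternative forms sit in parentheses, optionally introduced by "puru scrittu".
ALT_FORMS_RE = re.compile(r"\((?:puru scrittu\s+)?([^()]+)\)")


def normalize_gender(label):
    """Map a raw gender label to a normalized value, or None if unknown."""
    return GENDER_LABELS.get(label.strip().lower())


def extract_alternative_forms(text):
    """Return the alternative spellings found in the first parenthesized group."""
    match = ALT_FORMS_RE.search(text)
    if match is None:
        return []
    # Assumption: several variants, if present, are separated by commas.
    return [form.strip() for form in match.group(1).split(",") if form.strip()]


print(extract_alternative_forms("abbentu (puru scrittu abbientu)"))
print(normalize_gender("maschili"))
```
]]></preformat>
        <p>Keeping the label table separate from the parsing logic means that newly observed abbreviations can be added without touching the regular expression.</p>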
        <sec id="sec-1-1-2">
          <title>3.2. Data Modeling and Linking</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <p>The Sicilian entries were used to build the Sicilian LB. Lemmas are described with the OntoLex model in conjunction with the LiLa ontology. The latter provides a structured representation of the linguistic features of each lemma, including part-of-speech classification, via the lila:hasPOS property, and grammatical gender, via the lila:hasGender property. The total number of lemmas in the Sicilian LB is 10,232. The discrepancy with respect to the number of entries in the Wikizziunariu (see Table 1) is primarily due to the fact that some of them are written representations, rather than distinct standalone lemmas. The following RDF triples, expressed in Turtle syntax, represent the Sicilian lemma middeu,<sup>8</sup> classified as a masculine noun. They include multiple written representations (amiddeu, amoddei, middeu, muddeu, muddìu), each annotated with the ISO language tag @scn. These forms are considered orthographic or graphical variants of the same lemma and do not affect its morphological interpretation; all share the same grammatical gender (masculine). Additionally, the lemma is related to a lemma variant identified by a URI<sup>9</sup> corresponding to the lemma muddìa.<sup>10</sup> In our example, middeu and muddìa can be used interchangeably, but they differ in gender, the latter being a feminine noun.</p>
        <preformat>&lt;http://liita.it/data/id/DialettoSiciliano/lemma/753&gt; a lila:Lemma ;
    lila:hasGender lila:masculine ;
    lila:hasPOS lila:noun ;
    lila:lemmaVariant &lt;http://liita.it/data/id/DialettoSiciliano/lemma/1010&gt; ;
    dcterms:isPartOf &lt;http://liita.it/data/id/DialettoSiciliano/lemma/LemmaBank&gt; ;
    rdfs:label "middeu" ;
    ontolex:writtenRep "amiddeu"@scn , "amoddei"@scn , "middeu"@scn , "muddeu"@scn , "muddìu"@scn .</preformat>
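        <p>As an illustration of how a lemma description of this shape could be generated from extracted data, the following Python sketch builds the Turtle text with plain string formatting. It is only a sketch under stated assumptions: the function lemma_to_turtle and its parameters are hypothetical, prefix declarations are assumed to be emitted elsewhere, literals are not escaped, and a real pipeline would more likely rely on an RDF library.</p>
        <preformat><![CDATA[
```python
BASE = "http://liita.it/data/id/DialettoSiciliano/lemma/"


def lemma_to_turtle(lemma_id, label, pos, gender, written_reps, variant_id=None):
    """Serialize one lemma as a Turtle description (prefixes declared elsewhere)."""
    lines = [f"<{BASE}{lemma_id}> a lila:Lemma ;"]
    if gender is not None:  # e.g. verbs carry no grammatical gender
        lines.append(f"    lila:hasGender lila:{gender} ;")
    lines.append(f"    lila:hasPOS lila:{pos} ;")
    if variant_id is not None:
        lines.append(f"    lila:lemmaVariant <{BASE}{variant_id}> ;")
    lines.append(f"    dcterms:isPartOf <{BASE}LemmaBank> ;")
    # Assumption: labels and written forms contain no characters needing escaping.
    lines.append(f'    rdfs:label "{label}" ;')
    reps = " , ".join(f'"{rep}"@scn' for rep in written_reps)
    lines.append(f"    ontolex:writtenRep {reps} .")
    return "\n".join(lines)


print(lemma_to_turtle(
    753, "middeu", "noun", "masculine",
    ["amiddeu", "amoddei", "middeu", "muddeu", "muddìu"],
    variant_id=1010,
))
```
]]></preformat>
        <p>Called with the values of the middeu example, the function prints a description with the same predicates as the Turtle excerpt above.</p>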
        <p>Subsequently, the bilingual glossary was modeled. The Sicilian lexical entries were linked to the corresponding lemmas in the Sicilian LB via the ontolex:canonicalForm property. The Italian translations were connected to the Italian LB developed within the LiITA project using the same property. Furthermore, the lexical entries of the two languages were directly related through the vartrans:translatableAs property, which establishes a correspondence between translations. The following RDF triples define a lexical entry in Italian for the word frassino (ash), associated with a canonical form which represents the corresponding lemma in the LiITA LB. Furthermore, this entry is linked to its corresponding Sicilian lexical entry (middeu), establishing a cross-lingual correspondence between the Italian and Sicilian lexical resources.</p>
        <preformat>&lt;http://liita.it/data/LexicalResources/DialettoSiciliano/id/LexicalEntry/italian/328&gt; a ontolex:LexicalEntry ;
    rdfs:label "Lexical entry of Italian: frassino" ;
    ontolex:canonicalForm &lt;http://liita.it/data/id/lemma/993692&gt; ;
    vartrans:translatableAs &lt;http://liita.it/data/LexicalResources/DialettoSiciliano/id/LexicalEntry/siciliano/753&gt; .</preformat>
        <p>Figure 2 displays the lemma frassino (ash) as it appears in the LiITA LB, together with information regarding its grammatical gender (masculine) and part of speech (common noun). The node is linked to the lexical entries in the linked lexical resources through the property ontolex:canonicalForm. In particular, there are six entries connected via the vartrans:translatableAs property related to the Sicilian dictionary and one related to the dialect of Parma. The visualization also shows the lemmaVariant relation between middeu and muddìa.</p>
        <p><sup>8</sup> With URI: http://liita.it/data/id/DialettoSiciliano/lemma/753</p>
        <p><sup>9</sup> http://liita.it/data/id/DialettoSiciliano/lemma/1010</p>
        <p><sup>10</sup> The property lila:lemmaVariant relates two lemmas that are semantically related to one another but differ in some linguistic feature, such as gender or number.</p>
      </sec>
      <sec id="sec-1-3">
        <p>The linking process with the Italian LB was conducted in two distinct phases. In the initial phase, an automatic alignment was performed between the string of each translation of a Sicilian glossary entry and those recorded in the Italian LB, taking the part of speech into account. This procedure successfully linked 55% of the entries. An additional 19% of entries were identified as ambiguous, i.e., a single Italian entry corresponded to multiple lemmas within the LB, thus requiring manual disambiguation. For instance, the entry caglio, whose Sicilian translation is quagghialatti, could be linked either to the lemma identified by the URI http://liita.it/data/id/lemma/972573, corresponding to the meaning “rennet”, or to http://liita.it/data/id/lemma/972574, which refers to a type of herb or artichoke. To resolve such ambiguities, additional information was consulted from Wikizziunariu or other Sicilian-language dictionaries.</p>
        <p>Currently, 26% of the entries lack a corresponding link to the Italian LB. These terms include, among others, feminine or plural forms absent from the LB, as well as culturally specific terms unique to the Sicilian context, such as spènsiri translated as largo mantello utilizzato dai contadini (a wide cloak worn by peasants) or carpita translated as coperta rustica tessuta con ritagli di stoffa (a rustic blanket woven from fabric scraps).</p>
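        <p>The automatic phase of the linking process can be sketched in Python as a lookup keyed on written form and part of speech. The sketch below is an assumption-laden illustration, not the authors' implementation: the function names, the tuple layout of the toy Lemma Bank, and the NOUN tag assigned to caglio are hypothetical, while the lemma URIs are those quoted in the text.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def build_index(italian_lemmas):
    """Index Italian LB lemmas by (written form, part of speech)."""
    index = defaultdict(list)
    for uri, form, pos in italian_lemmas:
        index[(form.lower(), pos)].append(uri)
    return index


def align(translation, pos, index):
    """Classify one Italian translation as linked, ambiguous, or unlinked."""
    candidates = index.get((translation.lower(), pos), [])
    if len(candidates) == 1:
        return "linked", candidates
    if len(candidates) > 1:
        return "ambiguous", candidates  # left for manual disambiguation
    return "unlinked", []


# Toy excerpt of the Italian LB (hypothetical layout; URIs quoted in the text).
ITALIAN_LB = [
    ("http://liita.it/data/id/lemma/993692", "frassino", "NOUN"),
    ("http://liita.it/data/id/lemma/972573", "caglio", "NOUN"),
    ("http://liita.it/data/id/lemma/972574", "caglio", "NOUN"),
]
index = build_index(ITALIAN_LB)

print(align("frassino", "NOUN", index))
print(align("caglio", "NOUN", index))
print(align("largo mantello utilizzato dai contadini", "NOUN", index))
```
]]></preformat>
        <p>In this toy setting frassino is linked directly, caglio is flagged as ambiguous because two lemmas share its form and part of speech, and the descriptive gloss of spènsiri finds no match, mirroring the three outcomes reported above.</p>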
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Case Studies</title>
      <sec id="sec-2-1">
        <title>Using SPARQL queries, it is possible to extract linguis</title>
        <p>tically meaningful information from multiple perspec- 1. -ia (same Greek ía-sufix for abstractivize
nomtives.11 inals), as in ancarìa ∼ angheria (vexation), and</p>
        <p>For instance, one can retrieve Sicilian lemmas having magarìa ∼ stregoneria (witchcraft);
written representations beginning with a d and an r; the 2. -ità (&lt; Latin *-ITÁ(TEM)), as in avracìa ∼
altezcomplementary distribution [d] ∼ [r] is especially at- zosità (haughtiness) and liccum(ar)ìa ∼ golosità
tested in the western variant from Palermo when those (delicacy);
sounds appear in intervocalic or initial position. Among 3. -ezza (&lt; Latin *-ITIA(M) ∼ *-ITIES), as in laccanìa
such cases is the lemma dicembri (December) (&lt; Latin ∼ debolezza (weakness);
*DECEMBRE(M) ∼ *DECEMBRU(M)) that witnesses sev- 4. various other abstractivizing suxfies, such as
eral written representations, namely (a) dicièmmuru, (b) eccio, -io, -enza, -ita (with the accent on the
antedicèmmiru, (c) dicembru, (d) dicèmmuru, (e) dicièmmiru, penultimate syllable).
(f) ricièmmiru and (g) ricièmmuru. The lemmas (d) and (f)
indeed show the aforementioned allophony [d] ∼ [r] but
there are also other interesting phenomena. The lemmas 5. Experiments
(a) and (e) show the metafonesi (umlaut) in tonic syllables, Beyond the specific linguistic analyses enabled by
interi.e. a process of vowel assimilation; the lemmas (a), (b), operability, such as those presented in Section 4, the data
(d), (e), (f) and (g) witness the lag assimilation of Latin we provide can support a variety of experimental
appli*MB &gt; Sicilian MM [19]. Finally, the lemmas (a), (d) and cations. A couple of examples are given in the following
(g) attest a u-vowel, while the lemmas (b), (e) and (f) an subsections.
e-vowel: these are epentheses, thus random insertions of
one or more sounds to favor the pronunciation.</p>
        <p>It is also possible to search for lemmas having writ- 5.1. How much Sicilian do LLMs know?
ten representations that include ed or ied, an alternation The bilingual glossary may be used to assess the ability of
which graphically renders the umlaut of vowels in tonic Large Language Models (LLMs) to translate from Sicilian
syllables. This is a significant linguistic phenomenon into Italian. Specifically, we randomly selected 20 nouns,
in Sicilian, serving as a marker for distinguishing di- 20 adjectives, 20 verbs, and 20 adverbs, and prompted
alectal variants. It is generally attested in central and the main commercially available LLMs to translate each
western regions of the island, while it is absent in the word into Italian. We chose to focus on commercial
sysnorth-eastern areas. For example, in the Sicilian word tems (namely, ChatGPT, Gemini, and Claude) because
(a) aceddu (bird) (&lt; Latin *AU(I)CELLU(M)), the actual they are the most widely used by non-experts due to
pronunciation of dd is retroflexed as d. d. [ã:] but it is their user-friendly interfaces. A simple zero-shot prompt
here not represented [14]. This feature is contained was employed uniformly across all models: Traduci ogni
in all the following written representations, that is (b) parola dal siciliano all’italiano (Translate each word from
acieddu, (c) ancieddu and (d) oceddu. The tonic syllable Sicilian to Italian). The responses were compared against
is the middle one and witnesses either (1) no changes the translations provided in the glossary and were also
in lemmas (a) and (d) deriving from Latin *-CE- or (2) evaluated by one of the authors, a linguist and native
umlauted vowels in lemmas (b) and (c) both bisyllabic speaker of Sicilian. This additional human evaluation
["I.e]. The same phenomenon occurs with, among oth- was intended to determine whether certain translations,
ers, (ab)bruciareddu ∼ (ab)bruciarieddu (ripe ear), beddu even if not identical to those recorded in the resource,
∼ bieddu (beautiful), ciuceddu ∼ ciucieddu (soup, broth, could nonetheless be considered acceptable. For
examdelicacy), frateddu ∼ fratieddu (brother), marzamareddu ple, while the adjective bacioccu is oficially translated
and mazzamareddu ∼ marzama(u)rieddu, mazzumau- only as sempliciotto (nitwit), the alternatives sciocco
(foolrieddu and mazzamarieddu (whirlwind, whirlpool, de- ish) (proposed by GPT-4o) and tonto (dumb) (provided
mon), munzeddu ∼ munzieddu (stack, pile), pisciteddu ∼ by Claude Sonnet 4) were considered equally valid.
Tapiscitieddu (small fish). ble 2 presents the results of this evaluation in terms of</p>
        <p>As for morphology, we can search for nouns ending (synonym-aware) accuracy.
with -ìa (&lt; Greek -ía), an abstract sufix which is one Table 2 reveals not very high accuracies even with
of the most common and attested. We can thus notice synonym tolerance. Gemini 2.5 Flash tops the list at 67%
that the Sicilian sufix is variously represented in Italian accuracy, about 6 points ahead of GPT o3 and roughly
15 points above Claude 4 Sonnet (51%) and GPT-4o
(52%). Even the best-performing model thus mistranslates
11Queries can be found in the GitHub repository: https://github.com/</p>
        <p>LiITA-LOD/LocalVarieties/tree/main/Siciliano. They can be tested
on the following endpoint: https://liita.it/sparql.
roughly one word out of three, underscoring how
lowresource dialects remain challenging for general-purpose
systems. An interesting case is that of GPT o3, which,
during the reasoning process, retrieves information from
the Web. For certain translations, it explicitly cites its
sources, including the Wikizziunariu, the vocabulary
published on the TerraLab blog,12 and the lexicon curated by
the group Salviamo il siciliano.13 This approach leads to
better accuracy than the GPT-4o model but still lower
than that of Gemini 2.5 Flash.</p>
        <p>Two noteworthy observations can be drawn from
Figure 3, which shows with a bar chart the accuracy
calculated for each part of speech. First, verbs consistently
emerge as the most challenging grammatical category
to translate across all four models. Second, GPT o3 and
Gemini 2.5 Flash exhibit relatively stable performance
across categories, whereas Claude 4 Sonnet and
GPT4o show greater variability. However, given the limited
sample size of only 80 items, the results are subject to a
high sampling error, and the observed diferences are not
statistically significant. Future work should expand the
benchmark and incorporate a broader range of dialectal
variants to enable more robust evaluation.</p>
        <p>Error analysis shows that 18 words were incorrectly
translated by all the systems. More generally, all models
exhibit a systematic tendency to infer translations on the
basis of superficial orthographic similarity between the
Sicilian lemma and a resembling Italian word, which is
then selected as the output. For example, mbròcculi is
rendered as broccoli, although its actual meaning is moina
(flattery), and pisuliddu is rendered as pisellino (little pea),
whereas the intended sense is permaloso (touchy).</p>
        <p>12 https://www.terralab.it</p>
        <p>13 http://www.salviamoilsiciliano.com</p>
      </sec>
      <sec id="sec-2-2">
        <title>5.2. Evaluating Bilingual Lexicon Induction</title>
        <sec id="sec-2-2-2">
          <p>A second experiment used the bilingual glossary to build
cross-lingual word embeddings and to evaluate the
resulting mapped vectors on the Bilingual Lexicon Induction
(BLI) task. Irvine and Callison-Burch [20] define BLI as
“the task of inducing word translations from monolingual
corpora in two languages.” Although recent work has
introduced solutions based on LLMs [21] [22], one of the
most widely adopted methods is still to align embeddings
trained separately on monolingual corpora into a shared
vector space. We therefore applied vecmap<sup>14</sup> [23] in its
supervised mode to map Sicilian and Italian fastText
embeddings.<sup>15</sup> After removing homographs and Sicilian
lemmas whose Italian equivalents were multi-token
expressions, the glossary was partitioned into training and
test sets using a 90:10 ratio, yielding 9,698 Sicilian–Italian
pairs for training and 1,079 pairs for testing. Evaluation with
nearest-neighbor retrieval (with k=10) resulted in an
accuracy of 19.8% (coverage=50.6%). With Cross-domain
Similarity Local Scaling (CSLS) retrieval, a cosine-similarity
variant that attenuates the hubness problem, namely the
tendency of a small subset of vectors to appear
disproportionately often as nearest neighbors of other
points [24], the result is even lower, i.e., 14.68%. These low
scores suggest that, although more than 9.6K seed pairs is a
non-trivial amount for a low-resource variety such as
Sicilian, many words remain out of vocabulary.</p>
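          <p>The difference between the two retrieval criteria can be sketched on toy data. The example below is a minimal pure-Python illustration, not vecmap's implementation: it scores candidates with CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean cosine similarity of a source vector to its k nearest target vectors and r_S(y) is the analogous quantity for a target vector, following the definition in [24]. All vectors, function names, and the value k=2 are hypothetical and chosen only so that one target behaves as a hub.</p>
          <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_top_k(values, k):
    """Mean of the k largest values."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def nn_retrieve(sim):
    """Plain nearest neighbour: index of the most similar target per source."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def csls_retrieve(sim, k):
    """CSLS retrieval: penalise targets (and sources) with dense neighbourhoods."""
    n_src, n_tgt = len(sim), len(sim[0])
    r_src = [mean_top_k(row, k) for row in sim]
    r_tgt = [mean_top_k([sim[i][j] for i in range(n_src)], k) for j in range(n_tgt)]
    best = []
    for i in range(n_src):
        scores = [2 * sim[i][j] - r_src[i] - r_tgt[j] for j in range(n_tgt)]
        best.append(max(range(n_tgt), key=scores.__getitem__))
    return best


def unit(angle_deg):
    """2-D unit vector at the given angle (toy stand-in for an embedding)."""
    angle = math.radians(angle_deg)
    return (math.cos(angle), math.sin(angle))


# HYPOTHETICAL toy data: target 0 is a hub lying between both source vectors;
# targets 1 and 2 are the intended translations of sources 0 and 1.
sources = [unit(30), unit(-30)]
targets = [unit(0), unit(65), unit(-65)]
sim = [[cosine(s, t) for t in targets] for s in sources]

nn_result = nn_retrieve(sim)
csls_result = csls_retrieve(sim, k=2)
print("nearest neighbour:", nn_result)
print("CSLS:", csls_result)
```
]]></preformat>
          <p>In this toy configuration plain nearest-neighbor retrieval maps both source vectors to the hub, whereas CSLS recovers the intended targets. On real data the effect depends on the quality and coverage of the embeddings, so CSLS is not guaranteed to improve accuracy.</p>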
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusions</title>
      <sec id="sec-3-1">
        <p>This work represents a step toward the integration of the
Sicilian dialect into the ecosystem of Linguistic Linked
Open Data [25]. By modeling and publishing a bilingual
Sicilian–Italian glossary extracted from Wikizziunariu,
and by aligning it with the LiITA LB through established
ontologies such as OntoLex-Lemon and LiLa, we
provide a reusable, interoperable lexical resource that
promotes the visibility and accessibility of Sicilian in digital
environments. The two preliminary NLP experiments,
evaluating LLMs’ translation capabilities and testing BLI,
highlight both the potential and the current limitations
of applying computational methods to under-resourced
varieties.</p>
        <p>Future work will proceed along multiple directions. First,
we plan to model and integrate additional Sicilian resources,
with particular attention to Antonino Traina’s
Nuovo vocabolario siciliano–italiano, which is already
available in digital format. Second, we aim to broaden the
scope of the LiITA KB by incorporating resources from
other dialects. An expanded multilingual dataset will
enhance interoperability and enable richer cross-lingual
analyses. Third, we intend to link textual resources to
the LB. However, this will require reliable
lemmatization procedures, a non-trivial task for dialects with
non-standardized orthographies and scarce annotated corpora.</p>
        <p>Finally, we plan to extend the range and depth of NLP
experiments to evaluate downstream tasks with the goal
of advancing computational support for Italy’s linguistic
diversity.</p>
        <p>14 https://github.com/artetxem/vecmap</p>
        <p>15 https://fasttext.cc/docs/en/crawl-vectors.html</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>This contribution is partly funded by the European Union
- Next Generation EU, Mission 4 Component 1 CUP
J53D2301727OOO1. The PRIN 2022 PNRR project
“LiITA: Interlinking Linguistic Resources for Italian
via Linked Data” is carried out jointly by the Università
Cattolica del Sacro Cuore, Milano and the Università di
Torino.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the authors used ChatGPT (OpenAI) for text
translation. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>, Association for Computational Linguistics, Bangkok, Thailand, <year>2024</year>, pp. <fpage>743</fpage>-<lpage>753</lpage>. URL: https://aclanthology.org/2024.acl-short.67/. doi:10.18653/v1/2024.acl-short.67.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [23] <string-name><given-names>M.</given-names> <surname>Artetxe</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Labaka</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Agirre</surname></string-name>, <article-title>Learning bilingual word embeddings with (almost) no bilingual data</article-title>, in: <source>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>, <year>2017</year>, pp. <fpage>451</fpage>-<lpage>462</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [24] <string-name><given-names>A.</given-names> <surname>Conneau</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Lample</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Ranzato</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Denoyer</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Jégou</surname></string-name>, <article-title>Word translation without parallel data</article-title>, <source>arXiv preprint arXiv:1710.04087</source> (<year>2017</year>).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [25] <string-name><given-names>P.</given-names> <surname>Cimiano</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Chiarcos</surname></string-name>, <string-name><given-names>J. P.</given-names> <surname>McCrae</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Gracia</surname></string-name>,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>