
A BERT-based Approach for Part-of-Speech Tagging in the Low-Resource Context of Sardinian

Salvatore Mario Carta

1 2

Filippo Concas

Gianni Fenu

Alessandro Giuliani

Marco Manolo Manca

Mirko Marras

Piergiorgio Mura

Simone Pisano

0 Department of Humanities, University for Foreigners of Siena, Piazza Carlo Rosselli 27/28, 53100 Siena, Italy
1 Department of Mathematics and Computer Science, University of Cagliari, Via Ospedale 72, 09124 Cagliari, Italy
2 VisioScientiae S.r.l., Via Francesco Ciusa 46, 09131 Cagliari, Italy

2025

Natural language processing (NLP) has made significant improvements in recent years, primarily driven by the latest advancements in deep learning technologies and the increasing availability of large-scale linguistic resources. Nevertheless, such advancements have mostly benefited high-resource languages, leaving many minority and underrepresented languages at the margins of computational linguistics research. Sardinian, the native language of the island of Sardinia, exemplifies this disparity. Indeed, despite its cultural and linguistic value, it lacks proper resources, annotated corpora, and NLP tools. This work proposes a Part-of-Speech tagging system for Sardinian characterized by methods consistent with its morphological specificity. The system integrates a BERT-based token classifier capable of assigning a grammatical category to each input word in a sentence. The classifier was trained on a balanced, manually annotated corpus, and its performance was evaluated using standard machine-learning-oriented performance metrics (Accuracy, F1-score, Recall, and Precision). Experiments show that pre-trained architectures such as BERT remain effective even for languages with limited data availability.

Keywords: Low-resource languages, Part-of-speech tagging, Language models

1. Introduction

Recent scientific advances in language models (LMs) and natural language processing (NLP) have contributed to the development of sophisticated technologies for generating, analyzing, and interpreting the world's major languages. In such a context, large language models (LLMs), such as GPT-4 [1], Llama-3 [2], and Phi-4 [3], have shown strong proficiency across a wide range of language-related tasks [4], including sentiment analysis [5, 6], text classification [7, 8], text summarization, and part-of-speech (PoS) tagging [9].

However, despite their increasing effectiveness, LLMs still present limitations in performing several NLP tasks [10]. In particular, they struggle when the task concerns minority and/or low-resource languages, which often exhibit distinctive linguistic features that make them a subject of special interest for linguists. However, linguists rarely have access to automated tools and resources that facilitate in-depth studies, as these minority and/or low-resource languages are often underrepresented in the digital domain and thus inadequately, or even entirely, unknown to most models. Indeed, in this scenario, tools that support linguistic analysis, such as PoS taggers, remain scarce or nonexistent, limiting the ability of linguists to study the features of such languages at scale. More specifically, PoS tagging aims to assign a grammatical label to every word in a sentence to facilitate the study of its grammatical structure. This task is crucial for analyzing the multifaceted nature of a given language.

Sardinian, a Romance language spoken primarily on the island of Sardinia (Italy), stands out as a notable case study of a low-resource language. Indeed, its rich morphological structure and its classification as an endangered language have attracted increasing attention in linguistic preservation and digital humanities [11]. In this direction, the present work describes the creation and the evaluation of an automatic Sardinian PoS tagging model. The methodology relies on fine-tuning a BERT-based language model [12] using a corpus manually annotated by linguists specializing in Sardinian. The experimental phase includes the analysis of the hyperparameters and the monitoring of machine-learning-oriented performance metrics. The proposed approach provides a foundational methodology that can be adapted to develop similar tools for other low-resource languages.

The remainder of this paper is structured as follows: Section 2 describes the state of the art; Section 3 provides a mathematical formulation of the problem and a description of the proposed approach; Section 4 illustrates the results; and finally, Section 5 concludes the work.

CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy
* Corresponding author.
salvatorem.carta@unica.it (S. M. Carta); filippo.concas2@unica.it (F. Concas); fenu@unica.it (G. Fenu); alessandro.giuliani@unica.it (A. Giuliani); marcom.manca@unica.it (M. M. Manca); mirko.marras@unica.it (M. Marras); piergiorgio.mura@unistrasi.it (P. Mura); simone.pisano@unistrasi.it (S. Pisano)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Work

This section provides an overview of the state of the art in PoS tagging for low-resource languages, followed by a description of the work carried out for the Sardinian language in the context of NLP. The PoS tagger is an NLP tool that assigns a grammatical label to each word in a sentence, thus enabling the identification of the function of each word in that sentence. This tool facilitates syntactic analysis and provides fundamental support for developing any low-resource language, including Sardinian, by automating linguistic analysis in contexts where structured linguistic resources are lacking.

In recent years, numerous approaches have been extensively investigated, with the aim of developing automatic tagging systems or augmenting training corpora to enable high-accuracy, high-efficiency grammatical annotation at the sentence level. In the context of low-resource languages, where typically scarce data is publicly available, data from more widely known languages similar to the target language is usually employed; one approach following this direction involves the use of Hidden Markov Models (HMMs), in which the PoS tagging task is modelled as a sequence-to-sequence problem [13, 14]. HMMs are first trained on a language with large amounts of annotated data, followed by a model that transfers the learned information to the target language of interest. Different approaches that fill the gap in labeled data are based on adopting unsupervised learning techniques to group words within sentences, annotate them, and then assign a label [15, 16]. Moreover, the problem of PoS tagging is sometimes interpreted as a classification problem. For example, several works proposed to first train fully-connected neural networks (FNNs) and long short-term memory (LSTM) models on annotations projected into English and, subsequently, adapt them to the tags of the target low-resource language [17, 18, 19].

The aforementioned works build upon resources from other languages to create the PoS taggers; alternative methods focus on optimizing the limited availability of data for the target language to achieve equally good results. An example is provided by a model that utilizes translations of parts of the Bible to train PoS taggers by aggregating tags from multiple annotated languages and spreading them through word alignment within the text [20]. Furthermore, different deep learning models have been evaluated to build a PoS tagger for the Albanian language [21], which is a low-resource language as well.

To the best of our knowledge, no prior studies describe a PoS tagger for the Sardinian language. Recent work has introduced a linguistic resource designed to identify semantic relationships between Sardinian words through manual mapping of existing WordNet entries to Sardinian word meanings [22]. However, this resource does not include any tools for automatic linguistic annotation.

3. Methodology

This section describes the methodology followed to build and evaluate the PoS tagger for the Sardinian language. The section is organized as follows: first, the problem is formulated mathematically; subsequently, an overview of the entire methodology is provided; then, an analysis of the data used to build the PoS tagger is conducted; finally, the fine-tuning technique employed is presented.

3.1. Problem Formulation

Mathematically, let $s \in \mathcal{S}$ be a sentence belonging to a set $\mathcal{S}$ of sentences; then $s$ can be identified as a vector whose entries represent the words included in the sentence, $s = [w_1, \dots, w_n]$, with $n \in \mathbb{N}^+$. Therefore, a PoS tagger can be defined as a function $f$ expressed as:

$$f : \mathcal{S} \to \mathcal{T}, \qquad s \mapsto f(s) = t = [t_1, \dots, t_n]$$

where $t_i \in \mathcal{C}$ identifies the tag, i.e., a grammatical label, of the $i$-th word and is chosen from a specific tagset $\mathcal{C}$, and $\mathcal{T}$ is the set of vectors whose entries contain the tag of each word in a sentence.

In this work, from an application point of view, the problem of estimating the function $f$ defined above is interpreted as a classification problem, and therefore, it is solved by training a specific classifier. Given a dataset $\mathcal{D} = \{(s, t) \mid s \in \mathcal{S}, t \in \mathcal{T}\}$ that includes sentences and their respective tags, the objective is to optimize the parameters of a classifier so that it accurately assigns the correct grammatical tag to each word in a sentence.

3.2. Methodology Overview

Figure 1 illustrates the workflow followed to develop the Sardinian PoS tagger proposed in this study.

Figure 1: Workflow of the Sardinian PoS tagger.

The process consists of three main steps. In the first step (pre-processing), the available tagged data is transformed and formatted adequately for use in the subsequent steps. Once transformed, the data is split into two parts: one part is used for training the model, and the other for evaluating it. In the second step (fine-tuning), the model learns to accurately assign grammatical tags to each word in a sentence based on the training data.

Finally, in the third step (testing), the fine-tuned model automatically annotates the test data, and standard machine learning metrics are computed to evaluate how well it has learned to assign tags to each word.

Let us note that, as mentioned in the previous section, tags must be chosen from a specific tagset. In this work, two different state-of-the-art tagsets are considered: the Universal Tags [23] (denoted as tag), and the tagset conceived for the Italian language and adopted in the work of Palmero Aprosio & Moretti [24] (denoted as fineTag). The latter tagset is compliant with the EAGLES standards [25] and is more fine-grained than the former. The pipeline depicted in Figure 1 is executed for each tagset separately.
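As an illustration of the formulation in Section 3.1 and of the train/test split mentioned above, the sketch below represents a tagged sentence as a pair of equal-length lists and divides a dataset into two parts. The names `TaggedSentence`, `Tagger`, and `split_dataset`, the fixed random seed, the rounding-down rule, and the placeholder tokens are assumptions made for illustration; only the 80/20 proportion comes from the paper (Section 3.3), and the shuffling procedure actually used is not stated.

```python
import random
from typing import Callable, List, Tuple

Sentence = List[str]                     # s = [w_1, ..., w_n]
Tags = List[str]                         # t = [t_1, ..., t_n], one tag per word
TaggedSentence = Tuple[Sentence, Tags]   # one element of the dataset D
Tagger = Callable[[Sentence], Tags]      # the function f: s -> t


def split_dataset(data, train_percent=80, seed=0):
    """Shuffle tagged sentences and split them into train and test parts."""
    for words, tags in data:
        if len(words) != len(tags):
            raise ValueError("each word needs exactly one tag")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic, rounding down (an assumption, not stated in the paper).
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder data: 10 two-word sentences with hypothetical tags.
toy = [(["w1", "w2"], ["NOUN", "VERB"]) for _ in range(10)]
train, test = split_dataset(toy)
print(len(train), len(test))
```

Because the pipeline is run once per tagset, the same function would simply be called twice, once on the sentences annotated with tag and once on those annotated with fineTag.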

3.3. Data Pre-Processing

In the context of minority languages, particularly the Sardinian language, it is challenging to find or utilize data that enables the training of specific models. In our scenario, to the best of our knowledge, the only available dataset for the Sardinian language that allows us to address a PoS tagging task is proposed by Mura et al. [26].

The dataset consists of 1,472 sentences in which each word is annotated with both tagsets described in the previous section. The sentences were extracted from transcripts of interviews conducted with 21 native Sardinian emigrants, each speaking a different variety of Sardinian, as part of the Mannigos project [27].

Figure 2 illustrates the distribution of the number of words per sentence in the dataset. It is worth pointing out that the term word in this context refers to any part of the sentence, including punctuation. It can be observed that most sentences contain a limited number of words, with a significant portion not exceeding 100 words. Another key aspect is the distribution of tags within the dataset.

Ensuring a balanced representation of grammatical categories allows the model to effectively learn each tag from the two defined tagsets. Figure 3 illustrates this distribution and highlights the overall balance level. Even though the dataset appears to be heavily imbalanced due to the natural linguistic structures that are common in any language, it is noteworthy that all tag labels in the considered tagsets are represented in the dataset.

The development of the PoS tagger in this work is based on fine-tuning the BERT language model. This choice requires a careful data pre-processing phase, where the text is appropriately tokenized (i.e., divided into smaller units called tokens, which may consist of words, sub-words, or characters). For consistency, this process was performed using the BERT tokenizer, which employs the WordPiece technique. The latter breaks unknown words into more common sub-word units, ensuring that each token aligns with an entry in the BERT vocabulary. Finally, each token is transformed into a numerical identifier that BERT can process. To streamline processing, each sentence was standardized to a length of 512 tokens by appending padding tokens as needed.

Following tokenization, the dataset was divided into two train sets and two test sets, one for each tagset, selecting 80% of the sentences for the first set and 20% for the second. These pre-processing steps, along with the removal of sentences containing missing or incorrect tags, led to the data splits described in Table 1.

Table 1: Number of sentences in the train and test sets for each tagset.

Tagset         Train    Test
tag [23]       1,172    293
fineTag [24]   1,177    295

3.4. Model Fine-Tuning

The next step is to choose the appropriate model for the fine-tuning phase. As a result of extensive preliminary empirical evaluations, the pre-trained BERT model in its large-cased version was selected [12]. In more detail, BERT is a deep learning model based on the Transformer architecture developed by Google. Its special feature is its ability to process context bidirectionally, i.e., by simultaneously considering both the context to the left and the right of a word, significantly improving performance in the context of this work. It should be noted that, in this study, BERT was implemented for token classification, and the same architecture is used for both tagsets. For token classification, BERT follows this structure:

• Input Embedding: Each token is transformed into a vector representation that combines token embeddings, i.e., the token representation, segment embeddings, i.e., the sentence the token belongs to, and positional embeddings, i.e., the position of the token in the sentence.
• Transformer Layers: The network comprises 24 layers of this type, each using multi-head attention mechanisms to model the relationships between tokens.
• Output Layer: BERT returns a probability distribution over all possible classes for each token. The final output is a sequence of logits, with one prediction for each token.

Even though the BERT model is multilingual, it does not recognize minority languages like Sardinian. However, the pre-trained BERT model has learned the morphosyntactic behaviors of languages similar to Sardinian, such as Italian or Spanish. Consequently, a fine-tuning phase in which the BERT model identifies the primary characteristics of the Sardinian language can lead to a high-performance PoS tagger for the Sardinian language.

Given that the PoS tagging problem can be interpreted as a classification problem, the tuning phase of a token classification model can be interpreted as a supervised training phase, in which the model sees which tags are assigned to each part of speech. In this phase, it is therefore essential to choose the appropriate loss function to minimize during the tuning phase and the hyperparameters to be input to the trainer to allow optimal learning. As for the former, the Cross-Entropy Loss function was chosen, which, with the padding approach, takes the form:

$$\mathcal{L} = -\frac{1}{\sum_{i=1}^{N} m_i} \sum_{i=1}^{N} m_i \cdot \log(p_{i, y_i}) \qquad (1)$$

where:

• $N$ is the total number of tokens;
• $m_i \in \{0, 1\}$ is the mask that is 1 if the token is valid (not padding), 0 otherwise;
• $y_i \in \mathcal{C}$ is the true class of the token $i$, with $\mathcal{C}$ the tagset;
• $p_{i, y_i}$ is the probability the model predicts for the correct class $y_i$.

Note that the same loss function was used for models trained on both the tag and fineTag sets.
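To make the role of the padding mask in the cross-entropy loss of Formula 1 concrete, the following is a minimal pure-Python sketch. It is not the authors' training code: the function names, the toy logits, and the three hypothetical classes are assumptions for illustration, and a real fine-tuning run would rely on a deep learning framework rather than on these helpers.

```python
import math
from typing import List, Sequence


def padding_mask(n_tokens: int, max_len: int) -> List[int]:
    """Return the mask m_i: 1 for the first n_tokens positions, 0 for padding."""
    if n_tokens > max_len:
        raise ValueError("sequence longer than max_len")
    return [1] * n_tokens + [0] * (max_len - n_tokens)


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over the class logits of one token."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def masked_cross_entropy(logits, labels, mask) -> float:
    """Average of -log p_{i, y_i} over the tokens with m_i = 1."""
    total, valid = 0.0, 0
    for token_logits, y, m in zip(logits, labels, mask):
        if m:
            total -= log_softmax(token_logits)[y]
            valid += 1
    if valid == 0:
        raise ValueError("mask selects no tokens")
    return total / valid


# Toy example: 2 real tokens padded to length 4, 3 hypothetical classes.
mask = padding_mask(2, 4)
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 1, 0, 0]  # labels at padded positions are ignored by the mask
loss = masked_cross_entropy(logits, labels, mask)
print(round(loss, 4))
```

Dividing by the number of valid tokens rather than by the padded length keeps the loss comparable between short and long sentences, which matters when every sentence is padded to the same fixed length.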

Figure 4 shows the evolution of training loss, validation loss, and validation F1 score over epochs for both models during the tuning phase. These graphs were instrumental in determining the optimal number of fine-tuning epochs and in choosing other hyperparameters. Although all three metrics were considered, particular attention was paid to the validation F1 score, as it most directly reflects the model's ability to generalize on the classification task.

3.5. Model Testing

Several metrics were used to evaluate model performance. Given the classification nature of the task, the four performance metrics used in this study are Accuracy, Recall, Precision, and F1 score. Note that the last three metrics were calculated in their macro version, considering the presence of more than two classes to be evaluated.

These metrics allow us to assess how accurately the PoS tagging models classify the various words in the sentence. In particular, they allow us to analyze both the model's ability to identify all relevant classes (Recall) and its accuracy in avoiding false assignments (Precision), providing an overall measure of the balance between these two properties (F1 score). The following formulas define the metrics in detail.

$$\text{Accuracy} = \frac{\sum_{i=1}^{N} m_i \cdot \mathbb{1}(y_i = \hat{y}_i)}{\sum_{i=1}^{N} m_i}$$

$$\text{Precision}_{macro} = \frac{1}{K} \sum_{c=0}^{K-1} \frac{TP_c}{TP_c + FP_c}$$

$$\text{Recall}_{macro} = \frac{1}{K} \sum_{c=0}^{K-1} \frac{TP_c}{TP_c + FN_c}$$

$$\text{F1}_{macro} = \frac{1}{K} \sum_{c=0}^{K-1} \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$

in which $N$, $m_i$, and $y_i$ are the same as defined in Formula 1; while $\hat{y}_i$ is the tag predicted by the model, $K$ is the size of the tagset (i.e., the number of all possible tags), and $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 if its condition is true, 0 otherwise. $TP_c$, $FP_c$, and $FN_c$ denote the true positives, false positives, and false negatives for class $c$, and $\text{Precision}_c$ and $\text{Recall}_c$ are the corresponding per-class values.

It is important to note that all metrics introduced vary within a range between 0 and 1, with values closer to 1 indicating better performance.

4. Experimental Results

This section is organized into two main parts. In the first part, we present the quantitative analysis of the models, reporting and comparing their performance on the test sets. These results allow us to evaluate the overall effectiveness of each model in a rigorous and reproducible manner. The second part is dedicated to a brief qualitative analysis, in which we examine selected examples unobserved during the fine-tuning and testing phases. This analysis aims to illustrate the models' predictions in practice, thus complementing the information obtained from the quantitative evaluation.

4.1. Quantitative Analysis

Table 3 shows the performance of the two fine-tuned BERT-based models on the test sets.¹ The first model, fine-tuned on the coarser-granularity tagset (tag), achieves an accuracy of 0.9418 and a macro F1 score of 0.9298, with recall and precision scores of 0.9347 and 0.9250, respectively. The second model, fine-tuned on the more detailed tagset (fineTag), produces slightly lower but still good results, with an accuracy of 0.9362, a macro F1 of 0.9291, a recall of 0.9308, and a precision of 0.9274. These results indicate that both models generalize well to the test data. It is important to note that high performance is still achieved even in the fineTag setting, which involves a classification task with 36 PoS classes (the tag set included 15 PoS classes). This observation highlights the robustness of the fine-tuned models, demonstrating their ability to handle more complex and fine-grained label distributions without substantial performance loss. Notably, these results are achieved despite the linguistic variability within the dataset, which includes multiple Sardinian language varieties with differing morphological features. Nevertheless, the models successfully capture the core structural patterns of each variety, demonstrating strong generalization across intra-language variation.

¹ While per-tag evaluation metrics could in principle offer additional insights, given also the large size of the tagset, we chose to focus on overall metrics to maintain a clear and coherent narrative aligned with the primary research questions. We consider a detailed per-tag analysis an important direction for future work, particularly in application-specific settings where tag-level behavior is critical.
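The overall scores reported above follow the metric definitions of Section 3.5. As a hedged illustration of how such values can be obtained from token-level predictions, the sketch below computes masked accuracy and macro-averaged precision, recall, and F1 in plain Python. The function names, the toy labels, and the convention of scoring a class as 0 when a denominator is zero are assumptions; the paper does not state how undefined per-class values were handled.

```python
from typing import Dict, Sequence


def masked_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                    mask: Sequence[int]) -> float:
    """Fraction of non-padding tokens whose predicted tag equals the true tag."""
    valid = [(t, p) for t, p, m in zip(y_true, y_pred, mask) if m]
    return sum(t == p for t, p in valid) / len(valid)


def macro_scores(y_true: Sequence[int], y_pred: Sequence[int],
                 mask: Sequence[int], n_classes: int) -> Dict[str, float]:
    """Macro precision, recall, and F1: per-class values averaged over classes."""
    tp = [0] * n_classes
    fp = [0] * n_classes
    fn = [0] * n_classes
    for t, p, m in zip(y_true, y_pred, mask):
        if not m:
            continue  # padding tokens do not contribute
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        # Zero-denominator cases are scored as 0.0 (an assumption).
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return {
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


# Toy example with 3 hypothetical classes; the last position is padding.
y_true = [0, 0, 1, 2, 2, 0]
y_pred = [0, 1, 1, 2, 0, 0]
mask = [1, 1, 1, 1, 1, 0]
accuracy = masked_accuracy(y_true, y_pred, mask)
scores = macro_scores(y_true, y_pred, mask, n_classes=3)
print(accuracy, scores)
```

Macro averaging gives every tag the same weight regardless of its frequency, so with an imbalanced tag distribution rare tags affect the macro scores as much as frequent ones, whereas accuracy is dominated by the most frequent tags.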

4.2. Qualitative Analysis

could be the creation of an accessible user interface that would make the PoS tagger usable by linguists, scholars, and citizens who are not experts in computer science. Such a tool could be integrated into digital platforms for teaching, documentation, and linguistic research on Sardinian, contributing to greater digitization and visibility of the language.

Acknowledgments

We acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.5 - Call for tender No. 3277 published on December 30, 2021 by the Italian Ministry of University and Research (MUR) funded by the European Union – NextGenerationEU. Project Code ECS0000038 – Project Title eINS Ecosystem of Innovation for Next Generation Sardinia – CUP F53C22000430001 - Grant Assignment Decree No. 1056 adopted on June 23, 2022 by the Italian Ministry of University and Research (MUR).

Declaration on Generative AI

References

[17] Duong, Cohn, Verspoor, Bird, P. Cook, What can we get from 1000 tokens? A case study of multilingual POS tagging for resource-poor languages, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 886-897.

[18] Fang, Cohn, Learning when to trust distant supervision: An application to low-resource POS tagging using cross-lingual projection, arXiv preprint arXiv:1607.01133 (2016).

[19] Fang, T. Cohn, Model transfer for tagging low-resource languages using a bilingual dictionary, arXiv preprint arXiv:1705.00424 (2017).

[20] Ž. Agić, D. Hovy, A. Søgaard, If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages, in: C. Zong, M. Strube (Eds.), Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 268-272.

[21] Fetahi, Hamiti, Susuri, Selimi, D. I. Saiti, Neural network and transformer-based POS tagger for low resource languages, in: 2024 International Conference on Information Technologies (InfoTech), 2024, pp. 1-4.

[22] Angioni, Tuveri, Virdis, L. L. Lai, M. E. Maltesi, Sardanet: A linguistic resource for Sardinian language, in: Proceedings of the 9th Global Wordnet Conference, 2018, pp. 412-419.

[23] Universal POS tags, 2014-2024. URL: https://universaldependencies.org/u/pos/.

[24] Palmero Aprosio, G. Moretti, Tint 2.0: an all-inclusive suite for NLP in Italian, in: Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), 2018.

[25] EAGLES part-of-speech (POS) tag set, 2014-2024. URL: https://www.ilc.cnr.it/EAGLES96/home.html.

[26] Mura, Pisano, Carta, Giuliani, M. Manca, The corpus of Sardinian emigrants: a tool for a quantitative approach to contact phenomena, MiLES: Minority Languages in European Societies - International Conference - Turin / Bard - BOOK OF ABSTRACTS, July 3-6, 2024.

[27] Pisano, Piunno, Ganfi, Appunti per un corpus di sardo multimediale, in: M. V. D. Marzo, S. Pisano (Ed.), Per una pianificazione del plurilinguismo in Sardegna, Condaghes, 2022, pp. 147-164.

[28] S. M. Carta, Chessa, Contu, Corriga, A. Deidda, G. Fenu, Frigau, Giuliani, Grassi, M. M. Manca, et al., Limba: An open-source framework for the preservation and valorization of low-resource languages using generative models, arXiv preprint arXiv:2411.13453 (2024).

[29] A. S. Podda, Balia, M. M. Manca, J. Martellucci, L. Pompianu, A deep learning strategy for the 3D segmentation of colorectal tumors from ultrasound imaging, Image and Vision Computing (2025) 105668.

[30] Saia, Carta, Fenu, L. Pompianu, Influencing brain waves by evoked potentials as biometric approach: taking stock of the last six years of research, Neural Computing and Applications 35 (2023) 11625-11651.

[31] A. Giuliani, R. Savona, S. Carta, G. Addari, A. S. Podda, Corporate risk stratification through an interpretable autoencoder-based model, Computers & Operations Research 174 (2025) 106884.

[32] Nallakaruppan, Balusamy, M. L. Shri, Malathi, Bhattacharyya, An explainable AI framework for credit evaluation and analysis, Applied Soft Computing 153 (2024) 111307.

[33] Carta, A. S. Podda, D. Reforgiato Recupero, M. M. Stanciu, Explainable AI for financial forecasting, in: International Conference on Machine Learning, Optimization, and Data Science, Springer, 2021, pp. 51-69.

[34] Pisu, Elia, Pompianu, Barchi, A. Acquaviva, S. Carta, Enhancing workplace safety: A flexible approach for personal protective equipment monitoring, Expert Systems with Applications 238 (2024) 122285.

[35] Armano, Giuliani, A two-tiered 2D visual tool for assessing classifier performance, Information Sciences 463-464 (2018) 323-343.

[36] A. S. Podda, Balia, Pompianu, Carta, G. Fenu, R. Saia, Cargram: CNN-based accident recognition from road sounds through intensity-projected spectrogram analysis, Digital Signal Processing 147 (2024) 104431.

[37] Mana, Allouhi, Hamrani, Rehman, I. El Jamaoui, K. Jayachandran, Sustainable AI-based production agriculture: Exploring AI applications and implications in agricultural practices, Smart Agricultural Technology 7 (2024) 100416.

[38] Rong, Xu, Liu, H. Liu, Ding, Liu, Luo, C. Zhang, Gao, Du-bus: a realtime bus waiting time estimation system based on multi-source data, IEEE Transactions on Intelligent Transportation Systems 23 (2022) 24524-24539.