<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Galliz at GeoLingIt: Enhancing BERT with Vocabulary Knowledge for Predicting the Region of Language Varieties of Italy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Tiziano Labruna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Simone Gallo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>CEUR Workshop ProceedingsC(EUR-WS.org)</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Natural Language Processing, Language varieties, Tweets classification</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>HIIS Laboratory, CNR - ISTI</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Processing and Speech Tools for Italian</institution>
          ,
          <addr-line>Sep 7 - 8, Parma, IT</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>The linguistic diversity of the Italian peninsula and its islands, characterized by several language varieties, represents a linguistic condition and a cultural treasure unique in Europe. However, the oral nature of these varieties poses a challenge to their preservation in written form. While significant research efforts have been dedicated to standard Italian language processing, less attention has been given to the language varieties of Italy and the development of supporting resources. This paper aims to study the peculiarities of the language varieties of Italy and to identify the region of origin of tweets written in non-standard Italian varieties. To achieve this goal, we utilized two main techniques: fine-tuning a language model (BERT) and implementing an algorithm that utilizes dictionaries of regional varieties and word frequency. Our results show that integrating lexical analysis with BERT could be a promising approach for this particular task. We present an overview of the data, methodology, and evaluation results, then discuss the implications of our findings.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Language varieties</kwd>
        <kwd>Tweets classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        1. Introduction and Motivations
for classifying tweets according to a subset of regions
(“special track”, in this case, Lazio and Toscana). To tackle
The Italian peninsula and its islands present considertahbilseproblem, we rely on the combination of two diferent
linguistic variation among the diferent regions that ctoemch-niques: the first one is based on the fine-tuning of a
pose it, as well as within the regions themselves. Tlhaenguage model (i.e., BERT), while the second is based on
presence of many diferent language varieties makes thains algorithm that utilizes regional varieties dictionaries
linguistic situation special and unique in Eu1r]o, paes [ and the frequency of words present in the tweets. The
well as a treasure of cultural diversity, interpretatiocnla,assnidfication results obtained from both techniques are
expression of the reality to which they belong. Howevneorr,malized and combined to derive the final result. In
these linguistic diversities are in danger of being ltohsetf,ollowing sections, we provide an overview of the
as most of them are passed on only orally, leaving dleastsa and resources used for the tasks. We then describe
room for written usag1e].[Despite significant research the methodology applied, including data augmentation,
eforts being devoted to processing techniques for stapnre-diction using the two diferent techniques, and global
dard Italian (e.g.2,,[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]), less efort has been devoted prediction. Next, we present the results obtained during
to supporting language varieties, both from a tecthhneoe-valuation phase. Finally, we discuss the findings and
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this paper, our main goal is to study varieties of
logical point of view and in terms of curated resoudrrcaesw some conclusions.
      </p>
      <p>
        Italy in order to develop efective methods for
classifying the region of origin of Twitter posts (tweets) wri2tt.enData and Resources
from Italy. We address two diferent tasks of
GeoLingIt 2023 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] from EVALITA 2023 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: one for classifying The dataset used for the tasks was collected by retrieving
non-standard Italian tweets according to their regigoenotoafgged tweets classified aITs by Twitter. The curators
origin at the country lev“setlan(dard track”), and another only kept posts that exhinboitn-standard language, along
with the region information that falls within the Italy
EVALITA 2023: 8th Evaluation Campaign of Natural Language
      </p>
    </sec>
    <sec id="sec-2">
      <title>2.1. Provided Data</title>
      <p>(S. Gallo)</p>
      <p>The provided data consists of training and development
splits for both thsteandard track and thespecial track
tasks. The data is provided in a tab-separated formthaet,input to contribute to the prediction process. Its
efecwith each column defining three properties: tiveness is evident from its state-of-the-art performance
on multiple NLP benchmarks, including GLUE, SQuAD
• “id”: an integer that uniquely identifies the tweeatn;d RACE.
• “text”: a string representing the text of the tAwletehtough BERT has been in existence for five years, we
in anon-standard language variety of Italy. Thibselieve that it remains highly suitable for our specific
variety may be present as a single word otrask. In light of its robust performance across diverse
phrase (there are many cases of code-switchingN)L,P applications, leveraging BERT finely aligns with the
or the entire tweet can be written in thatrevqau-irements of our work.
riety. Any sensitive information has been rDeu-ring the model selection, we also took into account
placed with placeholders by the curators (eI.tga.,lian variants of the classic BERT model (e.g.,
“bert“@tagged_user” is replaced with “[USER]”); base-italian1)”, but the preliminary results reported, in
• “region”: a string representing the tweets’ regsioomne case, worst performance compared to the “standard”
of provenance (e.g., Lazio, Sicilia, Toscana). English BERT (i.e., “bert-base-uncase2d)”</p>
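      <p>A row in this format might look as follows (the text is the sample sentence quoted in § 3.1; the id and region values are invented for illustration, not taken from the dataset):</p>
      <preformat>
id    text                                                             region
1     Fa ancora na sfaccim e per andare in #moto sulle mie montagne    Campania
      </preformat>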
      <sec id="sec-2-1">
        <title>The training set contains 13,669 tweets, covering all the</title>
        <p>administrative regions of Italy. The development set 2co.3n.- Vocabularies
sists of 552 tweets from 13 selected regions, namely CTaol-build the vocabularies needed for our purpose, we
abria, Campania, Friuli-Venezia Giulia, Emilia Romaguntai,lized various online resources as well as the text from
Lazio, Liguria, Lombardia, Piemonte, Puglia, Sardegnthae, provided tweets. Some of these vocabularies were
Sicilia, Toscana, and Veneto. The training set exhibiutssead for thestandard track, while other ones for the
strong imbalance, with highly represented regions slpikeceial track, as more precisely described in the rest of
Lazio (5549 items) and Campania (2971 items), while rte-his Section. All vocabularies are publicly av3a.ilable
gions such as Valle d’Aosta and Molise have only 14 and
35 items, respectively. The overall class distributioGnloisbal vocabulary We obtained a “global”
vocabushown in Fig.1. lary, containing words from language varieties spoken in
every Italian region, by performing web scraping on the
dictionary available at “Dialettando4.cTohmis” resulted
in a JSON file containing all the available words for each
region. This vocabulary was used for tsthaendard track
only.</p>
        <p>Unique words vocabulary This vocabulary was
generated starting from the provided training set and
considering the occurrences of the words for every tweet from
each region. Specifically, it contains all the unique words
present in the tweets, along with their corresponding
frequencies, grouped by region. This vocabulary was used
for thestandard track, and a subset of this vocabulary
with only the regions Toscana and Lazio was used for
thespecial track.
cessing (NLP) tasks. Unlike traditional language models32hhttttppss::////hguitghguibn.gcfoamce/s.cimo/obge-rdte-vb/aGsael-ulinzc_agseeodlingit
that utilize left-to-right or right-to-left approaches, BERT 4https://www.dialettando.com/dizionario/dizionario.lrea-sso,
utilizes bidirectional pre-training, allowing all tokterinevsedinon May 9th, 2023
track, and a subset of this vocabulary with only thefroer-equalizing the distribution of the samples for each
gions Toscana and Lazio was used for tspheecial track. region, which was initially extremely unbalanced (cfr. §
      <p>Unique words vocabulary. This vocabulary was generated starting from the provided training set, considering the occurrences of the words in every tweet from each region. Specifically, it contains all the unique words present in the tweets, along with their corresponding frequencies, grouped by region. This vocabulary was used for the standard track, and a subset of this vocabulary with only the regions Toscana and Lazio was used for the special track.</p>
      <p>Toscana vocabulary. We obtained this vocabulary by performing a web scraping of the terms present in the website of the “Vocabolario del Fiorentino Contemporaneo” (https://www.vocabolariofiorentino.it/ricerca/lemmi, retrieved on May 10th, 2023) and then converting the content of the website into a JSON file. This vocabulary was used for the special track only.</p>
      <p>Lazio vocabulary. We obtained this vocabulary by performing a web scraping of the terms present in the website of “The Roman Post” (https://www.theromanpost.com/2016/06/dizionario-dialetto-romanesco, retrieved on May 10th, 2023) and then converting the content of the website into a JSON file. This vocabulary was used for the special track only.</p>
    </sec>
    <sec id="sec-2-3">
      <title>3. Methodology</title>
      <p>In order to predict the most likely region of origin of a given sentence, we decided to make use of both LLMs (fine-tuned on an augmented training set) and lexical information on regional varieties, taken from the vocabularies presented above. As shown in Fig. 2, we first consider the predictions of the two strategies individually, and then we merge both contributions to come to one final prediction of the region enclosing the particular variety of the sentence.</p>
    </sec>
    <sec id="sec-3">
      <title>3.1. Data Augmentation</title>
      <p>The first step was to increase the size of the training set, both to produce better training of the classifier and to equalize the distribution of the samples for each region, which was initially extremely unbalanced (cfr. § 2.1).</p>
      <p>Our approach for implementing data augmentation was to generate new sentences that are equal to the original one, but with one random word substituted by a different, semantically similar word. We utilized established approaches from the literature [<xref ref-type="bibr" rid="ref9">9</xref>, <xref ref-type="bibr" rid="ref10">10</xref>] to implement word substitution in the text, changing the value of that portion of the text and keeping all the rest unchanged. As an example, the original sentence “Fa ancora na sfaccim e per andare in #moto sulle mie montagne” was transformed into “Fa ancora na sfaccim e per tornare in #moto sulle mie montagne”, since the verbs “andare” and “tornare” are semantically similar and do not change the global meaning of the sentence.</p>
      <p>In order to find similar words, we used a Word Embedding model for Italian (https://github.com/MartinoMensio/it_vectors_wiki_spacy), chosen due to the similarity between Italian and the majority of linguistic varieties spoken in Italy, fine-tuned on our training set; we then selected one among the vectors closest to the vector of the word we want to substitute, using the library function “Word2Vec.most_similar” (https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html).</p>
      <p>The augmented dataset ensures an equal distribution of sentences (5549 sentences each) for every region. This quantity corresponds to the initial number of sentences for Lazio, the region with the highest number of entries. For each region except Lazio, the number of newly generated sentences equalled the difference between the initial number of sentences in that region and the initial number of sentences in Lazio. Therefore, if a region had a lower initial sentence count, the same sentence was used more frequently for augmentation (the number of times a single sentence is used for augmentation is thus 5549 divided by the number of sentences). As an example, consider Sicily, which had 608 initial sentences: each sentence has been used ≈9 times to create new data, resulting in 4941 newly generated sentences.</p>
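      <p>A minimal sketch of this substitution step, assuming a gensim-style word-embedding model for Italian loaded as KeyedVectors (the file path, function, and variable names are ours, for illustration only; the fine-tuning of the embeddings on the training set is omitted):</p>
      <preformat>
import random
from gensim.models import KeyedVectors

# Hypothetical path to an Italian word-embedding model in word2vec format.
vectors = KeyedVectors.load_word2vec_format("it_vectors.txt")

def augment(sentence: str) -> str:
    """Return a copy of `sentence` in which one random in-vocabulary word
    is replaced by its nearest neighbour in the embedding space."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w in vectors.key_to_index]
    if not candidates:
        return sentence  # nothing we can safely substitute
    i = random.choice(candidates)
    # most_similar returns (word, cosine similarity) pairs, best match first.
    substitute, _ = vectors.most_similar(words[i], topn=1)[0]
    words[i] = substitute
    return " ".join(words)

# augment("Fa ancora na sfaccim e per andare in #moto sulle mie montagne")
# could yield "... per tornare in #moto ..." if "tornare" is the nearest
# neighbour of "andare" in the embedding space.
      </preformat>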
    </sec>
    <sec id="sec-4">
      <title>3.2. Prediction Through Language Model</title>
      <sec id="sec-4-1">
        <title>The process for classifying the sentences using BERT was</title>
        <p>the following: (i) we fine-tune BERT on our augmented
training set (cfr.3§.1); (ii) for every sentence in the test
set, we use the model to get a prediction on the regional
variety of the sentence; (iii) a confidence score for each
region is returned.</p>
      </sec>
      <sec id="sec-4-2">
        <title>The first step was to increase the size of the training set,</title>
        <p>both for producing better training of the classifier, and</p>
      </sec>
      <sec id="sec-4-3">
        <title>5https://www.vocabolariofiorentino.it/ricerca/lermetmriie,ved</title>
        <p>on May 10th, 2023 7https://github.com/MartinoMensio/it_vectors_wiki_spacy
6https://www.theromanpost.com/2016/06/dizionario-dialetto- 8https://tedboy.github.io/nlps/generated/
romanescor,etrieved on May 10th, 2023 generated/gensim.models.Word2Vec.most_similar.html
Standard-Track_Run-1
Standard-Track_Run-2</p>
        <p>Standard-Track_Run-3
Logistic_regression baseline</p>
        <p>Most_frequent baseline</p>
        <p>Special-Track_Run-1
Special-Track_Run-2</p>
        <p>Special-Track_Run-3
Logistic_regression baseline</p>
        <p>Most_frequent baseline</p>
        <p>Precision</p>
        <p>Recall</p>
        <p>Macro F1</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.3. Prediction Through Vocabularies</title>
      <p>Among the vocabularies presented in § 2.3, we selected the ones relevant to the specific task (this is discussed in § 4). The first step was to normalize the format of the different vocabularies, associating each regional word with its frequency value. For the vocabularies retrieved from the web, we first performed web-scraping, disregarding the standard Italian translation of the terms, and then assigned 1 as the frequency for each one of the vocabulary entries. Since the vocabularies generated from the training set came with different frequencies, we had to normalize those values, reassigning a value between 0 and 1 while maintaining the same proportions as the original frequencies.</p>
      <p>A process of normalization was also performed on the vocabulary words: accented characters were converted into unaccented equivalents, IPA representations present in the terms were deleted, and combinations of words (sometimes multiple entries were considered as one in the web vocabulary) were divided into individual entries (e.g., the entry “c(o/u)mpà” becomes two different entries “compa” and “cumpa”).</p>
      <p>Once we obtained all the normalized vocabularies, we merged them into one single vocabulary, summing all frequencies for the same term in the same variety (e.g., if “compà” has frequency 1 in Vocabulary_A for Sicilian and frequency 0.9 in Vocabulary_B, it will have frequency 1.9 in the global vocabulary, assuming there are only 2 vocabularies).</p>
      <p>Finally, to predict the regional variety of a sentence, we sum the frequencies of each word in the sentence for each regional variety, using the frequencies present in the global vocabulary. The region with the highest score is the predicted regional variety of the sentence.</p>
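      <p>A minimal sketch of the frequency normalization, merging, and sentence scoring described above (the data structures and function names are ours; the real vocabularies are the JSON files described in § 2.3):</p>
      <preformat>
def normalize(vocab: dict) -> dict:
    """Rescale each region's word frequencies to [0, 1],
    preserving their original proportions."""
    normalized = {}
    for region, freqs in vocab.items():
        top = max(freqs.values())
        normalized[region] = {w: f / top for w, f in freqs.items()}
    return normalized

def merge(vocabularies: list) -> dict:
    """Merge normalized vocabularies by summing the frequencies
    of the same term in the same variety."""
    merged = {}
    for vocab in vocabularies:
        for region, freqs in vocab.items():
            bucket = merged.setdefault(region, {})
            for word, freq in freqs.items():
                bucket[word] = bucket.get(word, 0.0) + freq
    return merged

def vocab_confidences(sentence: str, merged: dict) -> dict:
    """Score each regional variety V[v] by summing the frequencies of the
    sentence's words in the merged (global) vocabulary."""
    words = sentence.lower().split()
    return {region: sum(freqs.get(w, 0.0) for w in words)
            for region, freqs in merged.items()}
      </preformat>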
    </sec>
    <sec id="sec-6">
      <title>3.4. Global Prediction</title>
      <p>Our final prediction on the region of origin of a given sentence leverages both the predictions from BERT (cfr. § 3.2) and from the vocabularies (cfr. § 3.3). We introduce a variable K which regulates the proportions of the contributions given by each of the 2 predictions. We define the sum of the 2 predictions for every regional variety v as follows:</p>
      <p>S(v) = B[v] + K ∗ V[v]   (1)</p>
      <p>where B[v] is the confidence of BERT for the region v and V[v] is the confidence of the vocabularies algorithm for the region v.</p>
      <p>The final prediction for the believed variety of the sentence is defined by the following expression:</p>
      <p>global_pred = arg max_{v ∈ R} S(v)   (2)</p>
      <p>where R is the set of all regional varieties.</p>
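      <p>Equations (1) and (2) translate directly into code; a sketch, reusing the bert_confidences and vocab_confidences helpers assumed in the sketches of §§ 3.2 and 3.3:</p>
      <preformat>
def global_prediction(sentence: str, merged_vocab: dict, K: float = 1.0) -> str:
    """Combine the two confidences as S(v) = B[v] + K * V[v] (Eq. 1)
    and return the highest-scoring region (Eq. 2)."""
    B = bert_confidences(sentence)                 # B[v], from the LLM
    V = vocab_confidences(sentence, merged_vocab)  # V[v], from the vocabularies
    scores = {v: B[v] + K * V.get(v, 0.0) for v in B}  # Eq. (1)
    return max(scores, key=scores.get)                 # Eq. (2): arg max
      </preformat>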
    </sec>
    <sec id="sec-7">
      <title>4. Experimental Setup</title>
      <p>Our experiments were divided into 2 parts: the first one aims at classifying a regional variety among all the 20 Italian regions and was targeted at the Standard Track of GeoLingIt (cfr. § 4.1), while the second one aims at classifying only the Toscana and Lazio varieties and was targeted at the Special Track of GeoLingIt (cfr. § 4.2).</p>
      <sec id="sec-7-1">
        <title>4.1. Standard Track</title>
        <p>For classifying one variety among the 20 Italian regions, we used the 3 vocabularies presented in § 2.3 and the BERT classifier, as described in § 3. During the fine-tuning process of BERT, we experimented with different numbers of training epochs, learning rates, and values of K (cfr. § 3.4), and used the 3 configurations that gave the best results on the validation set as the 3 runs of the task:</p>
        <p>• Run-1: 2 epochs, learning rate 6 ∗ 10^-5, K=1;</p>
        <p>• Run-2: 2 epochs, learning rate 2 ∗ 10^-5, K=1;</p>
        <p>• Run-3: 2 epochs, learning rate 6 ∗ 10^-5, K=0.5.</p>
        <p>Although we tried with a lower and a greater number of epochs, we observed that 2 was the best value for the relatively small training set that we used for fine-tuning. The same approach was used to set the Adam optimiser’s learning rate: we started with a learning rate of 2 ∗ 10^-5, as suggested by the TensorFlow documentation, and gradually decreased or increased it. Finally, we noted that relying too much on the vocabularies rather than the LLM (using K greater than 1) did not bring good results, so we focused on values of K between 0 and 1.</p>
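        <p>For illustration, such a run configuration could be expressed as in the sketch below; we show the generic Hugging Face transformers Trainer API (which uses an AdamW optimiser by default) rather than our actual TensorFlow training code, and the dataset variables and output directory are assumptions:</p>
        <preformat>
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

NUM_REGIONS = 20  # one label per Italian region (standard track)

def finetune_run1(train_dataset, eval_dataset):
    """Run-1 configuration: 2 training epochs, learning rate 6e-5.
    The datasets are assumed to be already tokenized."""
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=NUM_REGIONS)
    args = TrainingArguments(
        output_dir="galliz-bert-run1",  # hypothetical output directory
        num_train_epochs=2,
        learning_rate=6e-5,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return model
        </preformat>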
      </sec>
      <sec id="sec-7-2">
        <title>4.2. Special Track</title>
        <p>We followed the same process for the special track, using BERT fine-tuned on a corpus of only Toscana and Lazio samples (filtered out from the augmented dataset described in § 3.1), and the 4 vocabularies presented in § 2.3: 2 generated from the original training set, one for the Toscana lexicon, and one for the Lazio lexicon. We tried different configurations and used the best 3 as the runs for the special track:</p>
        <p>• Run-1: 2 epochs, learning rate 4 ∗ 10^-5, K=0.5;</p>
        <p>• Run-2: 2 epochs, learning rate 4 ∗ 10^-5, K=0.1;</p>
        <p>• Run-3: 2 epochs, learning rate 2 ∗ 10^-5, K=0.1.</p>
        <p>Again, we found that 2 epochs was the optimal value on the validation set, and that the best learning rate values were around the value suggested in the literature. In this case, we observed that the contribution of the vocabularies did not bring many advantages, and therefore we kept a low value of K.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>5. Results</title>
      <p>Table 1 shows the results of our 3 runs for the standard track (classification on the 20 Italian regions) and our 3 runs for the special track (classification on the Toscana and Lazio varieties), as described in section 4. In addition, we also report the results for the 2 baselines provided by the organizers of the tasks: one based on a logistic regression model, and one which simply predicts the most frequent label in the training set for every inference, as described by Ramponi and Casula [<xref ref-type="bibr" rid="ref11">11</xref>]. These baselines are reported in the table once per group and are relative to the task of the corresponding group. Following the task’s indications, we employ macro-averaged precision, recall, and f1-score as evaluation metrics.</p>
      <p>[Table 1: Precision, Recall, and Macro F1 for Standard-Track Runs 1-3 and Special-Track Runs 1-3, each compared against the Logistic_regression and Most_frequent baselines.]</p>
    </sec>
    <sec id="sec-9">
      <title>6. Discussion</title>
      <p>For the standard track group, we can observe that our best run obtained an f1-score 8 times higher than the Most_frequent baseline and 10 points higher than the Logistic_regression baseline. Comparing the 3 runs, besides the differences in the choice of hyper-parameters (a learning rate of 6 ∗ 10^-5 appears to give better results), it is interesting to note that the use of the vocabulary has a positive effect on the performance, since we observe an improvement in all evaluation metrics when we pass from K = 0.5 to K = 1 (from Run-3 to Run-1, which are identical in all other parameters).</p>
      <p>For the special track group, we improved on the Most_frequent and Logistic_regression baselines by 38 and 11 points respectively in terms of f1-score. Here, in contrast, the use of the vocabulary seems to be less influential than the learning rate, which works better with a value of 4 ∗ 10^-5.</p>
      <p>This difference between the two tasks can be partially explained by the relative lexical similarity between the Toscana and Lazio varieties: in both regions, the regional spoken language does not differ much from standard Italian, and therefore the strategy of distinguishing a sentence on the basis of the vocabulary does not seem to be the right approach. On the other side, when it comes to classifying a sentence among all the 20 Italian regional varieties, the lexical terms differ significantly, and therefore our prediction based on the vocabularies proves to be a great improvement over the classification made through LLMs only.</p>
    </sec>
    <sec id="sec-10">
      <title>7. Conclusion</title>
      <p>In this paper, we addressed the classification of non-standard Italian Twitter posts according to their regional variety, by combining the prediction obtained using a language model (BERT) with an algorithm utilizing dictionaries of regional varieties and word frequency. We contribute to two tasks: a classification at the country level, and a classification according to a subset of regions (Lazio and Toscana).</p>
      <p>After briefly introducing the Italian regional varieties, the tasks addressed, and the methodology applied, a description of the data provided and of the additional resources retrieved (e.g., online vocabularies) follows. We then explain in detail the methodology used, starting from the augmentation of the training set data and going through the different techniques used for obtaining the intermediate and final predictions for both tasks. Finally, the evaluation results are shown: the first task achieved a macro F1 score of 0.56, outperforming both the logistic regression baseline (0.46) and the most frequent baseline (0.07); in the second task, even if the macro F1 score and the recall show substantial improvement, the overall precision is 11 points lower with respect to the logistic regression baseline (0.81 vs 0.92).</p>
      <p>Overall, the knowledge captured from regional dictionaries and word frequencies seems to be effective in capturing the nuances and characteristics of regional varieties. Furthermore, by leveraging BERT’s bidirectional pre-training, the system can consider the entire context of a sentence, thereby contributing to accurate predictions. The availability of curated resources, such as regional dictionaries, played an important role in enhancing the system’s performance. However, we acknowledge the limitations of the vocabularies used in our experiments, and further efforts should be made to expand and refine these resources. Finally, we think there are still areas for improvement: future research could explore more sophisticated and tailored methods for data augmentation and investigate alternative techniques for integrating vocabulary lexical analysis with BERT. Moreover, the inclusion of additional linguistic features and the exploration of ensemble methods could potentially lead to further performance improvements.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Moseley</surname>
          </string-name>
          ,
          <source>Atlas of the World's Languages in Danger, Memory of peoples Series</source>
          , UNESCO Publishing,
          <year>2010</year>
          . URL: https://books.google.it/books?id=kFVthqmDs_kC.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simi</surname>
          </string-name>
          ,
          <article-title>The evalita 2014 dependency parsing task</article-title>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <article-title>Ensemble system for part-of-speech tagging</article-title>
          ,
          <source>Proceedings of EVALITA 9</source>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>De Gemmis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          , et al.,
          <article-title>Alberto: Italian bert language understanding model for nlp challenging tasks based on tweets</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>2481</volume>
          ,
          CEUR
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          ,
          <article-title>Nlp for language varieties of italy: Challenges and the path forward</article-title>
          ,
          <source>arXiv preprint arXiv:2209.09757</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Casula</surname>
          </string-name>
          ,
          <article-title>GeoLingIt at EVALITA 2023: Overview of the geolocation of linguistic variation in Italy task</article-title>
          ,
          <source>in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>
          , CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          ,
          <article-title>Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian</article-title>
          ,
          <source>in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>
          , CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of NAACL-HLT</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Task-oriented dialogue system as natural language generation</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2698</fpage>
          -
          <lpage>2703</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Labruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <article-title>Fine-tuning bert for generative dialogue domain adaptation</article-title>
          ,
          <source>in: Text, Speech, and Dialogue: 25th International Conference, TSD 2022, Brno, Czech Republic, September 6-9, 2022, Proceedings</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>513</fpage>
          -
          <lpage>524</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Casula</surname>
          </string-name>
          ,
          <article-title>Diatopit: A corpus of social media posts for the study of diatopic language variation in italy</article-title>
          ,
          <source>in: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>199</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>