<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>barotti at GeoLingIt: Beyond Boundaries, Enhancing Geolocation Prediction and Dialect Classification on Social Media in Italy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alkis Koudounas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio Giobergia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irene Benedetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Monaco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Cagliero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Apiletti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Baralis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Control and Computer Engineering</institution>
          ,
          <addr-line>Politecnico di Torino, Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MAIZE</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The proliferation of social media platforms has presented researchers with valuable avenues to examine language usage within diverse sociolinguistic frameworks. Italy, renowned for its rich linguistic diversity, provides a distinctive context for exploring diatopic variation, encompassing regional languages, dialects, and variations of Standard Italian. This paper presents our contributions to the GeoLingIt shared task, focusing on predicting the locations of social media posts in Italy based on linguistic content. For Task A, we propose a novel approach, combining data augmentation and contrastive learning, that outperforms the baseline in region prediction. For Task B, we introduce a joint multi-task learning approach leveraging the synergies with Task A and incorporate a post-processing rectification module for improved geolocation accuracy, surpassing the baseline and achieving first place in the competition.</p>
      </abstract>
      <kwd-group>
<kwd>natural language processing</kwd>
        <kwd>dialect localization</kwd>
        <kwd>diatopic variation</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The advent of social media has significantly facilitated the investigation of language usage across diverse sociolinguistic aspects. Italy, in particular, stands out as a compelling case study due to its remarkable diatopic variation, encompassing an array of local languages, dialects, and regional manifestations of Standard Italian within a relatively confined geographic area [<xref ref-type="bibr" rid="ref1">1</xref>]. This linguistic heterogeneity stems from historical and cultural influences, with distinct lexicons, grammatical structures, and pronunciations shaping the various language varieties present in the country, each bearing the imprints of historical events, geographical isolation, and cultural traditions. Furthermore, the integration of regional varieties of Standard Italian further enriches the linguistic mosaic of Italy [<xref ref-type="bibr" rid="ref2">2</xref>]. Within the digital realm, particularly on platforms like Twitter, Italian speakers leverage these linguistic variations to express their social identities and affiliations, thereby contributing to the visibility and preservation of these diverse linguistic forms in the online domain. This intriguing sociolinguistic phenomenon has attracted researchers from the computational linguistics and sociolinguistics domains, providing valuable insights into the nuances of language variation in Italy.</p>
      <p>The GeoLingIt shared task [<xref ref-type="bibr" rid="ref3">3</xref>] at Evalita 2023 [<xref ref-type="bibr" rid="ref4">4</xref>] aims to advance the current knowledge of linguistic variation in Italy by focusing on the prediction of the locations of social media posts from Twitter based solely on linguistic content. In this paper, we present our contributions to the GeoLingIt shared task. GeoLingIt proposes two separate tasks: Subtask A, a classification task that aims to identify the region of provenance of a tweet exhibiting non-Standard Italian language, and Subtask B, a regression task to identify the fine-grained location of provenance of the same tweets, in terms of longitude and latitude coordinates. Both tasks are based on the DiatopIt dataset [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>For Task A, we first gather additional data sources specifically for the various Italian regions. We propose a novel approach involving a pre-training step of state-of-the-art transformer-based models with a contrastive learning strategy leveraging data augmentation techniques. This approach outperforms the baseline, demonstrating the effectiveness of leveraging pre-training and contrastive learning to improve the accuracy of region prediction. For Task B, we introduce a joint multi-task learning approach that addresses the challenge of fine-grained variety geolocation. Our approach outperforms the baseline by simultaneously tackling both tasks. Additionally, we introduce a post-processing rectification module that refines the predicted coordinates and ensures their alignment within the boundaries of Italy. This module enhances the reliability of the predicted locations, making them more precise and geographically accurate.</p>
<p>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT.
alkis.koudounas@polito.it (A. Koudounas)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
<p>Our proposed methods not only achieve state-of-the-art performance in terms of location prediction (with a test error lower than 100 km, compared to an error of more than 250 km for the baseline methods) but also offer some valuable insights into Italy’s diverse linguistic landscape. The code to reproduce all our experiments is available at https://github.com/koudounasalkis/barotti-GeoLingIt2023.</p>
    </sec>
<sec id="sec-2">
      <title>2. Related work</title>
      <p>The analysis of linguistic varieties and dialects is an emerging topic in the field of Natural Language Processing (NLP) [<xref ref-type="bibr" rid="ref6">6</xref>]. Efforts have been made to address and incorporate variations in corpora, such as pronunciation and spelling differences. Gelly et al. [<xref ref-type="bibr" rid="ref7">7</xref>] and Elfeky et al. [<xref ref-type="bibr" rid="ref8">8</xref>] addressed language varieties for speech recognition of the English language, emphasizing that the task is even more challenging as the space of dialects is broad. Many researchers have attempted to address the challenge of language variation by leveraging social media data. Grieve et al. [<xref ref-type="bibr" rid="ref9">9</xref>] compared regional patterns, analyzing both dialect labels and geolocation, finding strong correlations between the two sources. Sadat et al. [<xref ref-type="bibr" rid="ref10">10</xref>] have shown that probabilistic models of language identification can be used to identify Arabic dialects in tweets. Efforts in the direction of propagating information (e.g., sentiment) from high-resource languages (e.g., Italian) to low-resource ones (e.g., regional variations) through vector space alignments have shown promising results, as shown by Giobergia et al. [<xref ref-type="bibr" rid="ref11">11</xref>] across languages (from English to other ones). Recently, Italian computational linguistics research has encountered difficulties due to the limited availability of large-scale datasets specifically tailored for the language, as emphasized in a recent study [<xref ref-type="bibr" rid="ref12">12</xref>]. Unfortunately, the substantial computational resources needed for pre-training language models have resulted in only a few architectures being accessible for Italian.</p>
      <p>Moving to the geolocation task, Han et al. [<xref ref-type="bibr" rid="ref13">13</xref>] proposed a method for geolocation prediction based on identifying location-indicative words. Nevertheless, the work was not focused on dialects. Eisenstein et al. [<xref ref-type="bibr" rid="ref14">14</xref>] inspected the correlation between geographical information and sociolinguistic associations instead of predicting the demographic attributes of users based on their tweets and their position. Other works have focused on the geolocation task [<xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>], but they do not take into account the language varieties.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section outlines the methodology adopted to automatically ascertain the region (Task A) and the coordinates, in terms of latitude and longitude (Task B), of the origin of a tweet, considering the pronounced class imbalance prevalent within the training dataset. To tackle these challenges, we propose to use two key techniques: data augmentation with contrastive pre-training, and multi-task learning.</p>
<sec id="sec-3-4">
        <title>3.1. Task A</title>
        <p>The goal of Task A is to identify the origin region of a given tweet. We denote the set of regions as R. Data augmentation plays a crucial role in our methodology, as it is implemented to address the class imbalance during the training process. In the initial data collection phase, we obtain a substantial amount of regional dialect data from various online sources (Italian Wikipedia: it.wikipedia.org; Dialettando: dialettando.com; accessed May 2023). Dumps, editions, and further information on the collected data are available in the official repository. We then pre-process them, obtaining an expanded vocabulary that is utilized for the purpose of data augmentation. We denote the vocabulary for each region r as V_r (r ∈ R).</p>
        <p>We adopt a substitution approach to words in tweets representing language variations to build an augmented version of the original dataset. Each tweet t = {w_1, ..., w_n} belonging to region r is augmented by randomly replacing words that are contained in V_r with other words from the same region, with a random probability. More formally, each term w_i ∈ V_r is replaced with w′_i, defined as follows:</p>
        <p>w′_i = w̃ ∼ V_r if p ≤ p′, and w′_i = w_i otherwise, (1)</p>
        <p>where p′ is the probability of replacing each term w_i with a different one w̃ drawn from the same region’s vocabulary V_r, and p is sampled uniformly at random for each term. We experimentally observe the best results in terms of performance for p′ = 0.5.</p>
        <p>Regarding the contrastive learning strategy, we pre-train the model to enhance its ability to discern whether two tweets belong to the same region. During this preliminary training phase, the model learns to differentiate between tweet pairs and their corresponding regional affiliations. Given two tweets, denoted as t_i and t_j, along with their labels y_i and y_j, the model is trained to minimize a loss that facilitates this discrimination:</p>
        <p>ℒ_contr = − log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(sim(z_i, z_k)/τ) ], (2)</p>
        <p>where z_i and z_j are the latent representations learned by the model for the tweets t_i and t_j. We set the temperature parameter τ equal to 1, and the function sim is the cosine similarity. In this approach, we randomly select a sample from the dataset to serve as an anchor. We then create a positive data point by augmenting the anchor with words from the same region and a negative sample by augmenting the anchor with words from a different region.</p>
        <p>We call this approach Contrastive PT &amp; Data++ (see Section 5 for more details). Pre-training with contrastive learning and data augmentation techniques is specifically devised to mitigate the challenges posed by the imbalanced class distribution within the dataset. This has been proven beneficial in several tasks and domains [<xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>].</p>
        <p>Additionally, we empirically observed that a simple logistic regression model performs well in terms of F1 score on various minority classes. We attribute this to its lower model capacity, reducing the amount of overfitting that may occur on minority classes. To leverage this insight, we propose an exclusive class assignment mechanism (named Entropy-based Ensemble in the following) that uses the confidence of the BERT-based model (further information about the model in Section 4). In other words, when the BERT-based prediction is made with low confidence (according to a specific empiric threshold), we replace the overall prediction with the one made by the logistic regression if the latter’s confidence is higher.</p>
      </sec>
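<p>As an illustration, the substitution augmentation of Eq. (1) and the anchor/positive/negative construction used for the contrastive pre-training can be sketched as follows. This is a minimal sketch: the region names and vocabularies are toy assumptions, not the collected data.</p>
<preformat>
```python
import random

def augment(tokens, vocab, p_prime=0.5, rng=random):
    """Eq. (1): replace each token that appears in the region vocabulary,
    with probability p', by another word drawn from the same vocabulary."""
    out = []
    for w in tokens:
        if w in vocab and p_prime > rng.random():
            out.append(rng.choice(sorted(vocab)))
        else:
            out.append(w)
    return out

def contrastive_triplet(tokens, region, vocabs, rng=random):
    """Build (anchor, positive, negative): the positive augments the anchor
    with words from its own region, the negative with words from another one."""
    other = rng.choice([r for r in vocabs if r != region])
    positive = augment(tokens, vocabs[region], rng=rng)
    negative = augment(tokens, vocabs[other], rng=rng)
    return tokens, positive, negative
```
</preformat>
<p>The positive/negative pairs produced this way feed the contrastive objective of Eq. (2).</p>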
<sec id="sec-3-5">
        <title>3.2. Task B</title>
        <p>Task B is addressed jointly with Task A through a multi-task learning approach based on a weighted conjunction of loss functions. For Task A, we aim to maximize the F1 score by minimizing the corresponding cross-entropy loss. For Task B, we minimize the Haversine distance by minimizing the mean squared error (MSE) loss, which helps to reduce the difference between the predicted and target coordinates. Given a tweet t, the model minimizes a loss function ℒ_MT that encompasses both tasks, i.e., the weighted conjunction of a standard cross-entropy loss for classification ℒ_CE and an L2 loss for regression ℒ_MSE:</p>
        <p>ℒ_MT = ℒ_CE + λ ℒ_MSE, (3)</p>
        <p>where:</p>
        <p>ℒ_CE = ℒ(ŷ, y) = − Σ_{c=1}^{C} y_c log(ŷ_c), (4)</p>
        <p>ℒ_MSE = ℒ(x̂_lat, x_lat) + ℒ(x̂_lon, x_lon), (5)</p>
        <p>with each term of ℒ_MSE being the squared error on the corresponding coordinate.</p>
      </sec>
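<p>A minimal sketch of the weighted conjunction of Eqs. (3)–(5), written in plain Python with a one-hot region target. The λ weight and the dictionary-based interface are illustrative assumptions, not the paper’s implementation.</p>
<preformat>
```python
import math

def multitask_loss(probs, region, coords_pred, coords_true, lam=1.0):
    """Weighted conjunction of the two objectives (Eqs. 3-5):
    cross-entropy for the region class, squared error for (lat, lon)."""
    ce = -math.log(probs[region])  # Eq. (4) with a one-hot target
    mse = sum((a - b) ** 2 for a, b in zip(coords_pred, coords_true))  # Eq. (5)
    return ce + lam * mse  # Eq. (3)
```
</preformat>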
<sec id="sec-3-6">
        <p>We estimate the confidence of the BERT-based model using the entropy of its predicted probabilities. Lower entropy is associated with high certainty (i.e., the model predicts one class with high probability and all the others with low probability), and vice versa.</p>
        <p>For the training of the logistic regression, we use as input, for each tweet t, the respective bag of words as well as the vector c = {|t ∩ V_r|, r ∈ R}, i.e., the number of words within each tweet that belong to each region in our dictionary.</p>
      </sec>
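<p>The entropy-based fallback described above can be sketched as follows. The threshold value is illustrative; the empiric threshold used in the experiments is not reproduced here.</p>
<preformat>
```python
import math

def entropy(probs):
    # Shannon entropy of a predicted distribution; low entropy = high confidence.
    return -sum(p * math.log(p) for p in probs if p > 0)

def ensemble_predict(bert_probs, lr_probs, threshold=0.5):
    """Exclusive class assignment: keep the BERT prediction unless its
    entropy exceeds the threshold (low confidence) AND the logistic
    regression is more confident (lower entropy); then use the LR one."""
    h_bert, h_lr = entropy(bert_probs), entropy(lr_probs)
    if h_bert > threshold and h_bert > h_lr:
        return max(range(len(lr_probs)), key=lambda i: lr_probs[i])
    return max(range(len(bert_probs)), key=lambda i: bert_probs[i])
```
</preformat>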
    </sec>
<sec id="sec-4">
      <title>4. Experimental setting</title>
      <p>Models. We consider various models, including Italian BERT [<xref ref-type="bibr" rid="ref20">20</xref>] cased and uncased versions, LABSE [<xref ref-type="bibr" rid="ref21">21</xref>], and BART-IT [<xref ref-type="bibr" rid="ref22">22</xref>], models pre-trained for the Italian language. We find the best model, based on the performance obtained on the validation set, to be bert-base-italian-uncased; thus, this is the base model used to address the tasks. All the pre-trained checkpoints of these models are taken from the Hugging Face hub repository (https://huggingface.co/dbmdz/bert-base-italian-uncased, latest access: May 2023).</p>
      <p>Hyperparameter Setup. We ran a manual hyperparameter search and followed fine-tuning procedures and guidelines from the relevant literature. We provide detailed information about the models used for the evaluation, the hyperparameter setup, and the fine-tuning procedure in the official project repository.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Table 1 presents results for Task A, evaluating different methods based on their validation and test F1 macro scores. The proposed approach, which combines contrastive pre-training and data augmentation, demonstrated superior performance compared to the other methods. It achieved the highest F1 scores on the validation and test sets, reaching 72.61% and 53.18%, respectively. We believe that the reason for this performance difference lies in the presence of new out-of-distribution samples in the test set, which our model struggles to recognize accurately. Interestingly, while the proposed multi-task model excels in Task B (Multi-Task FT), surpassing the baseline models’ performance as shown later, it fails to deliver satisfactory results for Task A. Nonetheless, it still outperforms the baseline. Conversely, the Entropy-based Ensemble method, which enhances BERT’s performance with the logistic regression’s one, achieves a high score on the validation set. However, it only slightly outperforms the Multi-Task approach on the test set, with an improvement of 0.02%.</p>
      <p>In Table 2, various methods are evaluated for Task B based on the Haversine distance metric. The Multi-Task approach demonstrated substantial enhancements compared to the baselines, achieving a validation distance of 111.05 km and a test distance of 120.02 km. This emphasizes that incorporating a multi-objective function helps the model better tackle the given task. Moreover, training the model in a multi-task manner, starting from the pre-trained model that underwent contrastive learning and data augmentation and was fine-tuned for Task A (referred to as “Continuous Learning” in Table 2), resulted in a significant performance boost (test distance of 98.79 km). This improvement is likely attributed to the model already possessing domain knowledge, leading to improved performance on the test set. Lastly, the additional “Beyond-Boundaries” rectification module effectively refines the precision of the model, achieving the best performance on the test set (97.74 km) and securing the first position in the GeoLingIt challenge.</p>
      <p>Table 2 — Haversine distance (km), validation / test. Baselines: 301.65 / 281.04 and 281.03 / 263.35; Multi-Task: 111.05 / 120.02; Continuous Learning: 99.50 / 98.79; Beyond-Boundaries: 98.41 / 97.74.</p>
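<p>Task B is scored with the Haversine distance. A sketch of the metric follows, together with a crude bounding-box stand-in for the “Beyond-Boundaries” rectification: the module’s actual implementation is not detailed here, and the box coordinates are approximate assumptions.</p>
<preformat>
```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, the evaluation metric for Task B."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate bounding box of Italy (assumed values, for illustration only).
ITALY_LAT = (35.5, 47.1)
ITALY_LON = (6.6, 18.5)

def rectify(lat, lon):
    """Clamp predicted coordinates inside the bounding box of Italy."""
    lat = min(max(lat, ITALY_LAT[0]), ITALY_LAT[1])
    lon = min(max(lon, ITALY_LON[0]), ITALY_LON[1])
    return lat, lon
```
</preformat>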
<sec id="sec-4-2">
        <title>5.1. Logistic regression insights</title>
        <p>As discussed, we used logistic regression in combination with the BERT-based solution to override low-confidence predictions. The weights learned by the logistic regression can be interpreted as the importance the model assigns to each feature’s presence (and magnitude). The features passed are either the numbers of occurrences of various words or the overall number of words known to belong to the various regions (dialects). Table 3 shows, for each region, the ten features with the largest weights. Some words may have a negative, offensive, or misogynistic connotation; we still report these results for the sake of thoroughness.</p>
        <p>Table 3 — top-ten logistic regression features per region:
Abruzzo: fregna, diaco, st, statt, &lt;# tokens molise&gt;, &lt;# tokens puglia&gt;, asin, cussu, abruzzo, ju.
Basilicata: fandom, mezzarella, accusci, ah, pazz, aggia, cazz, hahahaha, cazzu, trmon.
Calabria: &lt;# tokens calabria&gt;, calabria, aru, nto, ccu, ciota, capu, jamu, frica, fimmina.
Campania: napoli, foss, semp, merd, statt, strunzat, ua, ata, sciem, lota.
Emilia Romagna: socmel, maial, tin, sempar, bologna, cinno, umarell, cagher, veh, soccia.
Friuli-Venezia Giulia: femo, triestin, magnar, ocio, ga, gavemo, mona, trieste, xe, orpo.
Lazio: avoja, annamo, avemo, mortacci, artra, nse, artro, aspettamo, ar, stamo.
Liguria: ou, cusci, abbelinati, porcu, rasciun, emma, pestu, semmu, zena, zeneize.
Lombardia: dighel, pirlata, pheega, nanca, danee, milano, gh, sciuri, gnaro, sciur.
Marche: en, sperem, ecche, roscio, ancona, marche, scritturebrevi, daje, diaulu, sblab2021.
Molise: pipponi, ah, buongiornoatutti, venta, fior, fatt, vientu, paes, sort, fake.
Piemonte: picio, piou, boja, piemonte, suma, speruma, fauss, nen, piciu, babaciu.
Puglia: salentu, capu, mang, isolitiignoti, munnu, mme, trimone, arret, trmon, bari.
Sardegna: ajo, macca, sardegna, &lt;# tokens sardegna&gt;, tottu, biri, tontu, nudda, sesi, itte.
Sicilia: chidda, quantu, nuddu, camurria, fici, soddi, carusi, bonu, semu, &lt;# tokens sicilia&gt;.
Toscana: guasi, nsomma, &lt;# tokens toscana&gt;, caa, siuro, boja, diaccio, oglioni, gnamo, tope.
Trentino-Alto Adige: 10, bicer, maial, tasi, ghe, sberloni, stinc, sior, tai, pu.
Umbria: pija, ch, &lt;# tokens umbria&gt;, er, porchetto, mejo, bbona, mixatino, je, umbria.
Valle d’Aosta: carbonada, buonissimo, int, piacione, nasconderti, max, bosc, devise, cher, vivre.
Veneto: varda, sboro, dixe, queo, casin, venessia, ciava, &lt;# tokens veneto&gt;, xe, veneto.</p>
        <p>In some cases (underlined in the table), actual names of regions and cities are also relevant indicators of a tweet’s origin. This is a reasonable result, as tweets made in a certain dialect are intuitively likely to mention geographic places related to the dialect itself. We note that the features containing the counts of the words that belong to the various language varieties are also sometimes considered useful indicators by the logistic regression. In most cases, the count used is the relevant one for the region of interest (for example, the number of tokens from the Venetian language varieties, &lt;# tokens veneto&gt;, is a valuable feature to detect the “Veneto” region). The only exception occurs for Abruzzo, where the presence of both token counts from Molise and Puglia is considered a helpful indicator. Given the geographic proximity of these regions, we find this result to be reasonable.</p>
        <p>Finally, some situations arise where words that are generally not characterizing for certain regions still emerge as significant ones (e.g., “hahahaha” for Basilicata, or “sblab2021” for Marche). We believe this to be an overfitting problem due to the lack of meaningful data on some of the minority regions: as such, it could be addressed by collecting additional data for those regions.</p>
      </sec>
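<p>The feature construction for the logistic regression (bag of words plus the per-region overlap counts c = {|t ∩ V_r|, r ∈ R} described in Section 3) can be sketched as follows; the vocabulary and region names are toy assumptions.</p>
<preformat>
```python
from collections import Counter

def lr_features(tokens, vocabs, word_index):
    """Feature vector for the logistic regression: token counts (bag of
    words) over a fixed word index, followed by, for each region, the
    number of distinct tweet words found in that region's vocabulary."""
    counts = Counter(tokens)
    bow = [counts[w] for w in word_index]
    overlap = [len(set(tokens).intersection(vocabs[r])) for r in sorted(vocabs)]
    return bow + overlap
```
</preformat>
<p>The learned coefficients over these features are what Table 3 reports per region.</p>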
<sec id="sec-4-3">
        <title>6. Conclusion and future work</title>
        <p>This paper presented our contributions to the GeoLingIt shared task. We addressed Task A by designing a pre-training approach that leverages data augmentation and contrastive learning, surpassing the baseline and demonstrating the effectiveness of our approach in region prediction. For Task B, we introduced a joint multi-task learning approach that outperformed the baseline and incorporated a post-processing rectification module, resulting in precise and geographically accurate location predictions. Our methods not only achieved state-of-the-art performance, allowing us to be placed first for Task B, but also provided some model insights into the rich linguistic landscape of Italy.</p>
        <p>Future work could delve into fine-grained dialect classification. This involves developing models capable of identifying specific dialects or regional varieties within a given region, which would provide a more nuanced understanding of language variation in Italy and enable more targeted analyses of sociolinguistic phenomena.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Acknowledgments</title>
        <p>This study was carried out within the FAIR - Future Artificial Intelligence Research and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1555 11/10/2022, PE00000013), and the grant “National Centre for HPC, Big Data and Quantum Computing”, CN000013 (approved under the M42C Call for Proposals - Investment 1.4 - Notice “Centri Nazionali” - D.D. No. 3138, 16.12.2021, admitted for funding by MUR Decree No. 1031, 17.06.2022), as a part of the MALTO (MAchine Learning @ poliTO) team, with partial support from the SmartData@PoliTO center on Big Data and Data Science. This manuscript reflects only the authors’ views and opinions; neither the European Union nor the European Commission can be considered responsible for them.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Maiden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Parry</surname>
          </string-name>
          , The dialects of Italy, Routledge,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          ,
          <article-title>Nlp for language varieties of italy: Challenges and the path forward</article-title>
          ,
          <source>arXiv preprint arXiv:2209.09757</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , C. Casula, GeoLingIt at EVALITA 2023:
          <article-title>Overview of the geolocation of linguistic variation in Italy task</article-title>
          ,
          <source>in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , G. Venturi,
          <year>Evalita 2023</year>
          :
          <article-title>Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , C. Casula,
          <article-title>DiatopIt: A corpus of social media posts for the study of diatopic language variation in Italy</article-title>
          ,
          <source>in: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial</source>
          <year>2023</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>199</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .vardial-
          <volume>1</volume>
          .
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Scherrer</surname>
          </string-name>
          ,
          <article-title>Natural language processing for similar languages, varieties, and dialects: A survey</article-title>
          ,
          <source>Natural Language Engineering</source>
          <volume>26</volume>
          (
          <year>2020</year>
          )
          <fpage>595</fpage>
          -
          <lpage>612</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Messaoudi</surname>
          </string-name>
          ,
          <article-title>Language recognition for dialects and closely related languages</article-title>
          .,
          <source>in: Odyssey</source>
          , volume
          <year>2016</year>
          ,
          <year>2016</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Elfeky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <article-title>Soto, Multi-dialectical languages efect on speech recognition: Too much choice can hurt</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>128</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Grieve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Montgomery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Murakami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Mapping lexical dialect variation in British English using Twitter</article-title>
          ,
          <source>Frontiers in Artificial Intelligence</source>
          <volume>2</volume>
          (
          <year>2019</year>
          )
          <fpage>11</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farzindar</surname>
          </string-name>
          ,
          <article-title>Automatic identification of Arabic language varieties and dialects in social media</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>Cross-lingual propagation of sentiment information based on bilingual vector space alignment</article-title>
          ,
          <source>in: EDBT/ICDT Workshops</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Colomba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>ITALIC: An Italian intent classification dataset</article-title>
          ,
          <source>arXiv preprint arXiv:2306.08502</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Geolocation prediction in social media data by finding location indicative words</article-title>
          ,
          <source>in: Proceedings of COLING 2012</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1045</fpage>
          -
          <lpage>1062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Discovering sociolinguistic associations with structured sparsity</article-title>
          ,
          <source>in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1365</fpage>
          -
          <lpage>1374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivatsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>When Twitter meets Foursquare: tweet location prediction using Foursquare</article-title>
          ,
          <source>in: 11th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Twitter user geolocation using a unified text and network prediction model</article-title>
          ,
          <source>arXiv preprint arXiv:1506.08259</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-J.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <article-title>Contrastive learning with stronger augmentations</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Passonneau</surname>
          </string-name>
          ,
          <article-title>Contrastive data and learning for natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>Transformer-based non-verbal emotion recognition: Exploring model portability across speakers' genders</article-title>
          ,
          <source>in: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schweter</surname>
          </string-name>
          ,
          <source>Italian BERT and ELECTRA models</source>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arivazhagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Language-agnostic BERT sentence embedding</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>878</fpage>
          -
          <lpage>891</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <article-title>BART-IT: An efficient sequence-to-sequence model for Italian text summarization</article-title>
          ,
          <source>Future Internet</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>15</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>