<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Salogni at GeoLingIt: Geolocalization by Fine-tuning BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilaria Salogni</string-name>
          <email>i.salogni@studenti.unipi.it</email>
          <aff>Università di Pisa, Italy</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>The recent growing interest in low-resource languages has been significantly bolstered by transformer-based models. By fine-tuning three such models, two based on BERT and the other on RoBERTa, I aim to geolocate sequences exhibiting non-standard language varieties relying solely on linguistic content. I find that, given that the information contained in the embeddings is all we need to carry out this complex task, a model architecture with fewer task-specific layers leads to better results. Furthermore, models pre-trained on miscellaneous corpora generalize better than those trained exclusively on tweets. The work also shows that the greater availability of resources for a given regional variety positively affects the capacity of the model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Task</title>
        <p>
          The goal of this project is to predict the location, in terms of longitude and latitude coordinates (fine-grained geolocation), of tweets exhibiting non-standard language, based solely on linguistic content. This is a (double) regression task. In contrast to previous geolocation shared tasks on other areas ([<xref ref-type="bibr" rid="ref6">6</xref>]; [<xref ref-type="bibr" rid="ref7">7</xref>]; [<xref ref-type="bibr" rid="ref8">8</xref>]), GeoLingIt is focused on Italy.
        </p>
        <p>
          Recognizing varieties and forming an opinion about where the speaker comes from is something so ingrained in our experience as speakers that it seems innate, and even a little magical. The question that drives this work is: can Large Language Models (LLMs) do what we do and, if so, how well can they do it? Do they do it in a way that is operationally similar to ours? The Italian scenario is a good testing ground since, despite its limited geographical extent, it is one of the most linguistically diverse areas in Europe. In their work, Ramponi and Casula say that current transformer-based models are rather limited for modeling language variation over space in highly multilingual areas such as Italy [<xref ref-type="bibr" rid="ref1">1</xref>]. I do not agree completely, not only because of the encouraging results of the application of LLMs to an ever-growing number of tasks, but also because what we can explain about how they work does not highlight anything that may prevent good performance. Furthermore, the work of Lutsai and Lampert [<xref ref-type="bibr" rid="ref2">2</xref>] reaches the astonishing result of a median error of 30 km at the worldwide level, and of fewer than 15 km on US-level datasets, for models trained and evaluated on text features of tweets' content and metadata context using a BERT model [<xref ref-type="bibr" rid="ref3">3</xref>]. The fact that the Twitter language identifier assigns the label designed for the standard Italian language also to content partially or fully written in language varieties of Italy, as observed again by Ramponi and Casula [<xref ref-type="bibr" rid="ref1">1</xref>], may suggest that LLMs already have in their pre-training dataset the knowledge that they need to carry out a geolocalization task.
        </p>
        <p>
          This document describes the model I submitted to the EVALITA 2023 evaluation campaign [<xref ref-type="bibr" rid="ref4">4</xref>] for the GeoLingIt task [<xref ref-type="bibr" rid="ref5">5</xref>].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Dataset</title>
        <p>
          The GeoLingIt task data comprise 15K geotagged tweets that exhibit non-standard Italian language use (the content may be fully written in local language varieties or exhibit code-switching with standard Italian) and that have been collected in the DiatopIt corpus [<xref ref-type="bibr" rid="ref1">1</xref>]. The data are annotated with latitude and longitude. After removing emojis and tags, all the labeled data provided by the organizers were merged and then split into train-eval-test sets. Several cross-validations were performed with 3-fold or 2-fold splits, using the train-eval sets. Target and output coordinate data were normalized using min-max scaling, as this understandably improved the quality of model prediction.
        </p>
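        <p>
          As a rough illustration of the normalization step, the sketch below min-max scales (latitude, longitude) pairs to [0, 1] and maps model outputs back to degrees. The example coordinates and the use of scikit-learn are illustrative assumptions, not the exact code behind the submission.
        </p>
        <preformat>
# Hedged sketch: min-max normalization of the regression targets, as in
# Section 1.2. The example coordinates are illustrative, not from DiatopIt.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

coords = np.array([[45.07, 7.69],    # (lat, lon), e.g. Turin
                   [40.85, 14.27],   # e.g. Naples
                   [38.12, 13.36]])  # e.g. Palermo

scaler = MinMaxScaler()                 # maps each column to [0, 1]
targets = scaler.fit_transform(coords)  # normalized training targets

# Model outputs live in the same [0, 1] space; map them back to degrees
# before computing distances in km.
pred_degrees = scaler.inverse_transform(targets)
        </preformat>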
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. System description</title>
      <p>
        Knowing that the representations learned by transformer-based models achieve strong performance across many tasks and datasets ([<xref ref-type="bibr" rid="ref9">9</xref>], inter alia), I first decided to fine-tune three different monolingual BERT-based [<xref ref-type="bibr" rid="ref3">3</xref>] or RoBERTa-based [<xref ref-type="bibr" rid="ref10">10</xref>] models pre-trained on Italian texts. After picking the best-performing model, I cross-validated it on a diverse set of hyperparameter configurations (e.g., number and size of hidden layers, activation functions) to pick the best task-specific architecture. All the runs were performed on Colab using high-RAM Nvidia A100 GPUs.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Comparing different models</title>
        <p>
          To assess how different pre-trained monolingual models perform on the given dataset, I fine-tuned the RoBERTa-based umberto-commoncrawl-cased-v1 [<xref ref-type="bibr" rid="ref11">11</xref>] model and the BERT-based models bert-base-italian-cased [<xref ref-type="bibr" rid="ref12">12</xref>] and AlBERTo-it [<xref ref-type="bibr" rid="ref13">13</xref>], adding to the pooling layer of each model a single linear layer with two output neurons and no activation function. I tested this "minimal" task-specific architecture with 3 batch sizes (50, 100, 150) for 10 epochs, dividing the train-dev set into 3 folds.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Adding a single hidden layer</title>
        <p>
          To explore the potential benefits of introducing additional complexity into the model, I designed a new task-specific architecture by adding a single hidden layer right after the pooling layer, testing different sizes (5 neurons and then 300 neurons) followed by an activation function (Identity, Sigmoid, or ReLU) and, finally, a two-neuron output layer. To reduce the computational cost, only umberto-commoncrawl-cased-v1 was tested with this and the next architectures; for the same reason, batch size 50 was maintained. A configurable sketch of these deeper heads is shown below.
        </p>
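        <p>
          The deeper heads of this and the next section can be expressed as one configurable builder; the following is a hypothetical reconstruction rather than the submitted code (torch.nn.Identity stands for the Identity activation).
        </p>
        <preformat>
# Hedged sketch of the hidden-layer heads (Sections 2.2-2.3).
import torch

def build_head(in_size, hidden_sizes, activation):
    """e.g. build_head(768, (300,), torch.nn.ReLU)
       or   build_head(768, (5, 5, 10), torch.nn.Sigmoid)."""
    layers, size = [], in_size
    for h in hidden_sizes:
        layers += [torch.nn.Linear(size, h), activation()]
        size = h
    layers.append(torch.nn.Linear(size, 2))  # two-neuron output layer
    return torch.nn.Sequential(*layers)

single_hidden = build_head(768, (300,), torch.nn.Identity)
three_hidden  = build_head(768, (5, 5, 10), torch.nn.ReLU)
        </preformat>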
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Adding more hidden layers</title>
        <p>
          Still with the rationale of knowing whether adding further complexity would enhance the model's learning capacity, I tested a task-specific setting with 3 hidden layers, with neuron counts in the combinations (5, 5, 10), (10, 5, 5) and (300,
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        The umberto-commoncrawl-cased-v1 model with the "minimal" task-specific architecture yielded the best Mean Absolute Error (MAE) in 3-fold cross-validation using the provided labelled data, and achieved an average distance of 128.19 km on the blind test set provided by the challenge organizers. Although AlBERTo-it and bert-base-italian-cased were outperformed, the results they achieved are not too distant, as shown in Table 1.
      </p>
      <p>
        The second-best MAE results were achieved using the task-specific architecture with a single 300-neuron hidden layer, followed by the architecture with a single 5-neuron hidden layer, as shown in Table 2. This can be explained by observing that adding a small hidden layer after the pooling layer leads to an initial drastic reduction in the size of the model output. The worst results, on the other hand, were all obtained with the 3-hidden-layer architecture and ReLU as activation function. The accuracy dropped possibly because of excessive feature compression: when several hidden layers are stacked, this reduction is followed by a further reduction of the size of the input vector, and the linear activation function was of no use in this case. Therefore, further complicating the architecture requires an additional regularization effort, which the results achieved with only one hidden layer or even
      </p>
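      <p>
        The "average distance in km" reported above is naturally computed as a mean great-circle distance; the sketch below uses the standard haversine formula and is not necessarily the organizers' evaluation script.
      </p>
      <preformat>
# Hedged sketch: mean haversine distance in km between gold and
# predicted (lat, lon) pairs, both given in degrees.
import numpy as np

def mean_km_distance(gold, pred, radius_km=6371.0):
    lat1, lon1 = np.radians(gold).T
    lat2, lon2 = np.radians(pred).T
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return float(np.mean(2 * radius_km * np.arcsin(np.sqrt(a))))
      </preformat>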
      <p>
        There are no specific areas where the inputs have a larger error. In contrast, inputs from the Piedmont-Lombardy-Veneto and Lazio-Campania areas have a lower error than the others. In fact, two or three marked clusters can be observed in the scatterplots of the outputs (Figure 2), depending on the model configuration: the most persistent is between Lazio and Campania, then comes a cluster that follows the Alpine arc and, less frequently, a cluster over Sicily. Excluding that this can be attributed to an imbalance in our fine-tuning dataset, this result must come from the representation of the embeddings of each model. Ramponi and Casula [<xref ref-type="bibr" rid="ref1">1</xref>] argue that the pre-training material used by those models may include content in language varieties of Italy, and they attribute this to the over-prediction of Italian by current language identifiers, observing that content both partially and fully written in language varieties of Italy is typically classified as standard Italian by the Twitter language identifier. I can further hypothesize that the varieties from the areas with the smallest error are also quantitatively more present in the pre-training dataset of each model, as these are also the varieties from the most densely populated areas in Italy.
      </p>
      <fig id="fig-2">
        <caption>
          <p>Figure 2: Scatterplots of the target (in black) and output (in red) coordinates for each configuration.</p>
        </caption>
      </fig>
      <p>
        However, it is very complex to connect these observations to one or more linguistic facts concerning the Italian regional varieties. The question then is how did
      </p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>
        The behavior shown by our models (the need for regularization in the presence of numerous layers, better results with a single, bigger hidden layer) is what we would expect from a simple neural network. However, it is astonishing that such a simple architecture manages to obtain non-disastrous results in a complex NLP task. The success of this regression task is undoubtedly attributable to the high-level representations of the input data, together with BERT's ability to understand the linguistic context. Therefore, less is more: a simple setup, using even just two output neurons, seems to work better than a more complex one for BERT models fine-tuned on this task. Furthermore, in this work the models pre-trained on a miscellaneous corpus provided embeddings that performed better on tweets than those pre-trained on a corpus of the same genre.<fn id="fn1"><p>The OSCAR [<xref ref-type="bibr" rid="ref15">15</xref>] subcorpus also has some subsets in other language varieties of Italy (such as Piedmontese), but the official umberto-commoncrawl-cased-v1 model card says that it was pre-trained only on the Italian subcorpus, deduplicated.</p></fn> In conclusion, it is difficult to say how close we came to the goal, if the goal was to adequately map the diatopic variation of contemporary Italian, trying to automatically extract regional and dialectal patterns.
      </p>
      <p>
        Even if in this work we were unable to further probe the linguistic information used to carry out our task, the relevant studies converge in holding that BERT's structure is nonetheless linguistically grounded, although perhaps in a way that is more nuanced than can be explained by layers alone [<xref ref-type="bibr" rid="ref17">17</xref>].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Ramponi, C. Casula, DiatopIt: A corpus of social media posts for the study of diatopic language variation in Italy, in: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 187-199. URL: https://aclanthology.org/2023.vardial-1.19.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] K. Lutsai, C. H. Lampert, Geolocation predicting of tweets using BERT-based models, 2023. arXiv:2303.07865.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Ramponi, C. Casula, GeoLingIt at EVALITA 2023: Overview of the geolocation of linguistic variation in Italy task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] B. Han, A. Rahimi, L. Derczynski, T. Baldwin, Twitter geolocation prediction shared task of the 2016 workshop on noisy user-generated text, in: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), The COLING 2016 Organizing Committee, Osaka, Japan, 2016, pp. 213-217. URL: https://aclanthology.org/W16-3928.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. Gaman, D. Hovy, R. T. Ionescu, H. Jauhiainen, T. Jauhiainen, K. Lindén, N. Ljubešić, N. Partanen, C. Purschke, Y. Scherrer, M. Zampieri, A report on the VarDial evaluation campaign 2020, in: Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, International Committee on Computational Linguistics (ICCL), Barcelona, Spain (Online), 2020, pp. 1-14. URL: https://aclanthology.org/2020.vardial-1.1.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] B. R. Chakravarthi, M. Gaman, R. T. Ionescu, H. Jauhiainen, T. Jauhiainen, K. Lindén, N. Ljubešić, N. Partanen, R. Priyadharshini, C. Purschke, E. Rajagopal, Y. Scherrer, M. Zampieri, Findings of the VarDial evaluation campaign 2021, in: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Association for Computational Linguistics, Kiyv, Ukraine, 2021, pp. 1-11. URL: https://aclanthology.org/2021.vardial-1.1.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 353-355. URL: https://aclanthology.org/W18-5446. doi:10.18653/v1/W18-5446.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] L. Parisi, S. Francia, P. Magnani, UmBERTo: an Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, V. Basile, AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets, in: Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481, CEUR, 2019. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85074851349&amp;partnerID=40&amp;md5=7abed946e06f76b3825ae5e294fac14.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] G. Wiedemann, S. Remus, A. Chawla, C. Biemann, Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings, CoRR abs/1909.10430 (2019). URL: http://arxiv.org/abs/1909.10430. arXiv:1909.10430.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Abadji, P. Ortiz Suarez, L. Romary, B. Sagot, Towards a Cleaner Document-Oriented Multilingual Crawled Corpus, arXiv e-prints (2022). arXiv:2201.06642.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Tiedemann, S. Thottingal, OPUS-MT - Building open translation services for the world, in: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, European Association for Machine Translation, Lisboa, Portugal, 2020, pp. 479-480. URL: https://aclanthology.org/2020.eamt-1.61.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Niu, W. Lu, G. Penn, Does BERT rediscover a classical NLP pipeline?, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 3143-3153. URL: https://aclanthology.org/2022.coling-1.278.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>