<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DANTE at GeoLingIt: Dialect-Aware Multi-Granularity Pre-training for Locating Tweets within Italy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Gallipoli</string-name>
          <email>giuseppe.gallipoli@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moreno La Quatra</string-name>
          <email>moreno.laquatra@unikore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Rege Cambrin</string-name>
          <email>daniele.regecambrin@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvatore Greco</string-name>
          <email>salvatore_greco@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Cagliero</string-name>
          <email>luca.cagliero@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kore University of Enna</institution>
          ,
          <addr-line>Piazza dell'Università, 94100 Enna EN</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Corso Duca degli Abruzzi 24, 10129 Turin TO</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents an NLP research system designed to geolocate tweets within Italy, a country renowned for its diverse linguistic landscape. Our methodology consists of a two-step process involving pre-training and fine-tuning phases. In the pre-training step, we take a semi-supervised approach and introduce two additional tasks. The primary objective of these tasks is to provide the language model with comprehensive knowledge of language varieties, focusing on both the sentence and token levels. Subsequently, during the fine-tuning phase, the model is adapted explicitly for two subtasks: coarse- and fine-grained variety geolocation. To evaluate the effectiveness of our methodology, we participate in the GeoLingIt 2023 shared task and assess our model's performance using standard metrics. Ablation studies demonstrate the crucial role of the pre-training step in enhancing the model's performance on both tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Linguistic Varieties</kwd>
        <kwd>Region Localization</kwd>
        <kwd>Text Classification and Regression</kwd>
        <kwd>Italian NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Italy is widely recognized for its linguistic diversity, with 20 distinct regions, each characterized by various unique and shared dialects [1]. These dialects exhibit further variations within each region, often associated with specific cities or provinces, and sometimes extend beyond regional boundaries. The intricate nature of Italy's linguistic landscape poses a significant challenge in accurately identifying the origin of a given text within the country.</p>
      <p>This research is conducted in the context of the GeoLingIt shared task [2] at EVALITA 2023 [3]. It focuses on the geolocation of social media data, specifically Twitter posts. The task comprises two subtasks: Coarse-grained variety geolocation (Subtask A), whose aim is to determine the region from which a tweet originates within the 20 Italian regions, and Fine-grained variety geolocation (Subtask B), which focuses on predicting the latitude and longitude coordinates corresponding to the origin of a tweet within Italy. Linguistic variations within and across regions make it difficult to accurately associate a piece of text with its specific geographic origin. The challenge becomes even more significant due to the similarities each language variety may share with other languages, even outside Italy.</p>
      <p>This paper presents the DANTE (Dialect ANalysis TEam)<sup>1</sup> submission for the GeoLingIt 2023 shared task, characterized by a two-step methodology involving pre-training and fine-tuning phases. By leveraging Italian or multilingual models, we propose a semi-supervised pre-training approach that combines standard and novel pre-training tasks to capture regional dialect information at multiple levels of granularity (i.e., sentence and token levels). Following the pre-training phase, the model undergoes a standard fine-tuning process tailored to the two subtasks proposed by the shared task. Through extensive experiments, we demonstrate the effectiveness of our methodology.</p>
      <p><sup>1</sup>The name “DANTE” is inspired by the Italian poet Dante Alighieri, widely regarded as one of the founding fathers of the Italian language.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Text classification is a fundamental task in NLP whose
objective is to assign one (or more) predefined classes to
a piece of text. It has many applications ranging from
sentiment analysis to topic classification. In this work,
we apply it to the prediction of the geographic region
associated with the linguistic variety expressed in a tweet.</p>
      <sec id="sec-2-1">
        <title>1The name “DANTE” is inspired by the Italian poet Dante Alighieri,</title>
        <p>widely regarded as one of the founding fathers of the Italian
language.</p>
        <p>The introduction of the Transformer [4] architec- weights and further pre-train it using both standard and
ture for machine translation has represented a signif- novel pre-training tasks.
icant breakthrough in NLP, achieving superior
performance also in other tasks, including text classification. Masked Language Modeling (MLM) &amp; Next
SenTransformer-based classification models implement an tence Prediction (NSP). The MLM and NSP tasks are
encoder-only architecture whose objective is to extract standard pre-training tasks used to train
Transformera continuous representation from the input text. To do based models. Both tasks contribute to language
processthis, models are generally pre-trained on large corpora ing by helping the model learn the contextual information
of unlabeled text using specific pre-training objectives. of words and their relationships.</p>
        <p>The pre-training stage allows the model to learn
language representations that enable it to capture the struc- Region Classification (RC). By leveraging
regionture and semantics of the text more efectively. Our work specific data, we integrate into pre-training the
superfollows the same approach by further pre-training several vised task of predicting the geographic region associated
Transformer-based models as discussed in Section 3. Af- with the linguistic variety expressed in a given sentence.
ter pre-training, the model is fine-tuned on labeled data
tailored to the desired task. Specifically, the architecture Token-level Region Classification (TRC). We also
is enriched by additional classification layers (i.e., classifi- include an additional (supervised) token classification
cation head) trained in a supervised fashion to output the task. It aims at predicting the geographic region
assoifnal probability for each class. Similarly, by introducing ciated with each token in a given sentence. To create
one or multiple linear layers, (multi-)regression tasks can training examples, we randomly combine multiple
senalso be performed. tences belonging to text snippets labeled with diferent</p>
        <p>Some of the most widely adopted Transformer-based regions. This task aims at enabling the model to capture
classification models include: BERT and its multilingual regional linguistic information with higher granularity.
version mBERT [5], DistilBERT [6], which is a distilled
version of BERT, RoBERTa [7], and its multilingual
version XLM [8], which are two variations of BERT including Using a multi-task learning approach, the model is
dynamic masking. trained on multiple tasks simultaneously, allowing it to</p>
        <p>Computational linguistics research in Italian faces learn a shared representation useful for all tasks. We
dechallenges due to the scarcity of large-scale datasets ifne a separate linear layer for each task (i.e., task-specific
specifically designed for the language, as highlighted re- head) that operates on the shared representation and is
cently [9]. Also, the computational efort required to pre- trained using the corresponding labeled data. We
experitrain language models has resulted in only a few available mented with two diferent multi-task learning setups: (1)
architectures in Italian. Specifically, some of them are task-specific training (TST) , where the model is trained
BERT-Italian and ELECTRA-Italian [10]. Furthermore, on a single task at a time, with each batch randomly
sealthough they are not encoder-only architectures, the fol- lecting one task from the set of all available tasks, and (2)
lowing are some of the other models available in Italian: joint training (JT), where the model is trained on all tasks
GePpeTto [11], which is based on GPT-2 [12], IT5 [13], simultaneously, and the loss is computed as the average
which is the Italian version of T5 [14], and the recently of the losses of all tasks. These two multi-task learning
released BART-IT [15], which is the Italian version of setups were inspired by recent findings in the literature
BART [16]. [17].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Description of the system</title>
      <sec id="sec-3-1">
        <title>The DANTE methodology for the GeoLingIt shared task aims to both identify the region of origin and predict the geographic coordinates of tweets within Italy.</title>
        <sec id="sec-3-1-1">
          <title>3.1. Pre-training</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>The initial phase of our methodology involves pretraining the model to improve its ability to analyze diferent linguistic varieties. We initialize a Transformer-based encoder model using Italian or multilingual pre-trained</title>
        <sec id="sec-3-2-1">
          <title>3.2. Subtask A: Coarse-grained variety geolocation</title>
        </sec>
      </sec>
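        <p>To make the TRC example construction concrete, the following is a minimal sketch assuming a HuggingFace tokenizer; the build_trc_example helper, the region2id mapping, and the sample snippets are illustrative and not part of our released code.</p>
        <preformat>
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def build_trc_example(snippets, region2id, max_length=128):
    """Concatenate sentences labeled with different regions and assign
    each token the region label of the sentence it comes from."""
    input_ids, labels = [], []
    for text, region in snippets:
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        labels.extend([region2id[region]] * len(ids))
    return {"input_ids": input_ids[:max_length], "labels": labels[:max_length]}

# Hypothetical snippets from two different regions.
region2id = {"Campania": 0, "Piemonte": 1}
snippets = [("Comme si bell!", "Campania"), ("Andoma bin!", "Piemonte")]
example = build_trc_example(snippets, region2id)
        </preformat>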
      <sec id="sec-3-3">
        <title>The Subtask A within GeoLingIt 2023 shared task involves</title>
        <p>identifying the region of origin of a given tweet within
Italy. It can be formulated as a classification task, where
the model is trained to classify each tweet into its
corresponding geographic region (i.e., one of the 20 Italian
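        <p>The sketch below contrasts the two setups under stated assumptions: encoder, heads, and loss_fns are generic callables with illustrative names, not our exact training loop.</p>
        <preformat>
import random

def multitask_loss(encoder, heads, loss_fns, task_batches, mode="JT"):
    """Loss for one step under the two setups: TST randomly picks a
    single task per batch, JT averages the losses of all tasks."""
    tasks = list(task_batches)
    if mode == "TST":
        tasks = [random.choice(tasks)]
    losses = []
    for task in tasks:
        inputs, labels = task_batches[task]
        shared = encoder(inputs)           # shared representation
        logits = heads[task](shared)       # task-specific linear head
        losses.append(loss_fns[task](logits, labels))
    return sum(losses) / len(losses)
        </preformat>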
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Subtask A: Coarse-grained variety geolocation</title>
        <p>Subtask A of the GeoLingIt 2023 shared task involves identifying the region of origin of a given tweet within Italy. It can be formulated as a classification task, where the model is trained to classify each tweet into its corresponding geographic region (i.e., one of the 20 Italian regions). To this end, we follow a standard fine-tuning approach, where the pre-trained model is adapted to the downstream task using the labeled training data. The representation of a special [CLS] token is used as the input to a linear layer trained to predict the region of origin of the tweet. The model is trained to minimize the cross-entropy loss between the predicted and the ground-truth labels. For a visual representation of the fine-tuning process, please refer to the top-left part of Figure 1.</p>
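        <p>As a minimal sketch of this fine-tuning setup, assuming a PyTorch/HuggingFace stack (the RegionClassifier class name and the mBERT backbone choice are illustrative):</p>
        <preformat>
import torch
from transformers import AutoModel

class RegionClassifier(torch.nn.Module):
    """Linear classification head over the [CLS] representation."""
    def __init__(self, backbone="bert-base-multilingual-cased", num_regions=20):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, num_regions)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.head(cls)               # one logit per region

# Training minimizes torch.nn.functional.cross_entropy(logits, region_labels).
        </preformat>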
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Subtask B: Fine-grained variety geolocation</title>
        <p>Subtask B aims to localize a given tweet's origin within Italy by predicting its latitude and longitude coordinates. The task can be formulated as a regression problem, where the model is trained to jointly predict two values (i.e., the latitude and longitude coordinates of the tweet's origin) using two separate linear layers. Similar to the previous case, the tweet representation is obtained by feeding the [CLS] token representation to both linear layers. The fine-tuning architecture for this subtask is illustrated in the upper-right corner of Figure 1, showcasing the linear layers positioned on top of the [CLS] token.</p>
        <p>The overall loss function for the regression task is defined as ℒ = ½ (ℒ<sub>lat</sub> + ℒ<sub>lon</sub>), where ℒ<sub>lat</sub> and ℒ<sub>lon</sub> represent the mean squared error (MSE) loss for the latitude and longitude predictions, respectively. A sketch of this two-head setup and loss is given below.</p>
        <p>It is worth noting that the model is separately fine-tuned for each subtask (i.e., coarse- and fine-grained variety geolocation). Thus, we do not use multi-task learning at this stage. Jointly optimizing the two subtasks during fine-tuning could help the model ensure consistency between them and improve its performance; this is one of the possible future directions we plan to explore.</p>
        <p>[Figure 1: Fine-tuning architecture. It includes two branches: one for Subtask A, which predicts the region class (represented as “R”), and another for Subtask B, which predicts the latitude (“Lat”) and longitude (“Lon”).]</p>
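        <p>A minimal sketch of the two-head regression and the loss above, again assuming a PyTorch/HuggingFace stack (class and function names are illustrative):</p>
        <preformat>
import torch
from transformers import AutoModel

class CoordinateRegressor(torch.nn.Module):
    """Two linear heads over the [CLS] representation: latitude and longitude."""
    def __init__(self, backbone="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.lat_head = torch.nn.Linear(hidden, 1)
        self.lon_head = torch.nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.lat_head(cls).squeeze(-1), self.lon_head(cls).squeeze(-1)

def regression_loss(pred_lat, pred_lon, gold_lat, gold_lon):
    """L = 1/2 (L_lat + L_lon), both mean squared errors."""
    mse = torch.nn.functional.mse_loss
    return 0.5 * (mse(pred_lat, gold_lat) + mse(pred_lon, gold_lon))
        </preformat>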
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
      <sec id="sec-4-1">
        <title>4.1. Pre-training Dataset</title>
        <sec id="sec-4-1-1">
          <title>To the best of our knowledge, there are no existing large</title>
          <p>CLS scale data collections specifically focusing on Italian
language varieties. Therefore, we exploited web
scrapTransformer Encoder ing to construct our pre-training dataset. From a web
Model search, we identified the following two sources: (1)
Dialettando2: a website that contains several proverbs, sayings,
poems, rhymes, and stories from diferent regions; (2)</p>
          <p>CLS Wikipedia: which comprises specific versions for some
reFigure 1: Fine-tuning architecture. It includes two branches: gional languages (e.g., nap3 for Neapolitan). They were
one for Subtask A which predicts the region class (represented both accessed in January 2023. For the data collected
as “R”), and another for Subtask B which predicts the latitude from Wikipedia, we associated each language-specific
(represented as “Lat”) and longitude (represented as “Lon”). Wikipedia portal with the region primarily
representing the respective language. For example, data collected
from the nap Wikipedia portal would be associated with
the Campania region, which predominantly represents
the cross-entropy loss between the predicted and the the Neapolitan language. After the data collection, we
ground-truth labels. For a visual representation of the ended up with a corpus of 273,011 documents containing
ifne-tuning process, please refer to the top-left part of linguistic varieties of Italian from diferent regions. Out
Figure 1. of these, 12,692 documents were collected from
Dialettando, while the majority, 260,319, were obtained from
3.3. Subtask B: Fine-grained variety Wikipedia. From Dialettando, we also collected the 12,692
geolocation Italian translations of the same documents in the corpus.</p>
          <p>This was done because the DiatopIt corpus utilized in the
The Subtask B aims to localize a given tweet’s origin task contains instances that encompass regional Italian
within Italy by predicting its latitude and longitude co- variations. Therefore, the final pre-training dataset is
ordinates. The task can be formulated as a regression composed of 285,703 documents. Notice that a document
problem, where the model is trained to jointly predict can be a Wikipedia article or any text from Dialettando
two values (i.e., the latitude and longitude coordinates (e.g., proverb, saying, or story) without any diference,
of the tweet’s origin) using two separate linear layers. even if they can have diferent lengths. Indeed, we found
Similar to the previous case, the tweet representation that the mean number of tokens is 48 for Dialettando and
is obtained by feeding the [CLS] token representation 147 for Wikipedia4. However, both sources of texts can
to both linear layers. The fine-tuning architecture for be helpful during the pre-training phase.
this subtask is illustrated in the upper right corner of Figure 2 shows the distribution of the collected
docuFigure 1, showcasing the linear layers positioned on top ments for each Italian region. Table 1 details the
numof the [CLS] token. ber of documents for each region and data source. As</p>
          <p>The overall loss function for the regression task is can be noticed, regions of the north of Italy, such as
defined as: Piemonte, Lombardia, and Veneto, are predominant in
1 the dataset, with approximately 60k texts (corresponding
ℒ = 2 (ℒ + ℒ) to approximately 20% of the entire collection) each. They
are followed by some regions of the south, such as Sicilia,
Where ℒ and ℒ represent the mean squared er- Campania, and Puglia, with around 25k, 14k, and 11k
ror (MSE) loss for the latitude and longitude predictions, texts, respectively. Finally, regions such as Valle D’Aosta,
respectively. Toscana, Umbria, Marche, Lazio, Molise, Abruzzo, and
It is worth noting that the model is separately fine-tuned
for each task (i.e., coarse- and fine-grained variety
geolocation). Thus, we do not use multi-task learning at this
stage. Jointly optimizing the two tasks during fine-tuning
could help the model to ensure consistency between the</p>
        </sec>
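        <p>As an illustration of the portal-to-region association described above (only the nap → Campania pair is stated in the text; the remaining pairs are plausible examples of such a mapping, used here purely for illustration):</p>
        <preformat>
# Wikipedia language codes of regional portals mapped to the Italian
# region primarily representing the respective language.
WIKI_CODE_TO_REGION = {
    "nap": "Campania",   # Neapolitan (stated in the text)
    "scn": "Sicilia",    # Sicilian (illustrative)
    "pms": "Piemonte",   # Piedmontese (illustrative)
    "vec": "Veneto",     # Venetian (illustrative)
    "lmo": "Lombardia",  # Lombard (illustrative)
}
        </preformat>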
        <sec id="sec-4-1-2">
          <title>2https://www.dialettando.com</title>
          <p>3https://nap.wikipedia.org/
4Computed with the bert-base-multilingual-cased
tokenizer of the HuggingFace library https://huggingface.co/
bert-base-multilingual-cased.
Abruzzo
Basilicata
Calabria
Campania
Emilia Romagna
Friuli Venezia Giulia
Lazio
Liguria
Lombardia
Marche
Molise
Piemonte
Puglia
Sardegna
Sicilia
Toscana
Trentino Alto Adige
Umbria
Valle D’Aosta
Veneto</p>
          <p>Total</p>
          <p>Abbr.</p>
          <p>ABR
BAS
CAL
CAM
EMI
FRI
LAZ
LIG
LOM
MAR
MOL
PIE
PUG
SAR
SIC
TOS
TRE
UMB
VAL
VEN
# Documents
Dial. Wiki</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. GeoLingIt Dataset</title>
        <sec id="sec-4-2-1">
          <title>The GeoLingIt 2023 dataset for the Subtasks A and B is</title>
          <p>DiatopIt [18], a corpus of diatopic variations of language
in Italy. It is composed of geotagged social media posts
from Twitter. Each tweet also comprises the associated
latitude, longitude, and the Italian region of origin. The
dataset contains 13,669 examples for training, 552 for
validation, and 818 for testing. This dataset is exploited in
the fine-tuning phase to specialize the models to
coarseand fine-grained variety geolocation. The authors of Experimental settings The pre-training phase lasts
the competition have already anonymized data. Specif- five epochs and utilizes the Adam optimizer with a linear
ically, user mentions, email addresses, and URLs have learning rate scheduler. The scheduler includes a warmup
been replaced with specific placeholders for privacy rea- period (10% of the total training steps) followed by a
sons. However, the content of tweets is unfiltered and linear decay of the learning rate until the end of training.
can exhibit non-standard language use (e.g., insults, bad The fine-tuning phase lasts for ten epochs and utilizes
words). the same settings for the optimizer and scheduler as the
pre-training phase.
considered: (1) multilingual BERT model (mBERT)5, (2)
BERT model pre-trained on 13GB of Italian text
(BERTIT)6, (3) BERT model pre-trained on 81GB of Italian text
(BERT-IT-XXL)7, (4) XLM-RoBERTa (XLM-R)8, and (5)
multilingual DistilBERT (dBERT)9. All models are used
in their cased versions. By comparing the results of these
baseline models with our approach, we can assess the
benefits of the proposed pre-training phase.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>Baseline models We compare our models with base</title>
        <p>line models pre-trained on Italian or multilingual data, 5https://huggingface.co/bert-base-multilingual-cased
which undergo the same fine-tuning process described in 6https://huggingface.co/dbmdz/bert-base-italian-cased
Sections 3.2 and 3.3. The following baseline models are 7https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
8https://huggingface.co/xlm-roberta-base
9https://huggingface.co/distilbert-base-multilingual-cased</p>
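      <p>A sketch of this optimization setup, assuming PyTorch and the transformers scheduler utility (the helper name and the unspecified learning rate are illustrative):</p>
      <preformat>
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer_and_scheduler(model, lr, total_steps, warmup_frac=0.1):
    """Adam with a 10% linear warmup followed by linear decay."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
      </preformat>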
        <sec id="sec-5-1-1">
          <title>5.1. Coarse-grained variety geolocation</title>
          <p>We report the single models and an ensemble, which in- LogReg
cludes all models evaluated on the development set. The MFB
ensemble prediction is obtained through majority voting
on the individual models’ predictions. In case of a tie, a
random selection is employed. The organizers provide
Logistic Regression (LogReg) and Most Frequent
Baseline (MFB) as baselines. According to the competition
rules, we consider macro F1-score, precision, and recall
as evaluation metrics.</p>
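        <p>A minimal sketch of the majority-voting ensemble with random tie-breaking (the function name is illustrative):</p>
        <preformat>
import random
from collections import Counter

def majority_vote(predictions):
    """predictions: the region predicted by each model for one tweet."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [region for region, count in counts.items() if count == top]
    return random.choice(tied)  # random selection in case of a tie

# e.g. majority_vote(["LOM", "PIE", "LOM"]) returns "LOM"
        </preformat>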
        <p>In Table 2, we report the development and test set results. There is a noticeable performance improvement when comparing the deep learning techniques to the classical machine learning methods. One common aspect shared by both approaches is the observed degradation in performance on the test set compared to the development set. This pattern can be attributed to the fact that the test set contains additional regions that are not present in the development set. Joint Training (JT) consistently yields the best results in terms of F1-score, achieving significant improvements ranging from +2% to +7% compared to the absence of pre-training. This boost primarily manifests as enhanced precision.</p>
        <p>[Table 2: Subtask A results on the development and test sets.]</p>
        <p>Following the GeoLingIt guidelines, we can submit only three models for test set evaluation. We selected the top-3 models based on their performance on the development set: the Jointly-Trained BERT-IT-XXL, the Task-Specific-Trained BERT-IT-XXL, and the models' Ensemble. We show in Table 2b the performance of these models on the test set. The results show that the Ensemble method achieved the highest performance, followed by the Task-Specific-Trained (TST) BERT-IT-XXL model and the Jointly-Trained (JT) BERT-IT-XXL model. Surprisingly, the TST pre-training approach outperformed the others, exhibiting a significant +2% improvement in F1-score compared to the corresponding model pre-trained using JT.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Fine-grained variety geolocation</title>
        <p>The task organizers provide K-nearest-neighbor (KNN) and centroid baseline (CB) models as baselines. Following the shared task guidelines, the models' performance is assessed using the haversine distance; in this case, the lower, the better. We report the results of single models and of an ensemble of the top-2 evaluated models. The ensemble prediction is obtained using the mean point between the two individual models' predictions; both the metric and the ensemble are sketched below.</p>
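        <p>For reference, a minimal sketch of the haversine distance (in kilometers, on a spherical Earth) and of the mean-point ensemble:</p>
        <preformat>
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def mean_point(pred1, pred2):
    """Top-2 ensemble: the mean point of two (lat, lon) predictions."""
    return ((pred1[0] + pred2[0]) / 2, (pred1[1] + pred2[1]) / 2)
        </preformat>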
        <p>In Table 3, we report the results for the evaluated models on the development and test sets. Similar to Subtask A, deep learning models outperform classical approaches. Notably, the test set performance in this task shows higher scores than the development set.</p>
        <p>The results on the development set confirm the effectiveness of the pre-training, with one exception: Task-Specific Training on BERT-IT. The differences span from −19 km to −161 km with respect to the models without a specific pre-training. In most cases, Joint Training consistently yields the best results, except for XLM-R. We submit the top-3 most promising solutions according to the development set results for evaluation on the test set: the Jointly-Trained dBERT, the Task-Specific-Trained dBERT, and the Ensemble. We report their performance on the test set in Table 3b. The Ensemble model achieved the best performance, with the TST pre-training demonstrating a 1.5 km average distance improvement compared to the JT counterpart.</p>
        <p>[Table 3: Subtask B results. The PT column indicates no pre-training with ✗, Task-Specific Training with TST, Joint Training with JT, and the models ensemble with E.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Our analysis assessed the effectiveness of widely used deep language models in the context of both coarse- and fine-grained variety geolocation tasks. We also offer an interactive demo<sup>10</sup> to showcase our best-performing models and release both the code and the pre-trained models<sup>11</sup>.</p>
      <p>It is worth noting that the social media data composing the fine-tuning dataset may contain profanities, slurs, hateful content, and stereotypes. Although the pre-training data is collected using controlled sources, a similar statement may apply to it. Both the Dialettando website and the Wikipedia portals are partially managed by a community; therefore, their content may not be carefully curated. As a result, the models may exhibit label correlations based on the presence of such offensive language, potentially influencing their region identification capabilities. The proposed methodologies are not intended to offend anyone; since they may be inaccurate in some cases, improprieties are possible.</p>
      <p><sup>10</sup>https://huggingface.co/spaces/DGMS/DANTE-GeoLingIT2023</p>
      <p><sup>11</sup>https://github.com/MorenoLaQuatra/DANTE-GeoLingIT2023</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion and Future Works</title>
      <sec id="sec-6-1">
        <title>This paper presents an efective solution for modeling lan</title>
        <p>guage varieties within Italy, achieving excellent results
and ranking 1st and 2nd among other teams for Subtask
A and Subtask B, respectively. However, there are still
promising avenues for future research. Utilizing
multitask learning during the fine-tuning phase can improve
consistency and performance by training on multiple
related tasks using the same backbone model. Regarding
model architecture, we aim to investigate the
development of a specific model focused on identifying portions
of the text belonging to specific language varieties. This
model will be designed to identify the distinctive
linguistic features within tweets accurately. By successfully
identifying these features, the model would have the
potential to concentrate on the relevant parts of the text,
which may lead to improved localization capabilities.
Finally, preliminary experiments show that incorporating
curriculum learning techniques during pre-training can
optimize the learning process and enhance the overall
model’s performance.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Table 3 This study was partially carried out within the MICS
Subtask B results. The PT column indicates no-pre-training (Made in Italy – Circular and Sustainable) Extended
Partwith ✗, Task-Specific Training with TST, Joint Training with nership and received funding from Next-GenerationEU
JT, and models ensemble with E. (Italian PNRR – M4 C2, Invest 1.3 – D.D.
1551.11-102022, PE00000004). This study was also partially carried
out within the FAIR (Future Artificial Intelligence
Re6. Discussion search) and received funding from Next-GenerationEU
(Italian PNRR – M4 C2, Invest 1.3 – D.D. 1555.11-10-2022,
Our analysis assessed the efectiveness of widely used PE00000013). This manuscript reflects only the authors’
deep language models in the context of both coarse- and views and opinions, neither the European Union nor the
ifne-grained variety geolocation tasks. We also ofer an European Commission can be considered responsible for
interactive demo10 to showcase our best-performing mod- them.
els and release both the code and pre-trained models11.</p>
      <p>It is worth noting that social media data composing the
ifne-tuning dataset may contain profanities, slurs, hateful References
content, and stereotypes. Although pre-training data is
collected using controlled sources, a similar statement
may apply to them. A community partially manages both
the Dialettando website and Wikipedia portals.
Therefore their content may not be carefully curated. As a
result, the models may exhibit label correlations based
[1] A. Ramponi, Nlp for language varieties of italy:</p>
      <p>Challenges and the path forward, arXiv preprint
arXiv:2209.09757 (2022).
[2] A. Ramponi, C. Casula, GeoLingIt at EVALITA 2023:</p>
      <p>Overview of the geolocation of linguistic variation
in Italy task, in: Proceedings of the Eighth
Evaluation Campaign of Natural Language Processing and
10https://huggingface.co/spaces/DGMS/DANTE-GeoLingIT2023
11https://github.com/MorenoLaQuatra/DANTE-GeoLingIT2023</p>
      <p>Speech Tools for Italian. Final Workshop (EVALITA [13] G. Sarti, M. Nissim, IT5: Large-scale text-to-text
2023), CEUR.org, Parma, Italy, 2023. pretraining for italian language understanding and
[3] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprug- generation, ArXiv preprint 2203.03759 (2022). URL:
noli, G. Venturi, Evalita 2023: Overview of the 8th https://arxiv.org/abs/2203.03759.
evaluation campaign of natural language process- [14] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
ing and speech tools for italian, in: Proceedings M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the
of the Eighth Evaluation Campaign of Natural Lan- limits of transfer learning with a unified text-to-text
guage Processing and Speech Tools for Italian. Final transformer, J. Mach. Learn. Res. 21 (2020).
Workshop (EVALITA 2023), CEUR.org, Parma, Italy, [15] M. La Quatra, L. Cagliero, BART-IT: An Eficient
2023. Sequence-to-Sequence Model for Italian Text
Sum[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, marization, Future Internet 15 (2022) 15.</p>
      <p>L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, [16] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad,
Attention is all you need, in: Advances in Neural A. Mohamed, O. Levy, V. Stoyanov, L.
ZettleInformation Processing Systems, volume 30, Cur- moyer, BART: Denoising sequence-to-sequence
ran Associates, Inc., 2017. URL: https://proceedings. pre-training for natural language generation,
transneurips.cc/paper_files/paper/2017/file/ lation, and comprehension, in: Proceedings of the
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf . 58th Annual Meeting of the Association for
Com[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: putational Linguistics, 2020, pp. 7871–7880. URL:
Pre-training of deep bidirectional transformers for https://aclanthology.org/2020.acl-main.703. doi:10.
language understanding, in: Proceedings of the 18653/v1/2020.acl-main.703.
2019 Conference of the North American Chapter [17] C. Fifty, E. Amid, Z. Zhao, T. Yu, R. Anil, C. Finn,
of the Association for Computational Linguistics: Eficiently identifying task groupings for multi-task
Human Language Technologies, Volume 1 (Long learning, Advances in Neural Information
Processand Short Papers), 2019, pp. 4171–4186. URL: https: ing Systems 34 (2021) 27503–27516.
//aclanthology.org/N19-1423. doi:10.18653/v1/ [18] A. Ramponi, C. Casula, DiatopIt: A corpus of
N19-1423. social media posts for the study of diatopic
lan[6] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, guage variation in Italy, in: Tenth Workshop
a distilled version of bert: smaller, faster, cheaper on NLP for Similar Languages, Varieties and
Diand lighter, ArXiv abs/1910.01108 (2019). alects (VarDial 2023), 2023, pp. 187–199. URL: https:
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, //aclanthology.org/2023.vardial-1.19.</p>
      <p>O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized BERT pretraining
approach, CoRR abs/1907.11692 (2019). URL: http:
//arxiv.org/abs/1907.11692.
[8] A. Conneau, K. Khandelwal, N. Goyal, V.
Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised
crosslingual representation learning at scale, in:
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, 2020, pp. 8440–
8451. URL: https://aclanthology.org/2020.acl-main.</p>
      <p>747. doi:10.18653/v1/2020.acl-main.747.
[9] A. Koudounas, M. La Quatra, L. Vaiani, L. Colomba,</p>
      <p>G. Attanasio, E. Pastor, L. Cagliero, E. Baralis,
Italic: An italian intent classification dataset, arXiv
preprint arXiv:2306.08502 (2023).
[10] S. Schweter, Italian bert and electra models,
2020. URL: https://doi.org/10.5281/zenodo.4263142.</p>
      <p>doi:10.5281/zenodo.4263142.
[11] L. D. Mattei, M. Cafagna, F. Dell’Orletta, M. Nissim,</p>
      <p>M. Guerini, Geppetto carves italian into a language
model, 2020. arXiv:2004.14253.
[12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,</p>
      <p>I. Sutskever, Language models are unsupervised
multitask learners, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>