<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DANTE at GeoLingIt: Dialect-Aware Multi-Granularity Pre-training for Locating Tweets within Italy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Gallipoli</string-name>
          <email>giuseppe.gallipoli@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moreno La Quatra</string-name>
          <email>moreno.laquatra@unikore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Rege Cambrin</string-name>
          <email>daniele.regecambrin@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvatore Greco</string-name>
          <email>salvatore_greco@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Cagliero</string-name>
          <email>luca.cagliero@polito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kore University of Enna</institution>
          ,
          <addr-line>Piazza dell'Università, 94100 Enna EN</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Corso Duca degli Abruzzi 24, 10129 Turin TO</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents an NLP research system designed to geolocate tweets within Italy, a country renowned for its diverse linguistic landscape. Our methodology consists of a two-step process involving pre-training and fine-tuning phases. In the pre-training step, we take a semi-supervised approach and introduce two additional tasks. The primary objective of these tasks is to provide the language model with comprehensive knowledge of language varieties, focusing on both the sentence and token levels. Subsequently, during the fine-tuning phase, the model is adapted explicitly for two subtasks: coarse- and fine-grained variety geolocation. To evaluate the effectiveness of our methodology, we participate in the GeoLingIt 2023 shared task and assess our model's performance using standard metrics. Ablation studies demonstrate the crucial role of the pre-training step in enhancing the model's performance on both tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Linguistic Varieties</kwd>
        <kwd>Region Localization</kwd>
        <kwd>Text Classification and Regression</kwd>
        <kwd>Italian NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Italy is widely recognized for its linguistic diversity, with 20 distinct regions, each characterized by various unique and shared dialects [1]. These dialects exhibit further variations within each region, often associated with specific cities or provinces, and sometimes extend beyond regional boundaries. The intricate nature of Italy's linguistic landscape poses a significant challenge in accurately identifying the origin of a given text within the country.</p>
      <p>This research is conducted in the context of the GeoLingIt shared task [2] at EVALITA 2023 [3]. It focuses on the geolocation of social media data, specifically Twitter posts. The task comprises two subtasks: Coarse-grained variety geolocation (Subtask A), whose aim is to determine the region from which a tweet originates within the 20 Italian regions, and Fine-grained variety geolocation (Subtask B), which focuses on predicting the latitude and longitude coordinates corresponding to the origin of a tweet within Italy. Linguistic variations within and across regions make it difficult to accurately associate a piece of text with its specific geographic origin. The challenge becomes even more significant due to the similarities each language variety may share with other languages, even outside Italy.</p>
      <p>This paper presents the DANTE (Dialect ANalysis TEam)<sup>1</sup> submission for the GeoLingIt 2023 shared task, characterized by a two-step methodology involving pre-training and fine-tuning phases. By leveraging Italian or multilingual models, we propose a semi-supervised pre-training approach that combines standard and novel pre-training tasks to capture regional dialect information at multiple levels of granularity (i.e., sentence and token levels). Following the pre-training phase, the model undergoes a standard fine-tuning process tailored to the two subtasks proposed by the shared task. Through extensive experiments, we demonstrate the effectiveness of our methodology.</p>
      <p><sup>1</sup>The name “DANTE” is inspired by the Italian poet Dante Alighieri, widely regarded as one of the founding fathers of the Italian language.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Text classification is a fundamental task in NLP whose
objective is to assign one (or more) predefined classes to
a piece of text. It has many applications ranging from
sentiment analysis to topic classification. In this work,
we apply it to the prediction of the geographic region
associated with the linguistic variety expressed in a tweet.</p>
      <sec id="sec-2-1">
        <title>1The name “DANTE” is inspired by the Italian poet Dante Alighieri,</title>
        <p>widely regarded as one of the founding fathers of the Italian
language.</p>
        <p>The introduction of the Transformer [4] architec- weights and further pre-train it using both standard and
ture for machine translation has represented a signif- novel pre-training tasks.
icant breakthrough in NLP, achieving superior
performance also in other tasks, including text classification. Masked Language Modeling (MLM) &amp; Next
SenTransformer-based classification models implement an tence Prediction (NSP). The MLM and NSP tasks are
encoder-only architecture whose objective is to extract standard pre-training tasks used to train
Transformera continuous representation from the input text. To do based models. Both tasks contribute to language
processthis, models are generally pre-trained on large corpora ing by helping the model learn the contextual information
of unlabeled text using specific pre-training objectives. of words and their relationships.</p>
        <p>The pre-training stage allows the model to learn
language representations that enable it to capture the struc- Region Classification (RC). By leveraging
regionture and semantics of the text more efectively. Our work specific data, we integrate into pre-training the
superfollows the same approach by further pre-training several vised task of predicting the geographic region associated
Transformer-based models as discussed in Section 3. Af- with the linguistic variety expressed in a given sentence.
ter pre-training, the model is fine-tuned on labeled data
tailored to the desired task. Specifically, the architecture Token-level Region Classification (TRC). We also
is enriched by additional classification layers (i.e., classifi- include an additional (supervised) token classification
cation head) trained in a supervised fashion to output the task. It aims at predicting the geographic region
assoifnal probability for each class. Similarly, by introducing ciated with each token in a given sentence. To create
one or multiple linear layers, (multi-)regression tasks can training examples, we randomly combine multiple
senalso be performed. tences belonging to text snippets labeled with diferent</p>
        <p>Some of the most widely adopted Transformer-based regions. This task aims at enabling the model to capture
classification models include: BERT and its multilingual regional linguistic information with higher granularity.
version mBERT [5], DistilBERT [6], which is a distilled
version of BERT, RoBERTa [7], and its multilingual
version XLM [8], which are two variations of BERT including Using a multi-task learning approach, the model is
dynamic masking. trained on multiple tasks simultaneously, allowing it to</p>
        <p>Computational linguistics research in Italian faces learn a shared representation useful for all tasks. We
dechallenges due to the scarcity of large-scale datasets ifne a separate linear layer for each task (i.e., task-specific
specifically designed for the language, as highlighted re- head) that operates on the shared representation and is
cently [9]. Also, the computational efort required to pre- trained using the corresponding labeled data. We
experitrain language models has resulted in only a few available mented with two diferent multi-task learning setups: (1)
architectures in Italian. Specifically, some of them are task-specific training (TST) , where the model is trained
BERT-Italian and ELECTRA-Italian [10]. Furthermore, on a single task at a time, with each batch randomly
sealthough they are not encoder-only architectures, the fol- lecting one task from the set of all available tasks, and (2)
lowing are some of the other models available in Italian: joint training (JT), where the model is trained on all tasks
GePpeTto [11], which is based on GPT-2 [12], IT5 [13], simultaneously, and the loss is computed as the average
which is the Italian version of T5 [14], and the recently of the losses of all tasks. These two multi-task learning
released BART-IT [15], which is the Italian version of setups were inspired by recent findings in the literature
BART [16]. [17].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Description of the system</title>
      <sec id="sec-3-1">
        <title>The DANTE methodology for the GeoLingIt shared task aims to both identify the region of origin and predict the geographic coordinates of tweets within Italy.</title>
        <sec id="sec-3-1-1">
          <title>3.1. Pre-training</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>The initial phase of our methodology involves pretraining the model to improve its ability to analyze diferent linguistic varieties. We initialize a Transformer-based encoder model using Italian or multilingual pre-trained</title>
        <sec id="sec-3-2-1">
          <title>3.2. Subtask A: Coarse-grained variety geolocation</title>
        </sec>
      </sec>
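        <p>To make the TRC example construction concrete, the following is a minimal sketch assuming a HuggingFace tokenizer; the build_trc_example helper, the region2id mapping, and the sample snippets are illustrative and not part of our released code.</p>
        <preformat>
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def build_trc_example(snippets, region2id, max_length=128):
    """Concatenate sentences labeled with different regions and assign
    each token the region label of the sentence it comes from."""
    input_ids, labels = [], []
    for text, region in snippets:
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        labels.extend([region2id[region]] * len(ids))
    return {"input_ids": input_ids[:max_length], "labels": labels[:max_length]}

# Hypothetical snippets from two different regions.
region2id = {"Campania": 0, "Piemonte": 1}
snippets = [("Comme si bell!", "Campania"), ("Andoma bin!", "Piemonte")]
example = build_trc_example(snippets, region2id)
        </preformat>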
      <sec id="sec-3-3">
        <title>The Subtask A within GeoLingIt 2023 shared task involves</title>
        <p>identifying the region of origin of a given tweet within
Italy. It can be formulated as a classification task, where
the model is trained to classify each tweet into its
corresponding geographic region (i.e., one of the 20 Italian
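        <p>The sketch below contrasts the two setups under stated assumptions: encoder, heads, and loss_fns are generic callables with illustrative names, not our exact training loop.</p>
        <preformat>
import random

def multitask_loss(encoder, heads, loss_fns, task_batches, mode="JT"):
    """Loss for one step under the two setups: TST randomly picks a
    single task per batch, JT averages the losses of all tasks."""
    tasks = list(task_batches)
    if mode == "TST":
        tasks = [random.choice(tasks)]
    losses = []
    for task in tasks:
        inputs, labels = task_batches[task]
        shared = encoder(inputs)           # shared representation
        logits = heads[task](shared)       # task-specific linear head
        losses.append(loss_fns[task](logits, labels))
    return sum(losses) / len(losses)
        </preformat>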
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Subtask A: Coarse-grained variety geolocation</title>
        <p>Subtask A of the GeoLingIt 2023 shared task involves identifying the region of origin of a given tweet within Italy. It can be formulated as a classification task, where the model is trained to classify each tweet into its corresponding geographic region (i.e., one of the 20 Italian regions). To this end, we follow a standard fine-tuning approach, where the pre-trained model is adapted to the downstream task using the labeled training data. The representation of a special [CLS] token is used as the input to a linear layer trained to predict the region of origin of the tweet. The model is trained to minimize the cross-entropy loss between the predicted and the ground-truth labels. For a visual representation of the fine-tuning process, please refer to the top-left part of Figure 1.</p>
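        <p>As a minimal sketch of this fine-tuning setup, assuming a PyTorch/HuggingFace stack (the RegionClassifier class name and the mBERT backbone choice are illustrative):</p>
        <preformat>
import torch
from transformers import AutoModel

class RegionClassifier(torch.nn.Module):
    """Linear classification head over the [CLS] representation."""
    def __init__(self, backbone="bert-base-multilingual-cased", num_regions=20):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, num_regions)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.head(cls)               # one logit per region

# Training minimizes torch.nn.functional.cross_entropy(logits, region_labels).
        </preformat>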
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Subtask B: Fine-grained variety geolocation</title>
        <p>Subtask B aims to localize a given tweet's origin within Italy by predicting its latitude and longitude coordinates. The task can be formulated as a regression problem, where the model is trained to jointly predict two values (i.e., the latitude and longitude coordinates of the tweet's origin) using two separate linear layers. Similar to the previous case, the tweet representation is obtained by feeding the [CLS] token representation to both linear layers. The fine-tuning architecture for this subtask is illustrated in the upper-right corner of Figure 1, showcasing the linear layers positioned on top of the [CLS] token.</p>
        <p>The overall loss function for the regression task is defined as ℒ = ½ (ℒ<sub>lat</sub> + ℒ<sub>lon</sub>), where ℒ<sub>lat</sub> and ℒ<sub>lon</sub> represent the mean squared error (MSE) loss for the latitude and longitude predictions, respectively. A sketch of this two-head setup and loss is given below.</p>
        <p>It is worth noting that the model is separately fine-tuned for each subtask (i.e., coarse- and fine-grained variety geolocation). Thus, we do not use multi-task learning at this stage. Jointly optimizing the two subtasks during fine-tuning could help the model ensure consistency between them and improve its performance; this is one of the possible future directions we plan to explore.</p>
        <p>[Figure 1: Fine-tuning architecture. It includes two branches: one for Subtask A, which predicts the region class (represented as “R”), and another for Subtask B, which predicts the latitude (“Lat”) and longitude (“Lon”).]</p>
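        <p>A minimal sketch of the two-head regression and the loss above, again assuming a PyTorch/HuggingFace stack (class and function names are illustrative):</p>
        <preformat>
import torch
from transformers import AutoModel

class CoordinateRegressor(torch.nn.Module):
    """Two linear heads over the [CLS] representation: latitude and longitude."""
    def __init__(self, backbone="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.lat_head = torch.nn.Linear(hidden, 1)
        self.lon_head = torch.nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.lat_head(cls).squeeze(-1), self.lon_head(cls).squeeze(-1)

def regression_loss(pred_lat, pred_lon, gold_lat, gold_lon):
    """L = 1/2 (L_lat + L_lon), both mean squared errors."""
    mse = torch.nn.functional.mse_loss
    return 0.5 * (mse(pred_lat, gold_lat) + mse(pred_lon, gold_lon))
        </preformat>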
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
      <sec id="sec-4-1">
        <title>4.1. Pre-training Dataset</title>
        <sec id="sec-4-1-1">
          <title>To the best of our knowledge, there are no existing large</title>
          <p>CLS scale data collections specifically focusing on Italian
language varieties. Therefore, we exploited web
scrapTransformer Encoder ing to construct our pre-training dataset. From a web
Model search, we identified the following two sources: (1)
Dialettando2: a website that contains several proverbs, sayings,
poems, rhymes, and stories from diferent regions; (2)</p>
          <p>CLS Wikipedia: which comprises specific versions for some
reFigure 1: Fine-tuning architecture. It includes two branches: gional languages (e.g., nap3 for Neapolitan). They were
one for Subtask A which predicts the region class (represented both accessed in January 2023. For the data collected
as “R”), and another for Subtask B which predicts the latitude from Wikipedia, we associated each language-specific
(represented as “Lat”) and longitude (represented as “Lon”). Wikipedia portal with the region primarily
representing the respective language. For example, data collected
from the nap Wikipedia portal would be associated with
the Campania region, which predominantly represents
the cross-entropy loss between the predicted and the the Neapolitan language. After the data collection, we
ground-truth labels. For a visual representation of the ended up with a corpus of 273,011 documents containing
ifne-tuning process, please refer to the top-left part of linguistic varieties of Italian from diferent regions. Out
Figure 1. of these, 12,692 documents were collected from
Dialettando, while the majority, 260,319, were obtained from
3.3. Subtask B: Fine-grained variety Wikipedia. From Dialettando, we also collected the 12,692
geolocation Italian translations of the same documents in the corpus.</p>
          <p>This was done because the DiatopIt corpus utilized in the
The Subtask B aims to localize a given tweet’s origin task contains instances that encompass regional Italian
within Italy by predicting its latitude and longitude co- variations. Therefore, the final pre-training dataset is
ordinates. The task can be formulated as a regression composed of 285,703 documents. Notice that a document
problem, where the model is trained to jointly predict can be a Wikipedia article or any text from Dialettando
two values (i.e., the latitude and longitude coordinates (e.g., proverb, saying, or story) without any diference,
of the tweet’s origin) using two separate linear layers. even if they can have diferent lengths. Indeed, we found
Similar to the previous case, the tweet representation that the mean number of tokens is 48 for Dialettando and
is obtained by feeding the [CLS] token representation 147 for Wikipedia4. However, both sources of texts can
to both linear layers. The fine-tuning architecture for be helpful during the pre-training phase.
this subtask is illustrated in the upper right corner of Figure 2 shows the distribution of the collected
docuFigure 1, showcasing the linear layers positioned on top ments for each Italian region. Table 1 details the
numof the [CLS] token. ber of documents for each region and data source. As</p>
          <p>The overall loss function for the regression task is can be noticed, regions of the north of Italy, such as
defined as: Piemonte, Lombardia, and Veneto, are predominant in
1 the dataset, with approximately 60k texts (corresponding
ℒ = 2 (ℒ + ℒ) to approximately 20% of the entire collection) each. They
are followed by some regions of the south, such as Sicilia,
Where ℒ and ℒ represent the mean squared er- Campania, and Puglia, with around 25k, 14k, and 11k
ror (MSE) loss for the latitude and longitude predictions, texts, respectively. Finally, regions such as Valle D’Aosta,
respectively. Toscana, Umbria, Marche, Lazio, Molise, Abruzzo, and
It is worth noting that the model is separately fine-tuned
for each task (i.e., coarse- and fine-grained variety
geolocation). Thus, we do not use multi-task learning at this
stage. Jointly optimizing the two tasks during fine-tuning
could help the model to ensure consistency between the</p>
        </sec>
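        <p>As an illustration of the portal-to-region association described above (only the nap → Campania pair is stated in the text; the remaining pairs are plausible examples of such a mapping, used here purely for illustration):</p>
        <preformat>
# Wikipedia language codes of regional portals mapped to the Italian
# region primarily representing the respective language.
WIKI_CODE_TO_REGION = {
    "nap": "Campania",   # Neapolitan (stated in the text)
    "scn": "Sicilia",    # Sicilian (illustrative)
    "pms": "Piemonte",   # Piedmontese (illustrative)
    "vec": "Veneto",     # Venetian (illustrative)
    "lmo": "Lombardia",  # Lombard (illustrative)
}
        </preformat>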
        <sec id="sec-4-1-2">
          <title>2https://www.dialettando.com</title>
          <p>3https://nap.wikipedia.org/
4Computed with the bert-base-multilingual-cased
tokenizer of the HuggingFace library https://huggingface.co/
bert-base-multilingual-cased.
Abruzzo
Basilicata
Calabria
Campania
Emilia Romagna
Friuli Venezia Giulia
Lazio
Liguria
Lombardia
Marche
Molise
Piemonte
Puglia
Sardegna
Sicilia
Toscana
Trentino Alto Adige
Umbria
Valle D’Aosta
Veneto</p>
          <p>Total</p>
          <p>Abbr.</p>
          <p>ABR
BAS
CAL
CAM
EMI
FRI
LAZ
LIG
LOM
MAR
MOL
PIE
PUG
SAR
SIC
TOS
TRE
UMB
VAL
VEN
# Documents
Dial. Wiki</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. GeoLingIt Dataset</title>
        <sec id="sec-4-2-1">
          <title>The GeoLingIt 2023 dataset for the Subtasks A and B is</title>
          <p>DiatopIt [18], a corpus of diatopic variations of language
in Italy. It is composed of geotagged social media posts
from Twitter. Each tweet also comprises the associated
latitude, longitude, and the Italian region of origin. The
dataset contains 13,669 examples for training, 552 for
validation, and 818 for testing. This dataset is exploited in
the fine-tuning phase to specialize the models to
coarseand fine-grained variety geolocation. The authors of Experimental settings The pre-training phase lasts
the competition have already anonymized data. Specif- five epochs and utilizes the Adam optimizer with a linear
ically, user mentions, email addresses, and URLs have learning rate scheduler. The scheduler includes a warmup
been replaced with specific placeholders for privacy rea- period (10% of the total training steps) followed by a
sons. However, the content of tweets is unfiltered and linear decay of the learning rate until the end of training.
can exhibit non-standard language use (e.g., insults, bad The fine-tuning phase lasts for ten epochs and utilizes
words). the same settings for the optimizer and scheduler as the
pre-training phase.
considered: (1) multilingual BERT model (mBERT)5, (2)
BERT model pre-trained on 13GB of Italian text
(BERTIT)6, (3) BERT model pre-trained on 81GB of Italian text
(BERT-IT-XXL)7, (4) XLM-RoBERTa (XLM-R)8, and (5)
multilingual DistilBERT (dBERT)9. All models are used
in their cased versions. By comparing the results of these
baseline models with our approach, we can assess the
benefits of the proposed pre-training phase.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>Baseline models We compare our models with base</title>
        <p>line models pre-trained on Italian or multilingual data, 5https://huggingface.co/bert-base-multilingual-cased
which undergo the same fine-tuning process described in 6https://huggingface.co/dbmdz/bert-base-italian-cased
Sections 3.2 and 3.3. The following baseline models are 7https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
8https://huggingface.co/xlm-roberta-base
9https://huggingface.co/distilbert-base-multilingual-cased</p>
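      <p>A sketch of this optimization setup, assuming PyTorch and the transformers scheduler utility (the helper name and the unspecified learning rate are illustrative):</p>
      <preformat>
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer_and_scheduler(model, lr, total_steps, warmup_frac=0.1):
    """Adam with a 10% linear warmup followed by linear decay."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
      </preformat>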
        <sec id="sec-5-1-1">
          <title>5.1. Coarse-grained variety geolocation</title>
          <p>We report the single models and an ensemble, which in- LogReg
cludes all models evaluated on the development set. The MFB
ensemble prediction is obtained through majority voting
on the individual models’ predictions. In case of a tie, a
random selection is employed. The organizers provide
Logistic Regression (LogReg) and Most Frequent
Baseline (MFB) as baselines. According to the competition
rules, we consider macro F1-score, precision, and recall
as evaluation metrics.</p>
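        <p>A minimal sketch of the majority-voting ensemble with random tie-breaking (the function name is illustrative):</p>
        <preformat>
import random
from collections import Counter

def majority_vote(predictions):
    """predictions: the region predicted by each model for one tweet."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [region for region, count in counts.items() if count == top]
    return random.choice(tied)  # random selection in case of a tie

# e.g. majority_vote(["LOM", "PIE", "LOM"]) returns "LOM"
        </preformat>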
        <p>In Table 2, we report the development and test set results. There is a noticeable performance improvement when comparing the deep learning techniques to the classical machine learning methods. One common aspect shared by both approaches is the observed degradation in performance on the test set compared to the development set. This pattern can be attributed to the fact that the test set contains additional regions that are not present in the development set. Joint Training (JT) consistently yields the best results in terms of F1-score, achieving significant improvements ranging from +2% to +7% compared to the absence of pre-training. This boost primarily manifests as enhanced precision.</p>
        <p>[Table 2: Subtask A results on the development and test sets.]</p>
        <p>Following the GeoLingIt guidelines, we can submit only three models for test set evaluation. We selected the top-3 models based on their performance on the development set: the Jointly-Trained BERT-IT-XXL, the Task-Specific-Trained BERT-IT-XXL, and the models' Ensemble. We show in Table 2b the performance of these models on the test set. The results show that the Ensemble method achieved the highest performance, followed by the Task-Specific-Trained (TST) BERT-IT-XXL model and the Jointly-Trained (JT) BERT-IT-XXL model. Surprisingly, the TST pre-training approach outperformed the others, exhibiting a significant +2% improvement in F1-score compared to the corresponding model pre-trained using JT.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Fine-grained variety geolocation</title>
        <p>The task organizers provide K-nearest-neighbor (KNN) and centroid baseline (CB) models as baselines. Following the shared task guidelines, the models' performance is assessed using the haversine distance; in this case, the lower, the better. We report the results of single models and of an ensemble of the top-2 evaluated models. The ensemble prediction is obtained using the mean point between the two individual models' predictions; both the metric and the ensemble are sketched below.</p>
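        <p>For reference, a minimal sketch of the haversine distance (in kilometers, on a spherical Earth) and of the mean-point ensemble:</p>
        <preformat>
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def mean_point(pred1, pred2):
    """Top-2 ensemble: the mean point of two (lat, lon) predictions."""
    return ((pred1[0] + pred2[0]) / 2, (pred1[1] + pred2[1]) / 2)
        </preformat>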
        <p>In Table 3, we report the results for the evaluated models on the development and test sets. Similar to Subtask A, deep learning models outperform classical approaches. Notably, the test set performance in this task shows higher scores than the development set.</p>
        <p>The results on the development set confirm the effectiveness of the pre-training, with one exception: Task-Specific Training on BERT-IT. The differences span from −19 km to −161 km with respect to the models without a specific pre-training. In most cases, Joint Training consistently yields the best results, except for XLM-R. We submit the top-3 most promising solutions according to the development set results for evaluation on the test set: the Jointly-Trained dBERT, the Task-Specific-Trained dBERT, and the Ensemble. We report their performance on the test set in Table 3b. The Ensemble model achieved the best performance, with the TST pre-training demonstrating a 1.5 km average distance improvement compared to the JT counterpart.</p>
        <p>[Table 3: Subtask B results. The PT column indicates no pre-training with ✗, Task-Specific Training with TST, Joint Training with JT, and the models ensemble with E.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Our analysis assessed the effectiveness of widely used deep language models in the context of both coarse- and fine-grained variety geolocation tasks. We also offer an interactive demo<sup>10</sup> to showcase our best-performing models and release both the code and the pre-trained models<sup>11</sup>.</p>
      <p>It is worth noting that the social media data composing the fine-tuning dataset may contain profanities, slurs, hateful content, and stereotypes. Although the pre-training data is collected using controlled sources, a similar statement may apply to it. Both the Dialettando website and the Wikipedia portals are partially managed by a community; therefore, their content may not be carefully curated. As a result, the models may exhibit label correlations based on the presence of such offensive language, potentially influencing their region identification capabilities. The proposed methodologies are not intended to offend anyone; since they may be inaccurate in some cases, improprieties are possible.</p>
      <p><sup>10</sup>https://huggingface.co/spaces/DGMS/DANTE-GeoLingIT2023</p>
      <p><sup>11</sup>https://github.com/MorenoLaQuatra/DANTE-GeoLingIT2023</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion and Future Works</title>
      <sec id="sec-6-1">
        <title>This paper presents an efective solution for modeling lan</title>
        <p>guage varieties within Italy, achieving excellent results
and ranking 1st and 2nd among other teams for Subtask
A and Subtask B, respectively. However, there are still
promising avenues for future research. Utilizing
multitask learning during the fine-tuning phase can improve
consistency and performance by training on multiple
related tasks using the same backbone model. Regarding
model architecture, we aim to investigate the
development of a specific model focused on identifying portions
of the text belonging to specific language varieties. This
model will be designed to identify the distinctive
linguistic features within tweets accurately. By successfully
identifying these features, the model would have the
potential to concentrate on the relevant parts of the text,
which may lead to improved localization capabilities.
Finally, preliminary experiments show that incorporating
curriculum learning techniques during pre-training can
optimize the learning process and enhance the overall
model’s performance.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Table 3 This study was partially carried out within the MICS
Subtask B results. The PT column indicates no-pre-training (Made in Italy – Circular and Sustainable) Extended
Partwith ✗, Task-Specific Training with TST, Joint Training with nership and received funding from Next-GenerationEU
JT, and models ensemble with E. (Italian PNRR – M4 C2, Invest 1.3 – D.D.
1551.11-102022, PE00000004). This study was also partially carried
out within the FAIR (Future Artificial Intelligence
Re6. Discussion search) and received funding from Next-GenerationEU
(Italian PNRR – M4 C2, Invest 1.3 – D.D. 1555.11-10-2022,
Our analysis assessed the efectiveness of widely used PE00000013). This manuscript reflects only the authors’
deep language models in the context of both coarse- and views and opinions, neither the European Union nor the
ifne-grained variety geolocation tasks. We also ofer an European Commission can be considered responsible for
interactive demo10 to showcase our best-performing mod- them.
els and release both the code and pre-trained models11.</p>
      <p>It is worth noting that social media data composing the
ifne-tuning dataset may contain profanities, slurs, hateful References
content, and stereotypes. Although pre-training data is
collected using controlled sources, a similar statement
may apply to them. A community partially manages both
the Dialettando website and Wikipedia portals.
Therefore their content may not be carefully curated. As a
result, the models may exhibit label correlations based
[1] A. Ramponi, Nlp for language varieties of italy:</p>
      <p>Challenges and the path forward, arXiv preprint
arXiv:2209.09757 (2022).
[2] A. Ramponi, C. Casula, GeoLingIt at EVALITA 2023:</p>
      <p>Overview of the geolocation of linguistic variation
in Italy task, in: Proceedings of the Eighth
Evaluation Campaign of Natural Language Processing and
10https://huggingface.co/spaces/DGMS/DANTE-GeoLingIT2023
11https://github.com/MorenoLaQuatra/DANTE-GeoLingIT2023</p>
      <p>Speech Tools for Italian. Final Workshop (EVALITA [13] G. Sarti, M. Nissim, IT5: Large-scale text-to-text
2023), CEUR.org, Parma, Italy, 2023. pretraining for italian language understanding and
[3] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprug- generation, ArXiv preprint 2203.03759 (2022). URL:
noli, G. Venturi, Evalita 2023: Overview of the 8th https://arxiv.org/abs/2203.03759.
evaluation campaign of natural language process- [14] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
ing and speech tools for italian, in: Proceedings M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the
of the Eighth Evaluation Campaign of Natural Lan- limits of transfer learning with a unified text-to-text
guage Processing and Speech Tools for Italian. Final transformer, J. Mach. Learn. Res. 21 (2020).
Workshop (EVALITA 2023), CEUR.org, Parma, Italy, [15] M. La Quatra, L. Cagliero, BART-IT: An Eficient
2023. Sequence-to-Sequence Model for Italian Text
Sum[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, marization, Future Internet 15 (2022) 15.</p>
      <p>L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, [16] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad,
Attention is all you need, in: Advances in Neural A. Mohamed, O. Levy, V. Stoyanov, L.
ZettleInformation Processing Systems, volume 30, Cur- moyer, BART: Denoising sequence-to-sequence
ran Associates, Inc., 2017. URL: https://proceedings. pre-training for natural language generation,
transneurips.cc/paper_files/paper/2017/file/ lation, and comprehension, in: Proceedings of the
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf . 58th Annual Meeting of the Association for
Com[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: putational Linguistics, 2020, pp. 7871–7880. URL:
Pre-training of deep bidirectional transformers for https://aclanthology.org/2020.acl-main.703. doi:10.
language understanding, in: Proceedings of the 18653/v1/2020.acl-main.703.
2019 Conference of the North American Chapter [17] C. Fifty, E. Amid, Z. Zhao, T. Yu, R. Anil, C. Finn,
of the Association for Computational Linguistics: Eficiently identifying task groupings for multi-task
Human Language Technologies, Volume 1 (Long learning, Advances in Neural Information
Processand Short Papers), 2019, pp. 4171–4186. URL: https: ing Systems 34 (2021) 27503–27516.
//aclanthology.org/N19-1423. doi:10.18653/v1/ [18] A. Ramponi, C. Casula, DiatopIt: A corpus of
N19-1423. social media posts for the study of diatopic
lan[6] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, guage variation in Italy, in: Tenth Workshop
a distilled version of bert: smaller, faster, cheaper on NLP for Similar Languages, Varieties and
Diand lighter, ArXiv abs/1910.01108 (2019). alects (VarDial 2023), 2023, pp. 187–199. URL: https:
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, //aclanthology.org/2023.vardial-1.19.</p>
      <p>O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized BERT pretraining
approach, CoRR abs/1907.11692 (2019). URL: http:
//arxiv.org/abs/1907.11692.
[8] A. Conneau, K. Khandelwal, N. Goyal, V.
Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised
crosslingual representation learning at scale, in:
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, 2020, pp. 8440–
8451. URL: https://aclanthology.org/2020.acl-main.</p>
      <p>747. doi:10.18653/v1/2020.acl-main.747.
[9] A. Koudounas, M. La Quatra, L. Vaiani, L. Colomba,</p>
      <p>G. Attanasio, E. Pastor, L. Cagliero, E. Baralis,
Italic: An italian intent classification dataset, arXiv
preprint arXiv:2306.08502 (2023).
[10] S. Schweter, Italian bert and electra models,
2020. URL: https://doi.org/10.5281/zenodo.4263142.</p>
      <p>doi:10.5281/zenodo.4263142.
[11] L. D. Mattei, M. Cafagna, F. Dell’Orletta, M. Nissim,</p>
      <p>M. Guerini, Geppetto carves italian into a language
model, 2020. arXiv:2004.14253.
[12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,</p>
      <p>I. Sutskever, Language models are unsupervised
multitask learners, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>