<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>barotti at GeoLingIt: Beyond Boundaries, Enhancing Geolocation Prediction and Dialect Classification on Social Media in Italy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alkis Koudounas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio Giobergia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irene Benedetto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Monaco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Cagliero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Apiletti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Baralis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Control and Computer Engineering</institution>
          ,
          <addr-line>Politecnico di Torino, Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MAIZE</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The proliferation of social media platforms has presented researchers with valuable avenues to examine language usage within diverse sociolinguistic frameworks. Italy, renowned for its rich linguistic diversity, provides a distinctive context for exploring diatopic variation, encompassing regional languages, dialects, and variations of Standard Italian. This paper presents our contributions to the GeoLingIt shared task, focusing on predicting the locations of social media posts in Italy based on linguistic content. For Task A, we propose a novel approach, combining data augmentation and contrastive learning, that outperforms the baseline in region prediction. For Task B, we introduce a joint multi-task learning approach leveraging the synergies with Task A and incorporate a post-processing rectification module for improved geolocation accuracy, surpassing the baseline and achieving first place in the competition.</p>
      </abstract>
      <kwd-group>
<kwd>natural language processing</kwd>
        <kwd>dialect localization</kwd>
        <kwd>diatopic variation</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The advent of social media has significantly facilitated the investigation of language usage across diverse sociolinguistic aspects. Italy, in particular, stands out as a compelling case study due to its remarkable diatopic variation, encompassing an array of local languages, dialects, and regional manifestations of Standard Italian within a relatively confined geographic area [<xref ref-type="bibr" rid="ref1">1</xref>]. This linguistic heterogeneity stems from historical and cultural influences, with distinct lexicons, grammatical structures, and pronunciations shaping the various language varieties present in the country, each bearing the imprints of historical events, geographical isolation, and cultural traditions. Furthermore, the integration of regional varieties of Standard Italian further enriches the linguistic mosaic of Italy [<xref ref-type="bibr" rid="ref2">2</xref>]. Within the digital realm, particularly on platforms like Twitter, Italian speakers leverage these linguistic variations to express their social identities and affiliations, thereby contributing to the visibility and preservation of these diverse linguistic forms in the online domain. This intriguing sociolinguistic phenomenon has attracted researchers from the computational linguistics and sociolinguistics domains, providing valuable insights into the nuances of language variation in Italy.</p>
      <p>The GeoLingIt shared task [<xref ref-type="bibr" rid="ref3">3</xref>] at Evalita 2023 [<xref ref-type="bibr" rid="ref4">4</xref>] aims to advance the current knowledge of linguistic variation in Italy by focusing on the prediction of the locations of social media posts from Twitter based solely on linguistic content. In this paper, we present our contributions to the GeoLingIt shared task. GeoLingIt proposes two separate tasks: Subtask A, a classification task that aims to identify the region of provenance of a tweet exhibiting non-Standard Italian language, and Subtask B, a regression task to identify the fine-grained location of provenance of the same tweets, in terms of longitude and latitude coordinates. Both tasks are based on the DiatopIt dataset [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
      <p>For Task A, we first gather additional data sources specifically for the various Italian regions. We propose a novel approach involving a pre-training step of state-of-the-art transformer-based models with a contrastive learning strategy leveraging data augmentation techniques. This approach outperforms the baseline, demonstrating the effectiveness of leveraging pre-training and contrastive learning to improve the accuracy of region prediction. For Task B, we introduce a joint multi-task learning approach that addresses the challenge of fine-grained variety geolocation. Our approach outperforms the baseline by simultaneously tackling both tasks. Additionally, we introduce a post-processing rectification module that refines the predicted coordinates and ensures their alignment within the boundaries of Italy. This module enhances the reliability of the predicted locations, making them more precise and geographically accurate.</p>
<p>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT.
alkis.koudounas@polito.it (A. Koudounas)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
<p>Our proposed methods not only achieve state-of-the-art performance in terms of location prediction (with a test error lower than 100 km, compared to an error of more than 250 km for the baseline methods) but also offer some valuable insights into Italy’s diverse linguistic landscape. The code to reproduce all our experiments is available at https://github.com/koudounasalkis/barotti-GeoLingIt2023.</p>
    </sec>
<sec id="sec-2">
      <title>2. Related work</title>
      <p>The analysis of linguistic varieties and dialects is an emerging topic in the field of Natural Language Processing (NLP) [<xref ref-type="bibr" rid="ref6">6</xref>]. Efforts have been made to address and incorporate variations in corpora, such as pronunciation and spelling differences. Gelly et al. [<xref ref-type="bibr" rid="ref7">7</xref>] and Elfeky et al. [<xref ref-type="bibr" rid="ref8">8</xref>] addressed language varieties for speech recognition of the English language, emphasizing that the task is even more challenging as the space of dialects is broad. Many researchers have attempted to address the challenge of language variation by leveraging social media data. Grieve et al. [<xref ref-type="bibr" rid="ref9">9</xref>] compared regional patterns, analyzing both dialect labels and geolocation, finding strong correlations between the two sources. Sadat et al. [<xref ref-type="bibr" rid="ref10">10</xref>] have shown that probabilistic models of language identification can be used to identify Arabic dialects in tweets. Efforts in the direction of propagating information (e.g., sentiment) from high-resource languages (e.g., Italian) to low-resource ones (e.g., regional variations) through vector space alignments have shown promising results, as shown by Giobergia et al. [<xref ref-type="bibr" rid="ref11">11</xref>] across languages (from English to other ones). Recently, Italian computational linguistics research has encountered difficulties due to the limited availability of large-scale datasets specifically tailored for the language, as emphasized in a recent study [<xref ref-type="bibr" rid="ref12">12</xref>]. Unfortunately, the substantial computational resources needed for pre-training language models have resulted in only a few architectures being accessible for Italian.</p>
      <p>Moving to the geolocation task, Han et al. [<xref ref-type="bibr" rid="ref13">13</xref>] proposed a method for geolocation prediction based on identifying location-indicative words. Nevertheless, the work was not focused on dialects. Eisenstein et al. [<xref ref-type="bibr" rid="ref14">14</xref>] inspected the correlation between geographical information and sociolinguistic associations instead of predicting the demographic attributes of users based on their tweets and their position. Other works have focused on the geolocation task [<xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>], but they do not take into account the language varieties.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section outlines the methodology adopted to automatically ascertain the region (Task A) and the coordinates, in terms of latitude and longitude (Task B), of the origin of a tweet, considering the pronounced class imbalance prevalent within the training dataset. To tackle these challenges, we propose to use two key techniques: data augmentation with contrastive pre-training, and multi-task learning.</p>
<sec id="sec-3-4">
        <title>3.1. Task A</title>
        <p>The goal of Task A is to identify the origin region of a given tweet. We denote the set of regions as R. Data augmentation plays a crucial role in our methodology, as it is implemented to address the class imbalance during the training process. In the initial data collection phase, we obtain a substantial amount of regional dialect data from various online sources (Italian Wikipedia: it.wikipedia.org; Dialettando: dialettando.com; accessed May 2023). Dumps, editions, and further information on the collected data are available in the official repository. We then pre-process them, obtaining an expanded vocabulary that is utilized for the purpose of data augmentation. We denote the vocabulary for each region r as V_r (r ∈ R).</p>
        <p>We adopt a substitution approach to words in tweets representing language variations to build an augmented version of the original dataset. Each tweet t = {w_1, ..., w_n} belonging to region r is augmented by randomly replacing words that are contained in V_r with other words from the same region, with a random probability. More formally, each term w_i ∈ V_r is replaced with w′_i, defined as follows:</p>
        <p>w′_i = w̃ ∼ V_r if p ≤ p′, and w′_i = w_i otherwise, (1)</p>
        <p>where p′ is the probability of replacing each term w_i with a different one w̃ drawn from the same region’s vocabulary V_r, and p is sampled uniformly at random for each term. We experimentally observe the best results in terms of performance for p′ = 0.5.</p>
        <p>Regarding the contrastive learning strategy, we pre-train the model to enhance its ability to discern whether two tweets belong to the same region. During this preliminary training phase, the model learns to differentiate between tweet pairs and their corresponding regional affiliations. Given two tweets, denoted as t_i and t_j, along with their labels y_i and y_j, the model is trained to minimize a loss that facilitates this discrimination:</p>
        <p>ℒ_contr = − log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(sim(z_i, z_k)/τ) ], (2)</p>
        <p>where z_i and z_j are the latent representations learned by the model for the tweets t_i and t_j. We set the temperature parameter τ equal to 1, and the function sim is the cosine similarity. In this approach, we randomly select a sample from the dataset to serve as an anchor. We then create a positive data point by augmenting the anchor with words from the same region and a negative sample by augmenting the anchor with words from a different region.</p>
        <p>We call this approach Contrastive PT &amp; Data++ (see Section 5 for more details). Pre-training with contrastive learning and data augmentation techniques is specifically devised to mitigate the challenges posed by the imbalanced class distribution within the dataset. This has been proven beneficial in several tasks and domains [<xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>].</p>
        <p>Additionally, we empirically observed that a simple logistic regression model performs well in terms of F1 score on various minority classes. We attribute this to its lower model capacity, reducing the amount of overfitting that may occur on minority classes. To leverage this insight, we propose an exclusive class assignment mechanism (named Entropy-based Ensemble in the following) that uses the confidence of the BERT-based model (further information about the model in Section 4). In other words, when the BERT-based prediction is made with low confidence (according to a specific empiric threshold), we replace the overall prediction with the one made by the logistic regression if the latter’s confidence is higher.</p>
      </sec>
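<p>As an illustration, the substitution augmentation of Eq. (1) and the anchor/positive/negative construction used for the contrastive pre-training can be sketched as follows. This is a minimal sketch: the region names and vocabularies are toy assumptions, not the collected data.</p>
<preformat>
```python
import random

def augment(tokens, vocab, p_prime=0.5, rng=random):
    """Eq. (1): replace each token that appears in the region vocabulary,
    with probability p', by another word drawn from the same vocabulary."""
    out = []
    for w in tokens:
        if w in vocab and p_prime > rng.random():
            out.append(rng.choice(sorted(vocab)))
        else:
            out.append(w)
    return out

def contrastive_triplet(tokens, region, vocabs, rng=random):
    """Build (anchor, positive, negative): the positive augments the anchor
    with words from its own region, the negative with words from another one."""
    other = rng.choice([r for r in vocabs if r != region])
    positive = augment(tokens, vocabs[region], rng=rng)
    negative = augment(tokens, vocabs[other], rng=rng)
    return tokens, positive, negative
```
</preformat>
<p>The positive/negative pairs produced this way feed the contrastive objective of Eq. (2).</p>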
<sec id="sec-3-5">
        <title>3.2. Task B</title>
        <p>Task B is addressed jointly with Task A through a multi-task learning approach based on a weighted conjunction of loss functions. For Task A, we aim to maximize the F1 score by minimizing the corresponding cross-entropy loss. For Task B, we minimize the Haversine distance by minimizing the mean squared error (MSE) loss, which helps to reduce the difference between the predicted and target coordinates. Given a tweet t, the model minimizes a loss function ℒ_MT that encompasses both tasks, i.e., the weighted conjunction of a standard cross-entropy loss for classification ℒ_CE and an L2 loss for regression ℒ_MSE:</p>
        <p>ℒ_MT = ℒ_CE + λ ℒ_MSE, (3)</p>
        <p>where:</p>
        <p>ℒ_CE = ℒ(ŷ, y) = − Σ_{c=1}^{C} y_c log(ŷ_c), (4)</p>
        <p>ℒ_MSE = ℒ(x̂_lat, x_lat) + ℒ(x̂_lon, x_lon), (5)</p>
        <p>with each term of ℒ_MSE being the squared error on the corresponding coordinate.</p>
      </sec>
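<p>A minimal sketch of the weighted conjunction of Eqs. (3)–(5), written in plain Python with a one-hot region target. The λ weight and the dictionary-based interface are illustrative assumptions, not the paper’s implementation.</p>
<preformat>
```python
import math

def multitask_loss(probs, region, coords_pred, coords_true, lam=1.0):
    """Weighted conjunction of the two objectives (Eqs. 3-5):
    cross-entropy for the region class, squared error for (lat, lon)."""
    ce = -math.log(probs[region])  # Eq. (4) with a one-hot target
    mse = sum((a - b) ** 2 for a, b in zip(coords_pred, coords_true))  # Eq. (5)
    return ce + lam * mse  # Eq. (3)
```
</preformat>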
<sec id="sec-3-6">
        <p>We estimate the confidence of the BERT-based model using the entropy of its predicted probabilities. Lower entropy is associated with high certainty (i.e., the model predicts one class with high probability and all the others with low probability), and vice versa.</p>
        <p>For the training of the logistic regression, we use as input, for each tweet t, the respective bag of words as well as the vector c = {|t ∩ V_r|, r ∈ R}, i.e., the number of words within each tweet that belong to each region in our dictionary.</p>
      </sec>
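<p>The entropy-based fallback described above can be sketched as follows. The threshold value is illustrative; the empiric threshold used in the experiments is not reproduced here.</p>
<preformat>
```python
import math

def entropy(probs):
    # Shannon entropy of a predicted distribution; low entropy = high confidence.
    return -sum(p * math.log(p) for p in probs if p > 0)

def ensemble_predict(bert_probs, lr_probs, threshold=0.5):
    """Exclusive class assignment: keep the BERT prediction unless its
    entropy exceeds the threshold (low confidence) AND the logistic
    regression is more confident (lower entropy); then use the LR one."""
    h_bert, h_lr = entropy(bert_probs), entropy(lr_probs)
    if h_bert > threshold and h_bert > h_lr:
        return max(range(len(lr_probs)), key=lambda i: lr_probs[i])
    return max(range(len(bert_probs)), key=lambda i: bert_probs[i])
```
</preformat>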
    </sec>
<sec id="sec-4">
      <title>4. Experimental setting</title>
      <p>Models. We consider various models, including Italian BERT [<xref ref-type="bibr" rid="ref20">20</xref>] cased and uncased versions, LABSE [<xref ref-type="bibr" rid="ref21">21</xref>], and BART-IT [<xref ref-type="bibr" rid="ref22">22</xref>], models pre-trained for the Italian language. We find the best model, based on the performance obtained on the validation set, to be bert-base-italian-uncased; thus, this is the base model used to address the tasks. All the pre-trained checkpoints of these models are taken from the Hugging Face hub repository (https://huggingface.co/dbmdz/bert-base-italian-uncased, latest access: May 2023).</p>
      <p>Hyperparameter Setup. We ran a manual hyperparameter search and followed fine-tuning procedures and guidelines from the relevant literature. We provide detailed information about the models used for the evaluation, the hyperparameter setup, and the fine-tuning procedure in the official project repository.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Table 1 presents results for Task A, evaluating different methods based on their validation and test F1 macro scores. The proposed approach, which combines contrastive pre-training and data augmentation, demonstrated superior performance compared to the other methods. It achieved the highest F1 scores on the validation and test sets, reaching 72.61% and 53.18%, respectively. We believe that the reason for this performance difference lies in the presence of new out-of-distribution samples in the test set, which our model struggles to recognize accurately. Interestingly, while the proposed multi-task model excels in Task B (Multi-Task FT), surpassing the baseline models’ performance as shown later, it fails to deliver satisfactory results for Task A. Nonetheless, it still outperforms the baseline. Conversely, the Entropy-based Ensemble method, which enhances BERT’s performance with the logistic regression’s one, achieves a high score on the validation set. However, it only slightly outperforms the Multi-Task approach on the test set, with an improvement of 0.02%.</p>
      <p>In Table 2, various methods are evaluated for Task B based on the Haversine distance metric. The Multi-Task approach demonstrated substantial enhancements compared to the baselines, achieving a validation distance of 111.05 km and a test distance of 120.02 km. This emphasizes that incorporating a multi-objective function helps the model better tackle the given task. Moreover, training the model in a multi-task manner, starting from the pre-trained model that underwent contrastive learning and data augmentation and was fine-tuned for Task A (referred to as “Continuous Learning” in Table 2), resulted in a significant performance boost (test distance of 98.79 km). This improvement is likely attributed to the model already possessing domain knowledge, leading to improved performance on the test set. Lastly, the additional “Beyond-Boundaries” rectification module effectively refines the precision of the model, achieving the best performance on the test set (97.74 km) and securing the first position in the GeoLingIt challenge.</p>
      <p>Table 2 — Haversine distance (km), validation / test. Baselines: 301.65 / 281.04 and 281.03 / 263.35; Multi-Task: 111.05 / 120.02; Continuous Learning: 99.50 / 98.79; Beyond-Boundaries: 98.41 / 97.74.</p>
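<p>Task B is scored with the Haversine distance. A sketch of the metric follows, together with a crude bounding-box stand-in for the “Beyond-Boundaries” rectification: the module’s actual implementation is not detailed here, and the box coordinates are approximate assumptions.</p>
<preformat>
```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, the evaluation metric for Task B."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate bounding box of Italy (assumed values, for illustration only).
ITALY_LAT = (35.5, 47.1)
ITALY_LON = (6.6, 18.5)

def rectify(lat, lon):
    """Clamp predicted coordinates inside the bounding box of Italy."""
    lat = min(max(lat, ITALY_LAT[0]), ITALY_LAT[1])
    lon = min(max(lon, ITALY_LON[0]), ITALY_LON[1])
    return lat, lon
```
</preformat>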
<sec id="sec-4-2">
        <title>5.1. Logistic regression insights</title>
        <p>As discussed, we used logistic regression in combination with the BERT-based solution to override low-confidence predictions. The weights learned by the logistic regression can be interpreted as the importance the model assigns to each feature’s presence (and magnitude). The features passed are either the numbers of occurrences of various words or the overall number of words known to belong to the various regions (dialects). Table 3 shows, for each region, the ten features with the largest weights. Some words may have a negative, offensive, or misogynistic connotation; we still report these results for the sake of thoroughness.</p>
        <p>Table 3 — top-ten logistic regression features per region:
Abruzzo: fregna, diaco, st, statt, &lt;# tokens molise&gt;, &lt;# tokens puglia&gt;, asin, cussu, abruzzo, ju.
Basilicata: fandom, mezzarella, accusci, ah, pazz, aggia, cazz, hahahaha, cazzu, trmon.
Calabria: &lt;# tokens calabria&gt;, calabria, aru, nto, ccu, ciota, capu, jamu, frica, fimmina.
Campania: napoli, foss, semp, merd, statt, strunzat, ua, ata, sciem, lota.
Emilia Romagna: socmel, maial, tin, sempar, bologna, cinno, umarell, cagher, veh, soccia.
Friuli-Venezia Giulia: femo, triestin, magnar, ocio, ga, gavemo, mona, trieste, xe, orpo.
Lazio: avoja, annamo, avemo, mortacci, artra, nse, artro, aspettamo, ar, stamo.
Liguria: ou, cusci, abbelinati, porcu, rasciun, emma, pestu, semmu, zena, zeneize.
Lombardia: dighel, pirlata, pheega, nanca, danee, milano, gh, sciuri, gnaro, sciur.
Marche: en, sperem, ecche, roscio, ancona, marche, scritturebrevi, daje, diaulu, sblab2021.
Molise: pipponi, ah, buongiornoatutti, venta, fior, fatt, vientu, paes, sort, fake.
Piemonte: picio, piou, boja, piemonte, suma, speruma, fauss, nen, piciu, babaciu.
Puglia: salentu, capu, mang, isolitiignoti, munnu, mme, trimone, arret, trmon, bari.
Sardegna: ajo, macca, sardegna, &lt;# tokens sardegna&gt;, tottu, biri, tontu, nudda, sesi, itte.
Sicilia: chidda, quantu, nuddu, camurria, fici, soddi, carusi, bonu, semu, &lt;# tokens sicilia&gt;.
Toscana: guasi, nsomma, &lt;# tokens toscana&gt;, caa, siuro, boja, diaccio, oglioni, gnamo, tope.
Trentino-Alto Adige: 10, bicer, maial, tasi, ghe, sberloni, stinc, sior, tai, pu.
Umbria: pija, ch, &lt;# tokens umbria&gt;, er, porchetto, mejo, bbona, mixatino, je, umbria.
Valle d’Aosta: carbonada, buonissimo, int, piacione, nasconderti, max, bosc, devise, cher, vivre.
Veneto: varda, sboro, dixe, queo, casin, venessia, ciava, &lt;# tokens veneto&gt;, xe, veneto.</p>
        <p>In some cases (underlined in the table), actual names of regions and cities are also relevant indicators of a tweet’s origin. This is a reasonable result, as tweets made in a certain dialect are intuitively likely to mention geographic places related to the dialect itself. We note that the features containing the counts of the words that belong to the various language varieties are also sometimes considered useful indicators by the logistic regression. In most cases, the count used is the relevant one for the region of interest (for example, the number of tokens from the Venetian language varieties, &lt;# tokens veneto&gt;, is a valuable feature to detect the “Veneto” region). The only exception occurs for Abruzzo, where the presence of both token counts from Molise and Puglia is considered a helpful indicator. Given the geographic proximity of these regions, we find this result to be reasonable.</p>
        <p>Finally, some situations arise where words that are generally not characterizing for certain regions still emerge as significant ones (e.g., “hahahaha” for Basilicata, or “sblab2021” for Marche). We believe this to be an overfitting problem due to the lack of meaningful data on some of the minority regions: as such, it could be addressed by collecting additional data for those regions.</p>
      </sec>
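<p>The feature construction for the logistic regression (bag of words plus the per-region overlap counts c = {|t ∩ V_r|, r ∈ R} described in Section 3) can be sketched as follows; the vocabulary and region names are toy assumptions.</p>
<preformat>
```python
from collections import Counter

def lr_features(tokens, vocabs, word_index):
    """Feature vector for the logistic regression: token counts (bag of
    words) over a fixed word index, followed by, for each region, the
    number of distinct tweet words found in that region's vocabulary."""
    counts = Counter(tokens)
    bow = [counts[w] for w in word_index]
    overlap = [len(set(tokens).intersection(vocabs[r])) for r in sorted(vocabs)]
    return bow + overlap
```
</preformat>
<p>The learned coefficients over these features are what Table 3 reports per region.</p>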
<sec id="sec-4-3">
        <title>6. Conclusion and future work</title>
        <p>This paper presented our contributions to the GeoLingIt shared task. We addressed Task A by designing a pre-training approach that leverages data augmentation and contrastive learning, surpassing the baseline and demonstrating the effectiveness of our approach in region prediction. For Task B, we introduced a joint multi-task learning approach that outperformed the baseline and incorporated a post-processing rectification module, resulting in precise and geographically accurate location predictions. Our methods not only achieved state-of-the-art performance, allowing us to be placed first for Task B, but also provided some model insights into the rich linguistic landscape of Italy.</p>
        <p>Future work could delve into fine-grained dialect classification. This involves developing models capable of identifying specific dialects or regional varieties within a given region, which would provide a more nuanced understanding of language variation in Italy and enable more targeted analyses of sociolinguistic phenomena.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Acknowledgments</title>
        <p>This study was carried out within the FAIR - Future Artificial Intelligence Research and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1555 11/10/2022, PE00000013), and the grant “National Centre for HPC, Big Data and Quantum Computing”, CN000013 (approved under the M42C Call for Proposals - Investment 1.4 - Notice “Centri Nazionali” - D.D. No. 3138, 16.12.2021, admitted for funding by MUR Decree No. 1031, 17.06.2022), as a part of the MALTO (MAchine Learning @ poliTO) team, with partial support from the SmartData@PoliTO center on Big Data and Data Science. This manuscript reflects only the authors’ views and opinions; neither the European Union nor the European Commission can be considered responsible for them.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Maiden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Parry</surname>
          </string-name>
          , The dialects of Italy, Routledge,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          ,
          <article-title>Nlp for language varieties of italy: Challenges and the path forward</article-title>
          ,
          <source>arXiv preprint arXiv:2209.09757</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , C. Casula, GeoLingIt at EVALITA 2023:
          <article-title>Overview of the geolocation of linguistic variation in Italy task</article-title>
          ,
          <source>in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , G. Venturi,
          <year>Evalita 2023</year>
          :
          <article-title>Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , C. Casula,
          <article-title>DiatopIt: A corpus of social media posts for the study of diatopic language variation in Italy</article-title>
          ,
          <source>in: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial</source>
          <year>2023</year>
          ),
          <article-title>Association for Computational Linguistics</article-title>
          , Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>199</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .vardial-
          <volume>1</volume>
          .
          <fpage>19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Scherrer</surname>
          </string-name>
          ,
          <article-title>Natural language processing for similar languages, varieties, and dialects: A survey</article-title>
          ,
          <source>Natural Language Engineering</source>
          <volume>26</volume>
          (
          <year>2020</year>
          )
          <fpage>595</fpage>
          -
          <lpage>612</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Messaoudi</surname>
          </string-name>
          ,
          <article-title>Language recognition for dialects and closely related languages</article-title>
          .,
          <source>in: Odyssey</source>
          , volume
          <year>2016</year>
          ,
          <year>2016</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Elfeky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <article-title>Soto, Multi-dialectical languages efect on speech recognition: Too much choice can hurt</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>128</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Grieve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Montgomery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Murakami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Mapping lexical dialect variation in British English using Twitter</article-title>
          ,
          <source>Frontiers in Artificial Intelligence</source>
          <volume>2</volume>
          (
          <year>2019</year>
          )
          <fpage>11</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farzindar</surname>
          </string-name>
          ,
          <article-title>Automatic identification of Arabic language varieties and dialects in social media</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>Cross-lingual propagation of sentiment information based on bilingual vector space alignment</article-title>
          ,
          <source>in: EDBT/ICDT Workshops</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Colomba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>ITALIC: An Italian intent classification dataset</article-title>
          ,
          <source>arXiv preprint arXiv:2306.08502</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Geolocation prediction in social media data by finding location indicative words</article-title>
          ,
          <source>in: Proceedings of COLING 2012</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1045</fpage>
          -
          <lpage>1062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Discovering sociolinguistic associations with structured sparsity</article-title>
          ,
          <source>in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1365</fpage>
          -
          <lpage>1374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivatsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>When Twitter meets Foursquare: tweet location prediction using Foursquare</article-title>
          ,
          <source>in: 11th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Twitter user geolocation using a unified text and network prediction model</article-title>
          ,
          <source>arXiv preprint arXiv:1506.08259</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-J.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <article-title>Contrastive learning with stronger augmentations</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Passonneau</surname>
          </string-name>
          ,
          <article-title>Contrastive data and learning for natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>Transformer-based non-verbal emotion recognition: Exploring model portability across speakers' genders</article-title>
          ,
          <source>in: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schweter</surname>
          </string-name>
          ,
          <source>Italian BERT and ELECTRA models</source>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arivazhagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Language-agnostic BERT sentence embedding</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>878</fpage>
          -
          <lpage>891</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <article-title>BART-IT: An efficient sequence-to-sequence model for Italian text summarization</article-title>
          ,
          <source>Future Internet</source>
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>15</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>