<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Basque: Data Creation, Evaluation and Systems for Dialect and Register Awareness in Natural Language Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaione Bengoetxea</string-name>
          <email>jaione.bengoetxea@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Doctoral Symposium on Natural Language Processing</institution>
          ,
          <addr-line>25</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HiTZ Basque Center for Language Technology - Ixa, University of the Basque Country UPV/EHU</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Works regarding language variation in Natural Language Processing (NLP) are scarce and often focus on languages for which data is easily available, such as English. As a result, no works have dealt with the automatic processing of linguistic variability in Basque, and the few works that exist on variability have focused on linguistic theory or the theoretical characterization of features. In this context, the objective of this thesis will be to explore the effect of language variability via automatic techniques using Large Language Models (LLMs) in Basque. In order to do so, we will generate the first datasets with manually annotated linguistic variability in Basque, data that will serve as a base for the development of Basque variability-aware NLP systems. Prior to this, a careful linguistic analysis of Basque language variability will be carried out. Additionally, to overcome the limitations of data scarcity, data collection and augmentation methods will be developed, thus reducing the time, effort and cost that collecting and generating data involves. Finally, specific NLP tasks such as Question-Answering (QA) and Natural Language Inference (NLI) will be evaluated in terms of variability awareness, and the improvement of those tasks when variation is present will be investigated with the aim of developing robust variability-aware NLP systems in Basque.</p>
      </abstract>
      <kwd-group>
        <kwd>Variation</kwd>
        <kwd>Basque</kwd>
        <kwd>evaluation</kwd>
        <kwd>low resource</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Examples of different linguistic variation</title>
      <sec id="sec-1-1">
        <title>Register</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Formal</title>
      <p>To socialize with other teenagers is of utmost importance during the adolescent years.</p>
    </sec>
    <sec id="sec-3">
      <title>Informal</title>
    </sec>
    <sec id="sec-4">
      <title>Hanging out with friends their age is super important for teenagers.</title>
      <sec id="sec-4-1">
        <title>Geographical</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Standard</title>
    </sec>
    <sec id="sec-6">
      <title>She often spends her own money.</title>
    </sec>
    <sec id="sec-7">
      <title>Dialectal</title>
    </sec>
    <sec id="sec-8">
      <title>She be spendin’ her own money.</title>
      <p>This language variability creates difficulties in several Natural Language Processing (NLP) tasks,
such as Question-Answering (QA), Natural Language Inference (NLI) or dialogue
systems. For instance, Figure 1 illustrates the difficulty Claude has in understanding a simple question
in a Basque dialect out of context. Consequently, the automatic processing of language
variability has gained considerable interest in recent years, as demonstrated by the success of workshops
such as VarDial, which in 2025 celebrated its 12th edition.</p>
      <p>CEUR Workshop ISSN 1613-0073</p>
      <p>
        Nevertheless, the majority of current research has focused on a limited list of languages (e.g., Arabic,
German, English), thus leaving the vast majority of languages unexplored. We consider expanding
this field of research to a more diverse set of languages to be of utmost importance, as providing
variability-aware NLP tools is an essential step towards building more equitable NLP systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This
will ultimately provide more accessible resources for every user, regardless of their social, educational or
communication background.
      </p>
      <p>
        Nowadays, NLP relies on Large Language Models (LLMs), which are usually trained on standard
varieties of language and need large amounts of data to obtain an acceptable performance. Consequently,
due to the lack of language variability data in the training process of LLMs, they considerably struggle
to analyze non-standard texts and their performance in tasks such as NLI and QA drops when language
variation is present [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Therefore, the aim of this thesis is to contribute to this field of growing interest by exploring language
variability in Basque. We consider Basque an interesting and challenging language to work on due
to its low-resource nature, as well as its high linguistic variation between its dialects. The thesis will
be carried out in the HiTZ research group, under the Language Analysis and Processing (UPV/EHU)
doctoral program and directed by Rodrigo Agerri and Itziar Gonzalez-Dios.</p>
      <sec id="sec-8-1">
        <title>2. Background and Related Work</title>
        <p>
          In recent years, there has been an increasing interest in language variability in NLP. This section
introduces the works that have been presented in the fields of dialect and register processing.
2.1. Geographic Variation
Regarding dialects, research has been conducted in several tasks such as dialect identification [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ],
sentiment analysis [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], Machine Translation (MT) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] or dialogue systems [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. For instance, Aepli and
Sennrich [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] explored cross-lingual transfer between closely related varieties by adding character-level
noise to high-resource data to improve generalization. Moreover, Ramponi and Casula [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] pretrained
LLMs for geographic variation of Italian tweets. Finally, Demszky et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] showed that BERT models
trained on annotated corpora obtained high accuracy for Indian English feature detection.
        </p>
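        <p>To make the idea of character-level noise concrete, the following is a minimal sketch of a generic noise-injection routine. It is an illustration only and does not reproduce the exact procedure of the cited work; the function name, the three edit operations, the noise rate and the Latin alphabet are all assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def add_char_noise(text, rate=0.1, seed=0):
    """Perturb letters by deletion, duplication, or substitution.

    Hypothetical helper: `rate` is the probability that a given letter
    is perturbed; the three operations are an assumption of this sketch.
    """
    rng = random.Random(seed)  # seeded so the noise is reproducible
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        # Leave spaces, digits, and punctuation untouched.
        if not ch.isalpha() or rng.random() >= rate:
            out.append(ch)
            continue
        op = rng.choice(("delete", "duplicate", "substitute"))
        if op == "duplicate":
            out.append(ch + ch)
        elif op == "substitute":
            out.append(rng.choice(alphabet))
        # "delete" appends nothing, which drops the character.
    return "".join(out)


if __name__ == "__main__":
    sentence = "She often spends her own money."
    print(add_char_noise(sentence, rate=0.2, seed=13))
```
]]></preformat>
        <p>Applying such a function to standard high-resource text yields perturbed copies that can be mixed into training data, under the assumption that surface noise loosely mimics spelling differences between related varieties.</p>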
        <p>
          One of the primary limitations of these studies is the scarcity of available dialectal data. Therefore,
research has largely focused on developing resources such as lexicons and dialectal datasets: Artemova
and Plank [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] propose a bilingual lexicon induction method for German dialects using LLMs, while
Hassan et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] introduce a synthetic data creation method through embeddings by transforming
input data into its dialectic variant.
        </p>
        <p>
          The lack of comprehensive dialectal data has led to research on linguistic variation being limited to
certain languages. The Arabic dialect family, due to its relative data availability, has received the most
attention, followed by languages such as Indic languages, Chinese and German. For more information
on dialectal research in NLP, Joshi et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] provide a comprehensive survey of the latest works, and
Faisal et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] establish an extensive variability benchmark for several languages.
2.2. Registers
Regarding language variation in terms of register, two main tasks have been researched: style transfer
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and register classification [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The most relevant for this project is style transfer, which involves
converting a sentence from one register to another. Rao and Tetreault [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] studied the transformation
from informal to formal English, finding that Neural Machine Translation (NMT) achieved the highest
formality, while their rule-based approach best preserved meaning. Briakou et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] introduced
XFORMAL, a multilingual dataset with formal sentences derived from informal ones in Brazilian
Portuguese, French and Italian, highlighting that there is still potential for improvement in multilingual
style transfer. However, the majority of the works have been carried out in English.
2.3. Language Variation in Basque
In Basque dialectology, Zuazu [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] established an extensive and comprehensive descriptive
representation of features of modern Basque dialects. In NLP, Estarrona et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] worked on a morpho-syntactically
annotated corpus of Basque historical texts as an aid in the normalization process. Moreover, Uria
and Etxepare [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] introduced a corpus of syntactic variation in northern Basque dialects. Additionally,
some dialectal benchmark works have included Basque in their experimentation, where they presented
benchmarks for MT with northern Basque dialects [
          <xref ref-type="bibr" rid="ref11 ref18">18, 11</xref>
          ]. However, no work has yet dealt with
southern Basque dialects in NLP. Following Zuazu [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]’s work, southern Basque dialects would be
Western (traditionally linked to the province of Biscay), Central (traditionally linked to the province of
Gipuzkoa) and Navarrese.
        </p>
        <p>
          Regarding Basque registers, some linguistic theory research on registers involves studies on academic
Basque [
          <xref ref-type="bibr" rid="ref19">19, 20</xref>
          ], as well as an informal form of Basque called ‘hika’ [21]. The closest work to NLP is
Alonso-Ramos and Zabala [22], who extracted academic vocabulary lists to create a writing aid tool.
Thus, to the best of our knowledge, no previous work has been done involving register processing and
understanding in the field of NLP.
        </p>
      </sec>
      <sec id="sec-8-3">
        <title>3. Description of the Proposed Research, Including the Main Hypotheses for Research</title>
        <p>This research will explore the effect of language variation on the performance of LLMs in Basque, as well
as examine methods to improve their behavior when linguistic variability is present. Based on a linguistic
study, performance under language variability will be evaluated in NLP tasks such as Natural Language
Inference (NLI) and Question Answering (QA), and data collection methods will be explored and applied
to improve robustness to language variability in these tasks.</p>
        <p>Our main hypothesis is that LLMs will struggle to perform certain tasks when using linguistically
diverse data as input, especially given the inherently high variability of Basque. Therefore, our first
objective will be to perform a thorough evaluation of current NLP resources. This is a novel and
ambitious line of research for several reasons: (i) currently, there is no available dataset that covers
language variability in Basque regarding southern dialects or different registers; (ii) although some
aspects of language variability can be generalized, many other aspects are language
specific, so data collection methods as well as system building approaches will need to be adapted to
Basque, which represents a considerable scientific challenge; and (iii) there is no NLP system that extensively
supports linguistic variability in Basque.</p>
        <p>
          Achieving the main objective of the thesis will have two main benefits. First of all, the results of this
thesis will contribute to the understanding of the underlying linguistic mechanisms inherent to dialectal
and register variation in Basque. Secondly, the development of variability-aware NLP systems will
bring benefits to several other fields such as QA, NLI, discourse and dialogue systems, summarization
or MT, consequently bringing the field of Language Technology closer to more accessible and equitable
NLP tools. In order to achieve this main objective, the following intermediate tasks have been outlined:
Task 1: Analysis of variation. A linguistic variation analysis, both in terms of dialects and registers, is
essential. More specifically, the adaptation of Zuazu [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]’s work on dialectal Basque features to language
technology tools is imperative for the automatic processing of geographical variation. Additionally,
establishing well-distinguished register boundaries for Basque is necessary for the automatic processing
of sentences with different levels of formality.
        </p>
        <p>Task 2: Data collection. Conducting a data collection process for low-resource environments specific
to the linguistic variation of Basque will be essential, thus obtaining the first linguistic variability dataset
in Basque. First, a search and collection of publicly available data will be carried out. This process
will be complemented with some experimentation based on paraphrasing and MT approaches, such
as rule-based permutations, lexical normalization or style transfer methods. Additionally, manual
adaptation of datasets into linguistically diverse text will also be explored to obtain gold label quality
data.</p>
        <p>Task 3: Data augmentation with generative language models. An investigation of different
techniques to take advantage of language models for text generation, such as monolingual (Latxa [23])
and multilingual (Bloom [24]; GPT-4 [25]) LLMs, will be conducted, thus facilitating the generation of
synthetic data in language variability tasks in Basque. This will serve both as a data transformation and
as a data augmentation step, which will be particularly relevant for low-resource environments,
as current Deep Learning and Neural Network approaches often demand considerable amounts of data.</p>
        <p>Task 4: Assessment. The performance of monolingual and multilingual state-of-the-art LLMs will
be evaluated in different NLP tasks when language variability is present. With this assessment, the
shortcomings and limitations of LLMs will be identified and analyzed.</p>
        <p>Task 5: Development and evaluation of variability-aware LLMs. The assessed LLMs will be
adapted, thus improving their performance based on the analysis carried out in Task 4 and the data
collected and generated in Tasks 2 and 3. The tasks of QA and NLI will be evaluated in terms of linguistic
variability as studied in Task 1. Due to data scarcity, the foreseen methods are zero- and few-shot
techniques, which rely on no or only a few training examples for experimentation.</p>
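        <p>As an illustration of the zero- and few-shot setting mentioned above, the following minimal sketch builds an NLI prompt that contains either no demonstrations (zero-shot) or a handful of them (few-shot). The function name, the English instruction wording and the example sentences are assumptions made for this sketch; the three labels follow the usual NLI convention, and actual experiments may well use Basque prompts and a different template.</p>
        <preformat preformat-type="code"><![CDATA[
```python
LABELS = ("entailment", "neutral", "contradiction")


def build_nli_prompt(premise, hypothesis, demonstrations=()):
    """Return an NLI prompt; with no demonstrations it is zero-shot.

    `demonstrations` is an iterable of (premise, hypothesis, label)
    triples. The template wording is a hypothetical choice.
    """
    lines = [
        "Decide whether the hypothesis is an entailment, neutral, "
        "or a contradiction with respect to the premise.",
        "",
    ]
    for demo_premise, demo_hypothesis, demo_label in demonstrations:
        if demo_label not in LABELS:
            raise ValueError(f"unknown label: {demo_label}")
        lines += [
            f"Premise: {demo_premise}",
            f"Hypothesis: {demo_hypothesis}",
            f"Label: {demo_label}",
            "",
        ]
    # The final block is left open so that the model completes the label.
    lines += [f"Premise: {premise}", f"Hypothesis: {hypothesis}", "Label:"]
    return "\n".join(lines)


if __name__ == "__main__":
    demos = [("She often spends her own money.", "She never pays.", "contradiction")]
    print(build_nli_prompt("She be spendin' her own money.",
                           "She spends money.", demos))
```
]]></preformat>
        <p>Keeping the template fixed while swapping the premise between its standard and its dialectal or register-adapted version would isolate the effect of variation from the effect of the prompt itself.</p>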
        <p>In summary, our main hypothesis is that LLMs encounter difficulties when dealing with linguistic
variation across specific tasks, especially in Basque, a language with high variability. Consequently, our
objective will be to provide novel resources (such as linguistically diverse datasets, either manually
created, collected or automatically generated), as well as to evaluate the performance of current NLP
tools. Finally, an experimentation step will be carried out to improve the understanding and performance
ability of current NLP resources when dealing with linguistically diverse data and tasks.</p>
      </sec>
      <sec id="sec-8-4">
        <title>4. Methodology and the Proposed Experiments</title>
        <p>Variability processing in NLP is currently marked by deep learning neural systems, often supported
by methods based on linguistic features. However, the scarcity of training data, particularly for
low-resource languages like Basque, poses a significant challenge in language variability processing.</p>
        <p>Thus, this project will require a thorough dialect- and register-aware data collection process, which
will cover a wide range of text types, providing us with a large scope of linguistic variability. In other
words, dialects, registers and text types are interconnected: text types are influenced by the appropriate
register for each context, while certain dialects are more suitable for specific registers. Thus, obtaining
different text types will inherently provide us with the linguistic variation that this thesis aims to study,
and will help create robust tools that can deal with different types of texts.</p>
        <p>In this context, our methodological proposal will consist of the following novelties: (i) adapting data
collection methods for Basque language variability, (ii) establishing Basque-specific evaluation criteria
for variability detection, and (iii) exploring few-shot and zero-shot approaches to reduce the need for costly
manual data collection and annotation.</p>
        <p>
          The following experiments have been proposed, which have been organized yearly:
Year 1: linguistic analysis of variation and data collection. We will start working on Task 1 by
conducting an extensive study of language variation in Basque. This will imply the analysis of dialectal
features [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], as well as the creation of a general formality typology for Basque, based on previous
domain-specific analyses [22].
        </p>
        <p>Large amounts of real and natural linguistically variable data can be found in local news articles,
oral transcriptions, or subtitles, while registers that differ from the neutral form of language could be
extracted from academic and scientific texts [22], legal texts or political speech [26, 27].</p>
        <p>In terms of data collection, a gold standard dataset will be created by manually adapting the evaluation
partition of an NLI dataset (XNLI-eu [28]) into different variations of Basque. This will allow us to perform
some baseline experiments to measure how well current NLP systems cope with linguistic variability
in this task.</p>
        <p>Additionally, some (semi-)automatic methods will be explored, such as rule-based settings, lexical
normalization, or LLM prompting in order to obtain additional silver parallel data.</p>
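        <p>A rule-based setting of this kind can be sketched as a small set of rewrite rules applied to standard sentences to obtain silver parallel pairs. The two rules below are hypothetical English toy rules, chosen only for readability and loosely echoing the introductory examples; real rules would encode the Basque dialectal and register features identified in the linguistic analysis, and the function names are likewise assumptions of this sketch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical toy rules (English, for readability only). Each rule is a
# (compiled pattern, replacement) pair applied in order.
RULES = [
    (re.compile(r"\b(\w+)ing\b"), r"\1in'"),  # "spending" becomes "spendin'"
    (re.compile(r"\bvery\b"), "super"),       # informal intensifier
]


def to_variant(sentence, rules=RULES):
    """Apply every rewrite rule in order and return the adapted sentence."""
    for pattern, replacement in rules:
        sentence = pattern.sub(replacement, sentence)
    return sentence


def make_silver_pairs(sentences, rules=RULES):
    """Return (standard, variant) pairs, keeping only sentences that changed."""
    pairs = []
    for standard in sentences:
        variant = to_variant(standard, rules)
        if variant != standard:
            pairs.append((standard, variant))
    return pairs


if __name__ == "__main__":
    corpus = ["She is spending her own money.", "She left."]
    for standard, variant in make_silver_pairs(corpus):
        print(standard, "=>", variant)
```
]]></preformat>
        <p>Purely surface-level rules such as these tend to over-generate (the first rule would also rewrite nouns ending in "ing"), which is one reason why the resulting pairs are treated as silver rather than gold data.</p>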
        <p>The experiments and objectives planned for this year are the following:
1. Theoretical framework for the analysis and processing of variability in Basque, which will establish
concise criteria to determine if a sentence has an adequate dialectal and/or formal form.
2. Development of data collection methods for low-resource environments specific for the linguistic
variation of Basque, exploring rule-based settings as well as lexical normalization approaches
and LLM prompting.
3. Provide the first manually-adapted, publicly available dataset of Basque language variation of
southern dialects.
4. Baseline evaluation of the effect of language variability in the task of NLI. Zero-shot experiments
will be conducted to analyze the effect of variability data in the fine-tuning step.</p>
        <p>5. Publication of data collection approaches for language variability in low-resource environments.</p>
        <p>Year 2: generation of language variability data for low-resource environments. The feature
typologies and variability data obtained in the previous year will be used to work on a second iteration
of Task 2, this time focusing on the generation of text containing variability. For this purpose, Task 3
will focus on the study of different methods of language generation, experimenting with the creation of
different types of linguistic variability, and evaluating the generated variation in terms of the theoretical
framework previously established.</p>
        <p>In this respect, we will work on expanding the adaptation of XNLI-eu to larger evaluation data as well
as training data. Additionally, we will work on other tasks such as QA by adapting available datasets
(e.g., BertaQA [29]) into variability by using the data adaptation methods previously explored.</p>
        <p>The experiments and objectives planned for this year are the following:
1. Development of techniques for the automatic generation of synthetic data with variability through
experiments with generative language models.
2. Variability datasets for QA and NLI tasks. The training data will be obtained through the automatic
generation of linguistic variability. The gold standard dataset (manually permuted by native
Basque linguists, experts in the corresponding variation) obtained in the previous year will be
used for evaluation.
3. Evaluation of the impact of variability in the performance of NLI and QA tasks, now with expanded
data. Results will be compared against previously established zero-shot baselines.
4. Publication of synthetic data generation results for low-resource languages, as well as evaluation
of NLI and QA tasks.</p>
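        <p>The comparison described in the list above can be summarized with a simple measure: the difference between accuracy on the standard test items and accuracy on their adapted counterparts, given that both share the same gold labels. The sketch below assumes parallel data of that kind; the function names and the toy predictions are hypothetical and do not correspond to any reported result.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def variation_gap(gold, pred_standard, pred_variant):
    """Accuracy on standard inputs minus accuracy on adapted inputs.

    A positive value means the system performs worse when variation
    is present. Assumes both prediction lists align with `gold`.
    """
    return accuracy(gold, pred_standard) - accuracy(gold, pred_variant)


if __name__ == "__main__":
    # Toy labels for illustration only, not experimental results.
    gold = ["entailment", "neutral", "contradiction", "neutral"]
    pred_standard = ["entailment", "neutral", "contradiction", "neutral"]
    pred_variant = ["entailment", "neutral", "neutral", "neutral"]
    print(variation_gap(gold, pred_standard, pred_variant))
```
]]></preformat>
        <p>Because the adapted items are parallel to the standard ones, any gap can be attributed to the variation itself rather than to differences in content between test sets.</p>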
        <p>Year 3: test and improve Basque LLM performance with linguistic variability. Work on Task 4
will continue by focusing on the expansion of the number of LLMs evaluated on the previous baseline
for QA and NLI tasks with variation, as well as improving the capacity of those LLMs to process
linguistic variability. In order to do so, Task 5’s focal point will be to experiment with the
state-of-the-art monolingual as well as multilingual LLMs available at the time of experimentation, with the aim of
improving their ability to process linguistic variability in the tasks of QA and NLI.</p>
        <p>A PhD stay at a foreign institution is also foreseen, during which methods to deal with variation in
other languages will be explored to see whether they are also applicable to Basque, or whether our analysis is applicable
to other languages. This will contribute to the understanding of language-specificity in the linguistic
variability field of NLP, while exploring a multilingual or language-agnostic approach.</p>
        <p>The experiments and objectives planned for this year are the following:
1. Evaluation of monolingual and multilingual LLMs in the tasks of variability-aware QA and NLI.
2. Development of variability-aware LLMs, improving their performance for Basque language
variability.
3. During the PhD stay, expansion of the language variation methods to other languages, thus
analyzing the level of knowledge transfer ability when it comes to language variability.
4. Publication about the development of Basque variability-aware LLMs.</p>
        <p>5. Publication on cross-lingual knowledge transfer across dialects of different languages.</p>
        <p>Year 4: final experiments and thesis write-up. A final iteration of Task 5 will be conducted, taking
the most interesting conclusions obtained from previous years and rounding off new experiments in
the first months of the year. Then the thesis will be written and its defense prepared.</p>
        <p>The experiments and objectives planned for this year are the following:
1. Completion of Task 5, thus concluding the tasks and experiments of previous years.
2. Publication of the results of the language-variability aware LLMs.</p>
        <p>3. Write-up of the PhD thesis and defense.</p>
      </sec>
      <sec id="sec-8-5">
        <title>5. Specific Issues of Research to be Discussed</title>
        <p>This thesis will work on the processing of language variation in Basque, both when it comes to dialects
as well as registers. In doing so, the following challenges will need to be addressed:
• Defining variation boundaries in Basque NLP: One of the central difficulties is determining
dialectal and register-based boundaries. How can we reliably annotate or detect linguistic
variation when boundaries are often fluid and context-dependent? This raises both linguistic and
methodological questions, especially for low-resource languages.
• Data scarcity: In NLP, no work has been done in terms of southern Basque dialects or registers.</p>
        <p>Therefore, I would like to discuss some strategies to overcome this constraint, such as data
collection methods in low-resource environments. All collected data and developed software
would be made publicly available under free licenses to support reproducibility and scientific
advancement.
• Synthetic data generation: To alleviate data scarcity, synthetic data generation via LLMs
presents a promising approach, as we can create sentence pairs by prompting models. I have
currently tested some zero-shot prompting methods, and I would additionally like to discuss methods
to produce authentic variability while avoiding overfitting or bias toward overly standardized
forms.
• Evaluation of generated data: Evaluating the linguistic quality and task-relevance of generated
data is a major challenge. I will start by manually evaluating a small sample of generated text in
order to assess its quality. However, I would like some feedback on some form of scalability, so
that I can expand this evaluation to larger amounts of highly variable data.</p>
      </sec>
      <sec id="sec-8-6">
        <title>Acknowledgments</title>
        <p>This thesis is funded by the Basque Government pre-doctoral grant (PRE_2024_1_0028).</p>
      </sec>
      <sec id="sec-8-7">
        <title>Declaration on Generative AI</title>
        <p>The author(s) have not employed any Generative AI tools.</p>
        <p>[20] G. Bereziartua Etxeberria, M. M. Boillos Pereira, Euskara, hizkuntza akademikoa: laburpenen sistematizazioa helburu, Euskera Ikerketa Aldizkaria (2022) 33–62. URL: https://euskera-ikerketa.euskaltzaindia.eus/index.php/euskera/article/view/6. doi:10.59866/eia.vi67.6.</p>
        <p>[21] B. M. Aseguinolaza, G. B. Etxeberria, Hitanoa euskal hiztunen komunitate garaikidean: molde zaharretatik ertz berrietara, Bat: Soziolinguistika aldizkaria (2022) 135–164.</p>
        <p>[22] M. Alonso-Ramos, I. Zabala, Hartaes-vas: Lexical combinations for an academic writing aid tool in Spanish and Basque, in: CEUR Workshop Proceedings, volume 3224, CEUR-WS.org, 2022, pp. 22–25.</p>
        <p>[23] J. Etxaniz, O. Sainz, N. Miguel, I. Aldabe, G. Rigau, E. Agirre, A. Ormazabal, M. Artetxe, A. Soroa, Latxa: An open language model and evaluation suite for Basque, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 14952–14972. URL: https://aclanthology.org/2024.acl-long.799/. doi:10.18653/v1/2024.acl-long.799.</p>
        <p>[24] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. G. et al., Bloom: A 176b-parameter open-access multilingual language model, 2023. URL: https://arxiv.org/abs/2211.05100. arXiv:2211.05100.</p>
        <p>[25] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. A. et al., Gpt-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.</p>
        <p>[26] J. Alkorta, M. I. Quintian, Adding the Basque parliament corpus to ParlaMint project, in: D. Fišer, M. Eskevich, J. Lenardič, F. de Jong (Eds.), Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 107–110. URL: https://aclanthology.org/2022.parlaclarin-1.15/.</p>
        <p>[27] N. Escribano, J. A. Gonzalez, J. Orbegozo-Terradillos, A. Larrondo-Ureta, S. Peña-Fernández, O. Perez-de Viñaspre, R. Agerri, BasqueParl: A bilingual corpus of Basque parliamentary transcriptions, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 3382–3390. URL: https://aclanthology.org/2022.lrec-1.361/.</p>
        <p>[28] M. Heredia, J. Etxaniz, M. Zulaika, X. Saralegi, J. Barnes, A. Soroa, XNLIeu: a dataset for cross-lingual NLI in Basque, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 4177–4188. URL: https://aclanthology.org/2024.naacl-long.234/. doi:10.18653/v1/2024.naacl-long.234.</p>
        <p>[29] J. Etxaniz, G. Azkune, A. Soroa, O. L. de Lacalle, M. Artetxe, Bertaqa: How much do language models know about local culture?, 2024. URL: https://arxiv.org/abs/2406.07302. arXiv:2406.07302.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Coseriu</surname>
          </string-name>
          ,
          <article-title>La geografía lingüística</article-title>
          , volume
          <volume>11</volume>
          ,
          <string-name>
            <surname>Universidad de la República</surname>
          </string-name>
          , Facultad de Humanidades y Ciencias,
          <year>1956</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kanojia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Haffari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dippold</surname>
          </string-name>
          ,
          <article-title>Natural language processing for dialects of a language: A survey</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>57</volume>
          (
          <year>2025</year>
          ). URL: https://doi.org/10.1145/3712060. doi:10.1145/3712060.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Casula</surname>
          </string-name>
          ,
          <article-title>DiatopIt: A corpus of social media posts for the study of diatopic language variation in Italy</article-title>
          , in: Y. Scherrer,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jauhiainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          , M. Zampieri (Eds.),
          <source>Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>199</lpage>
          . URL: https://aclanthology.org/2023.vardial-1.19/. doi:10.18653/v1/2023.vardial-1.19.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ball-Burack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Differential tweetment: Mitigating racial dialect bias in harmful tweet detection</article-title>
          , in:
          <source>Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          , FAccT '21,
          <publisher-name>Association for Computing Machinery</publisher-name>
          , New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>128</lpage>
          . URL: https://doi.org/10.1145/3442188.3445875. doi:10.1145/3442188.3445875.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kuparinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miletić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Scherrer</surname>
          </string-name>
          ,
          <article-title>Dialect-to-standard normalization: A large-scale multilingual evaluation</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>13814</fpage>
          -
          <lpage>13828</lpage>
          . URL: https://aclanthology.org/2023.findings-emnlp.923/. doi:10.18653/v1/2023.findings-emnlp.923.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Alshareef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          ,
          <article-title>A seq2seq neural network based conversational agent for Gulf Arabic dialect</article-title>
          , in:
          <source>2020 21st International Arab Conference on Information Technology (ACIT)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . doi:10.1109/ACIT50332.2020.9300059.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Aepli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <article-title>Improving zero-shot cross-lingual transfer between closely related languages by injecting character-level noise</article-title>
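          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Muresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Villavicencio</surname>
          </string-name>
          (Eds.),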
          <source>Findings of the Association for Computational Linguistics: ACL 2022</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>4074</fpage>
          -
          <lpage>4083</lpage>
          . URL: https://aclanthology.org/2022.findings-acl.321/. doi:10.18653/v1/2022.findings-acl.321.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Demszky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <article-title>Learning to recognize dialect features</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2315</fpage>
          -
          <lpage>2338</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.184/. doi:10.18653/v1/2021.naacl-main.184.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <article-title>Low-resource bilingual dialect lexicon induction with large language models</article-title>
          , in: T. Alumäe, M. Fishel (Eds.),
          <source>Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)</source>
          , University of Tartu Library, Tórshavn, Faroe Islands,
          <year>2023</year>
          , pp.
          <fpage>371</fpage>
          -
          <lpage>385</lpage>
          . URL: https://aclanthology.org/2023.nodalida-1.39/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elaraby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Tawfik</surname>
          </string-name>
          ,
          <article-title>Synthetic data for neural machine translation of spoken dialects</article-title>
          , in: S. Sakti, M. Utiyama (Eds.),
          <source>Proceedings of the 14th International Conference on Spoken Language Translation</source>
          ,
          <publisher-name>International Workshop on Spoken Language Translation</publisher-name>
          , Tokyo, Japan,
          <year>2017</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>89</lpage>
          . URL: https://aclanthology.org/2017.iwslt-1.12/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Faisal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsvetkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anastasopoulos</surname>
          </string-name>
          ,
          <article-title>Dialectbench: A NLP benchmark for dialects, varieties, and closely-related languages</article-title>
          ,
          <source>ArXiv</source>
          abs/2403.11009 (
          <year>2024</year>
          ). URL: https://api.semanticscholar.org/CorpusID:268513057.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <article-title>Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stent</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers),
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>140</lpage>
          . URL: https://aclanthology.org/N18-1012/. doi:10.18653/v1/N18-1012.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Eder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Krieg-Holz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <article-title>A question of style: A dataset for analyzing formality on different levels</article-title>
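          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          (Eds.),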
          <source>Findings of the Association for Computational Linguistics: EACL 2023</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>580</fpage>
          -
          <lpage>593</lpage>
          . URL: https://aclanthology.org/2023.findings-eacl.42/. doi:10.18653/v1/2023.findings-eacl.42.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Briakou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <article-title>Olá, bonjour, salve! XFORMAL: A benchmark for multilingual formality style transfer</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>3199</fpage>
          -
          <lpage>3216</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.256/. doi:10.18653/v1/2021.naacl-main.256.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zuazu</surname>
          </string-name>
          , Euskalkiak. Euskararen dialektoak, Elkar,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Estarrona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Etxeberria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Etxepare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Padilla-Moyano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soraluze</surname>
          </string-name>
          ,
          <article-title>Dealing with dialectal variation in the construction of the Basque historical corpus</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Scherrer</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects</source>
          ,
          <publisher-name>International Committee on Computational Linguistics (ICCL)</publisher-name>
          , Barcelona, Spain (Online)
          ,
          <year>2020</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>89</lpage>
          . URL: https://aclanthology.org/2020.vardial-1.8/.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Uria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Etxepare</surname>
          </string-name>
          ,
          <article-title>Hizkeren arteko aldakortasun sintaktikoa aztertzeko metodologiaren nondik norakoak: Basyque aplikazioa</article-title>
          ,
          <source>Lapurdum. Euskal ikerketen aldizkaria | Revue d'études basques | Revista de estudios vascos | Basque studies review</source>
          (
          <year>2012</year>
          )
          <fpage>117</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M. M. I.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anastasopoulos</surname>
          </string-name>
          ,
          <article-title>CODET: A benchmark for contrastive dialectal evaluation of machine translation</article-title>
          , in: Y. Graham, M. Purver (Eds.),
          <source>Findings of the Association for Computational Linguistics: EACL 2024</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , St. Julian's, Malta
          ,
          <year>2024</year>
          , pp.
          <fpage>1790</fpage>
          -
          <lpage>1859</lpage>
          . URL: https://aclanthology.org/2024.findings-eacl.125/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I. Z.</given-names>
            <surname>Unzalu</surname>
          </string-name>
          ,
          <article-title>The elaboration of Basque in academic and professional domains</article-title>
          , in: L. Grenoble, P. Lane,
          <string-name>
            <given-names>U.</given-names>
            <surname>Royneland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. O.</given-names>
            <surname>Murchadha</surname>
          </string-name>
          (Eds.), Linguistic Minorities in Europe Online, De Gruyter Mouton, Berlin, Boston,
          <year>2019</year>
          . URL: https://doi.org/10.1515/lme.9612443,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>