<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Seminar of the Spanish Society for Natural
Language Processing: Projects and System Demonstrations, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DeepKnowledge: Deep Multilingual Language Model Technology for Language Understanding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rodrigo Agerri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eneko Agirre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gorka Azkune</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Centeno</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anselmo Peñas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>German Rigau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Álvaro Rodrigo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aitor Soroa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
          ,
          <addr-line>Donostia-San Sebastián</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NLP &amp; IR Group</institution>
          ,
          <addr-line>UNED</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>As language is the most efficient system for exchanging information, Natural Language Processing (NLP) is one of the most important technologies of the current digital transformation. In recent years, the NLP community has contributed to the emergence of powerful new deep learning techniques and tools that are revolutionizing the approach to Language Technology (LT) tasks. NLP is moving from a methodology in which a pipeline of multiple modules was the typical way to implement NLP solutions, to architectures based on complex neural networks trained with vast amounts of text data. Thanks to these recent advancements, the NLP community is currently engaged in a paradigm shift with the production and exploitation of large, pre-trained transformer-based language models. Compared to previous work, results are improving so much that systems claim to obtain human-level performance on laboratory benchmarks when tested on some difficult language understanding tasks. Despite their impressive capabilities, large pre-trained language models do come with severe drawbacks. Currently we have no clear understanding of how they work, when they fail, or which novel ways of exploiting these models can help to improve the state of the art in NLP. It is important to understand the limitations of large pre-trained language models. DeepKnowledge will investigate the pre-training of large language models for the official languages of Spain, together with novel techniques to extract more precise and generalizable knowledge from them.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Transfer learning</kwd>
        <kwd>Language Models</kwd>
        <kwd>Text Generation</kwd>
        <kwd>Multitask Learning</kwd>
        <kwd>Few-shot learning</kwd>
        <kwd>Multimodality</kwd>
        <kwd>Multilingualism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the NLP community has contributed to the emergence of powerful new deep learning approaches using Transformers [1, 2, 3, 4].</p>
      <p>Thanks to these recent advancements, the NLP community is currently engaged in a paradigm shift with the production and exploitation of large, pre-trained transformer-based language models [1, 3]. As a result, many in the industry have started deploying large pre-trained neural language models in production. For instance, Google and Microsoft have integrated them in their search engines, their flagship product. Compared to previous work, results are improving so much that systems claim to obtain human-level performance on laboratory benchmarks when tested on some difficult language understanding tasks.</p>
      <p>Furthermore, recent work has shown that pre-trained language models can robustly perform NLP tasks in a few-shot or even a zero-shot fashion when given an adequate task description in their natural language prompt [2, 5]. Surprisingly, fine-tuning pre-trained language models on a collection of tasks described via instructions (or prompts) substantially boosts zero-shot performance on unseen tasks [6, 4].</p>
      <p>Despite their impressive capabilities, large pre-trained language models do come with severe drawbacks. Currently we have no clear understanding of how they work, when they fail, what emergent properties they may present, or which novel ways of exploiting these models can help to improve the state of the art in NLP. As argued by Bender et al. [7], it is important to understand the limitations of large pre-trained language models, which some have called “stochastic parrots”. To tackle these questions, much critical multidisciplinary collaboration and research is needed.</p>
      <p>DeepKnowledge will extend the state of the art in natural language processing (NLP) and multilingual knowledge-enabling technologies in seven interrelated areas of high potential impact. The main research objective of DeepKnowledge consists in advancing the state of the art towards NLU by (i) generating and exploiting new language models for the official languages of Spain plus English, taking into account a multitask and multimodal objective during pre-training; (ii) exploring novel ways, such as prompting, of exploiting these language models to improve NLP results in zero-shot and few-shot settings (with no or very little training data for the target language or task at hand); (iii) addressing language understanding tasks by text generation; (iv) leveraging pre-trained language models and building knowledge bases from scratch; (v) developing new benchmarks and datasets for evaluating and assessing our progress towards Natural Language Understanding; (vi) applying the newly developed techniques to improve the state of the art in language understanding, especially for settings with few or no training data; and (vii) developing a number of advanced content-based domain applications for the main official languages in Spain (including Spanish, Catalan, Basque, and Galician) and English, in multiple sectors and domains (such as eLearning, eHealth, eHumanities, etc.).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Currently, the NLP field is undergoing a paradigm shift with the rise of Large Language Models (also known as Pre-trained Language Models), which are trained on broad data at scale and are adaptable to a wide range of monolingual and multilingual downstream tasks [1, 2]. Though these models are based on standard self-supervised deep learning and transfer learning, their scale results in new emergent and surprising capabilities.</p>
      <p>In self-supervised learning, the language model is derived automatically from large volumes of unannotated language data. There has been considerable progress in self-supervised learning since word embeddings [8], which associated words with context-independent vectors. Shortly thereafter, self-supervised learning based on autoregressive language modelling (predicting the next word given the previous words) became popular [9]. The next wave of developments in self-supervised learning (BERT [1], GPT-3 [2], RoBERTa [10], T5 [6], among others) quickly followed, embracing the Transformer architecture [11], incorporating more powerful deep bidirectional encoders of sentences, and scaling up to larger models and datasets.</p>
      <p>The idea of transfer learning is to take the knowledge learned from one task (e.g., predicting the next word given the previous words) and apply it to another task (e.g., summarization). With transfer learning, instead of starting the learning process from scratch, you start from patterns that have been learned when solving a different problem. This way you leverage previous learning and avoid starting from scratch. Within deep learning, pre-training is the dominant approach to transfer learning: the objective is to pre-train a deep transformer model on large amounts of data and then reuse this pre-trained language model by fine-tuning it on small amounts of (usually annotated) task-specific data. Thus, transfer learning formalizes a two-phase learning framework: a pre-training phase to capture knowledge from one or more source tasks, and a fine-tuning stage to transfer the captured knowledge to many target tasks.</p>
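      <p>To make the two-phase framework concrete, the following minimal sketch fine-tunes a pre-trained encoder on a small annotated classification set. It is an illustration only, not code from the project: it assumes the Hugging Face transformers and datasets libraries, and the xlm-roberta-base checkpoint and the toy dataset are stand-ins for the project's own models and data.</p>
      <preformat>
# Minimal sketch of the pre-train / fine-tune transfer learning framework:
# a model pre-trained with self-supervision is reused and fine-tuned on a
# small amount of annotated, task-specific data (here, binary classification).
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

checkpoint = "xlm-roberta-base"  # any pre-trained encoder could be used here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A toy annotated dataset standing in for real task-specific training data.
train_data = Dataset.from_dict({
    "text": ["great product", "terrible service", "works as expected", "broke in a day"],
    "label": [1, 0, 1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=train_data,
)
trainer.train()  # fine-tuning phase: transfers the pre-trained knowledge to the target task
      </preformat>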
      <sec id="sec-2-1">
        <title>2.1. Few-shot Learning</title>
        <p>Recent work has shown that pre-trained language models can robustly perform classification tasks in a few-shot or even a zero-shot fashion when given an adequate task description in their natural language prompt [2]. Unlike traditional supervised learning, which trains a model to take in an input and predict an output, prompt-based learning is based on exploiting pre-trained language models to solve a task using text directly [5]. To use these models to perform prediction tasks, the original input is modified using a template into a textual string prompt that has some missing slots, and then the language model is used to probabilistically fill in the missing information to obtain a final string, from which the final output for the task can be derived. This framework looks very promising for a number of reasons: it allows the language model to be pre-trained on massive amounts of raw text, and by defining a new prompting function the model is able to perform few-shot or even zero-shot learning, adapting to new scenarios, languages and domains with few or no labeled data. Thus, some NLP tasks can be solved in a fully unsupervised fashion by providing a pre-trained language model with task descriptions in natural language [6]. Surprisingly, fine-tuning pre-trained language models on a collection of tasks described via instructions (or prompts) substantially boosts zero-shot performance on unseen tasks [6, 2, 4].</p>
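        <p>As an illustration of the prompting templates described above (again a sketch, not project code), a cloze-style prompt can turn a masked language model into a zero-shot classifier; the bert-base-multilingual-cased checkpoint and the verbalizer words good/bad are arbitrary choices made for this example.</p>
        <preformat>
# Zero-shot classification with a cloze prompt: the input is wrapped in a
# template with a missing slot, the masked LM fills the slot, and the chosen
# verbalizer word is mapped back to a task label.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def classify(review):
    prompt = f"{review}. Overall the product was [MASK]."
    candidates = fill_mask(prompt, targets=["good", "bad"])  # restrict to verbalizer words
    best = max(candidates, key=lambda c: c["score"])
    return "positive" if best["token_str"] == "good" else "negative"

print(classify("The battery lasts for days and the screen is gorgeous"))
        </preformat>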
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multilingual Language Models</title>
        <p>Multilingual Language Models (MLLMs) such as mBERT [1], XLM-RoBERTa [12], mT5 [13], etc. have emerged as a viable option for bringing the power of pre-training to a large number of languages. For example, mBERT is pre-trained with the Multilingual Masked Language Modeling (MMLM) task using non-parallel multilingual Wikipedia corpora in 104 languages. mBERT has the ability to generalize cross-lingual knowledge in zero-shot scenarios. This indicates that, even with the same structure as BERT, using multilingual data can enable the model to learn cross-lingual representations. An MLLM is pre-trained using large amounts of unlabeled data from multiple languages, with the hope that low-resource languages may benefit from high-resource languages due to a shared vocabulary and latent language properties. The surprisingly good performance of MLLMs in cross-lingual transfer as well as bilingual tasks motivates the hypothesis that MLLMs are learning universal patterns [14, 15]. Thus, of particular interest is the ability of MLLMs to facilitate zero-shot cross-lingual transfer from a resource-rich language to a resource-deprived language which does not have any task-specific training data, or to fine-tune more robust language models by using annotated training data in multiple languages.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Text Generation</title>
        <p>Natural Language Generation (NLG) has become one of the most common yet challenging tasks in NLP, and it is currently being addressed by the intense development and release of many Large Language Models (LLMs) such as the popular GPT family, Llama and Mistral models [2, 3, 4]. One of the advantages of these neural models is that they enable end-to-end learning of semantic mappings from input to output in text generation. These decoder models [2, 3, 4] are currently the standard architectures for generating high-quality text, which in turn creates a crucial need for the evaluation of the generated text. In DeepKnowledge the progress will be measured by developing new natural language understanding and generation benchmarks and tasks for Basque, Spanish and English, focusing on the truthfulness and reliability of the output generated by the LLMs. Thus, we will provide new benchmarks for popular tasks based on text generation and understanding, such as Long Answer Question Answering, Explanatory Argument Generation and Inferential tasks, for which annotated data for evaluation exists only for English. By doing so we aim at significantly improving the state of the art of AI-based Large Language Models in low-resource scenarios for languages such as Basque and Spanish, thereby contributing to the improvement of Language Technology applications and their deployment in the current digital transformation.</p>
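        <p>A minimal sketch of autoregressive generation with a decoder-only model is shown below; the gpt2 checkpoint and the sampling settings are arbitrary stand-ins for the much larger LLMs cited above, not the project's models.</p>
        <preformat>
# Autoregressive text generation with a decoder-only model: the model
# repeatedly predicts the next token given the previously generated ones.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Large language models for Basque and Spanish will"
outputs = generator(prompt, max_new_tokens=40, do_sample=True, top_p=0.9)
print(outputs[0]["generated_text"])
        </preformat>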
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Applications</title>
        <p>Current NLP technology allows many advanced applications which would have been unthinkable only a few years ago. NLP is present in our daily lives, for example, through search engines, recommendation systems, virtual assistants, chatbots, text editors, text predictors, automatic translation systems, automatic summaries, inclusive technology, etc. [16]. Its rapid development in recent years predicts even more encouraging and also exciting results in the near future [17]. Currently, our society is developing some fears towards the digital world, associated with distrust of the information that is published given the growing amount of false content. Our project aims at alleviating these problems by developing new methods and advancing the state of the art in machine reading comprehension of language and misinformation detection.</p>
        <p>In this project we target five application scenarios, namely eLearning, Question Answering and Machine Comprehension, Misinformation, Biomedical Text Analysis and Conversational Agents. In all these application areas we will apply the latest neural language model technology developed within the project.</p>
        <p>Recent progress in NLP has been driven by advances in both language model architecture and model pre-training. Transformer architectures have facilitated the building of higher-capacity language models for a wide variety of tasks. Open-source libraries such as Transformers [18] may open up these advances to a wider NLP community. The library consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pre-trained models. Unfortunately, the resources necessary to create the best-performing neural language models are found almost exclusively at US and Chinese technology giants. Moreover, this transformative technology poses problems from a research advancement, environmental, and ethical perspective.</p>
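        <p>As a sketch of the unified API mentioned above (not project code; the checkpoints named here are simply publicly available examples), the same pipeline abstraction covers very different tasks:</p>
        <preformat>
# The Transformers library exposes heterogeneous pre-trained models behind a
# single pipeline API; only the task name and checkpoint change.
from transformers import pipeline

# Extractive question answering with a multilingual model.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")
answer = qa(question="Where is the HiTZ Center located?",
            context="The HiTZ Center is located in Donostia-San Sebastian, Spain.")
print(answer["answer"])

# Summarization with a different checkpoint, same calling convention.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
print(summarizer("NLP is present in our daily lives through search engines, "
                 "virtual assistants, chatbots and automatic translation systems.",
                 max_length=20, min_length=5)[0]["summary_text"])
        </preformat>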
        <p>For example, models such as GPT-3 or GPT-4 are private,
anglo-centric, and inaccessible to academic organisations
[19]. There are also worrying shortcomings in the text
corpora used to train these models, ranging from a lack
of representation of populations, to a predominance of
harmful stereotypes, and to the inclusion of personal
information.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology and Work Plan</title>
      <sec id="sec-3-1">
        <title>3.1. Objectives DeepKnowledge-EHU</title>
        <p>DeepKnowledge will build models that are capable of dealing with text generation tasks, as well as models that are trained in a multi-task fashion, which have been shown to generalize better and to yield good results in zero-shot and few-shot scenarios. We will also work towards filling the current gap in language models for these languages in specific domains, such as Health, Education and Social media. Regarding text processing applications, the research team has ample experience developing NLP tools, both basic NLP modules [20] and advanced semantic processing tools in many languages [15, 21, 22]. Following this, we list the specific objectives for DeepKnowledge-EHU:</p>
        <p>4. To explore how large language models can productively interact with existing semantic networks and ontologies (WP4).
5. To leverage the generated language models to develop state-of-the-art, ready-to-use, deep-learning linguistic processors for many NLP tasks, such as lemmatization, NER, SRL, POS tagging or Coreference Resolution, among others (WP2).
6. To improve qualitative and quantitative evaluation of text generation-based tasks such as text simplification or argument generation, and to organize a shared task to motivate work on this topic (WP5).
7. To leverage the generated models and new techniques of exploiting them for eLearning, Question Answering, Medical Text Processing, Misinformation detection and Conversational Agents (WP6).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Objectives DeepInfo-UNED</title>
        <p>DeepInfo-UNED collaborates with two institutions: (i) the Instituto Cervantes and (ii) the President Carter Foundation (USA). One of the goals of the Instituto Cervantes is the certification of human proficiency in the use of the Spanish language. The collaboration between our project and the Instituto Cervantes is focused on: (i) creating a dataset in Spanish for the evaluation of machine reading and comprehension capabilities, which will address the lack of training and evaluation resources for languages other than English, and (ii) developing automatic assisting methods to help evaluators prepare and check the exams.</p>
        <p>The Carter Foundation acts as an international observer in elections all over the world. Traditionally, these observers were a team of people who moved physically to the country and tracked the process. However, nowadays there is also a need to monitor political activity in social networks. Taking into account these two use cases, the specific objectives of DeepInfo-UNED are defined as follows:</p>
        <p>In this context of paradigm shift within the NLP community, DeepKnowledge will aim to develop new language models (i) with a multitask and multimodal training objective and (ii) for specific domains, and (iii) to explore novel methods of exploiting such language models, such as the use of prompts or text generation, which we believe will help these pre-trained models to ground their knowledge, improving their understanding and generalization skills.</p>
        <p>The project will build new language models for the official languages in Spain as well as English. The models will be based on new technologies, architectures and training paradigms that allow a better generalization between domains and languages. We will build generative models that allow the generation of text in these languages, which is needed in tasks such as summarization, simplification or the generation of counter-arguments against misinformation. Besides, the project will also build language models adapted to the specific domains of Health, Education and Social media.</p>
        <p>WP3: Novel paradigms for the exploitation of language models. Develop novel ways to exploit the full potential of large language models, including prompting, generation and multimodal training. The objective of such exploitation paradigms is two-fold: (i) to improve the overall language understanding capabilities of language models, and (ii) to make them usable for a great variety of applications and languages with minimal preparation effort, through zero-shot and few-shot learning.</p>
        <p>WP4: Knowledge Acquisition, Integration and Reasoning. The main objective of this work package is to investigate how large language models can productively interact with existing semantic networks: on the one hand, by helping in the development of broad-coverage lexical knowledge bases such as the Multilingual Central Repository [23] in the languages covered by the project and adapted to specific domains such as medicine; on the other hand, by using these large-scale knowledge bases to generate lexical semantic, world knowledge and common sense probes for testing the abilities of modern large language models.</p>
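        <p>As a purely illustrative sketch of how a knowledge base entry might be turned into a probe for a language model (the triple, template and checkpoint below are invented for this example and are not taken from the Multilingual Central Repository):</p>
        <preformat>
# Turn a simple lexical-semantic triple into a cloze probe and check whether
# a masked language model ranks the correct filler among its top predictions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

triple = ("dog", "hypernym", "animal")          # illustrative triple, not from the MCR
probe = f"A {triple[0]} is a kind of [MASK]."   # hypothetical cloze template

predictions = fill_mask(probe, top_k=5)
hit = any(p["token_str"].strip() == triple[2] for p in predictions)
print("gold filler in top-5:", hit)
        </preformat>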
        <p>WP5: Evaluation. The objective of this work package is to measure the research progress via objective evaluation metrics and relevant open evaluation campaigns. An important component will also be investigating the evaluation of tasks based on text generation (WP3). Datasets for Machine Comprehension and Question Answering in Spanish will be generated. Furthermore, we will organize a workshop on misinformation.</p>
        <p>WP6: Applications and Use Cases. This work package aims at demonstrating the scientific advances of DeepKnowledge in different scenarios. It will include applications in eLearning, recommender systems for education and research, question answering, reading comprehension, and misinformation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Concluding Remarks</title>
      <p>This paper outlines the DeepKnowledge project, which is focused on researching and incorporating the latest insights in deep learning technology, such as large pre-trained language models, transfer learning, few-shot and zero-shot capabilities, multimodal and multi-task processing, prompting, etc. DeepKnowledge will leverage deep learning techniques, large pre-trained language models and carefully designed datasets and knowledge bases to advance the state of the art towards natural language understanding for English, Spanish, Catalan, Basque and Galician in several domains and digital sectors. DeepKnowledge will also investigate new text generation approaches for applications such as argument generation, text simplification or abstractive summarization. Additionally, DeepKnowledge will apply the new language models in novel ways for tasks and applications such as misinformation detection, Question Answering or eLearning.</p>
      <p>Ongoing work can be checked on the project's website: http://ixa2.si.ehu.eus/deepknowledge/. Future work includes further experimentation on training LLMs for low-resource languages and on the evaluation of text generation, a crucial topic for understanding the performance of our models.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of DeepKnowledge (PID2021-127777OB-C21) and DeepInfo (PID2021-127777OB-C22), projects funded by MCIN/AEI/10.13039/501100011033 and by FEDER. Rodrigo Agerri was also funded by the RYC-2017-23647 fellowship (MCIN/AEI/10.13039/501100011033 and by ESF Investing in your future).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: North American Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T. J.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blecher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cucurull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Esiobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Hartshorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kardas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kerkez</surname>
          </string-name>
          , M. Khabsa, I. M. Kloumann, A. V. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, ArXiv abs/2307.09288 (2023).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, ArXiv abs/2310.06825 (2023).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2021) 1–35.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] E. M. Bender, T. Gebru, A. McMillan-Major, M. Mitchell, On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in pre-training distributed word representations, in: Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan, 2018.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. M. Dai, Q. V. Le, Semi-supervised sequence learning, in: Advances in Neural Information Processing Systems (NeurIPS), 2015.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Annual Meeting of the Association for Computational Linguistics, 2019.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: NAACL, Association for Computational Linguistics, 2021.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 4996–5001.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. Agerri, E. Agirre, Lessons learned from the evaluation of Spanish Language Models, Procesamiento del Lenguaje Natural 70 (2023) 157–170.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, D. Roth, Recent advances in natural language processing via large pre-trained language models: A survey, ACM Computing Surveys 56 (2021) 1–40.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, et al., Pre-trained models: Past, present and future, AI Open 2 (2021) 225–250.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Q. Liu, D. Schlangen (Eds.), Proceedings of EMNLP: System Demonstrations, 2020, pp. 38–45.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. Floridi, M. Chiriatti, GPT-3: Its nature, scope, limits, and consequences, Minds and Machines 30 (2020) 681–694.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] O. Toporkov, R. Agerri, On the Role of Morphological Information for Contextual Lemmatization, Computational Linguistics (2024) 1–35. doi:10.1162/coli_a_00497.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] O. Sainz, I. García-Ferrero, R. Agerri, O. de Lacalle, G. Rigau, E. Agirre, GoLLIE: Annotation guidelines improve zero-shot information extraction, in: Twelfth International Conference on Learning Representations (ICLR 2024), 2024.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] I. García-Ferrero, R. Agerri, A. A. Salazar, E. Cabrio, I. de la Iglesia, A. Lavelli, B. Magnini, B. Molinet, J. Ramirez-Romero, G. Rigau, J. M. Villa-Gonzalez, S. Villata, A. Zaninello, Medical mT5: An Open-Source Multilingual Text-to-Text LLM for the Medical Domain, in: Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Gonzalez-Agirre, E. Laparra, G. Rigau, Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base, in: GWC 2012: 6th International Global Wordnet Conference, 2012, p. 118.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>