<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Seminar of the Spanish Society for Natural
Language Processing: Projects and System Demonstrations, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DeepKnowledge: Deep Multilingual Language Model Technology for Language Understanding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rodrigo Agerri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eneko Agirre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gorka Azkune</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Centeno</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anselmo Peñas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>German Rigau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Álvaro Rodrigo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aitor Soroa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
          ,
          <addr-line>Donostia-San Sebastián</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NLP &amp; IR Group</institution>
          ,
          <addr-line>UNED</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>As language is the most efficient system for exchanging information, Natural Language Processing (NLP) is one of the most important technologies of the current digital transformation. In recent years, the NLP community has contributed to the emergence of powerful new deep learning techniques and tools that are revolutionizing the approach to Language Technology (LT) tasks. NLP is moving from a methodology in which a pipeline of multiple modules was the typical way to implement NLP solutions, to architectures based on complex neural networks trained with vast amounts of text data. Thanks to these recent advancements, the NLP community is currently engaged in a paradigm shift with the production and exploitation of large, pre-trained transformer-based language models. Compared to previous work, results are improving so much that systems claim to obtain human-level performance on laboratory benchmarks when tested on some difficult language understanding tasks. Despite their impressive capabilities, large pre-trained language models do come with severe drawbacks. Currently we have no clear understanding of how they work, when they fail, or which novel ways of exploiting these models can help to improve the state of the art in NLP. It is important to understand the limitations of large pre-trained language models. DeepKnowledge will investigate the pre-training of large language models for the official languages of Spain, together with novel techniques to extract more precise and generalizable knowledge from them.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Transfer learning</kwd>
        <kwd>Language Models</kwd>
        <kwd>Text Generation</kwd>
        <kwd>Multitask Learning</kwd>
        <kwd>Few-shot learning</kwd>
        <kwd>Multimodality</kwd>
        <kwd>Multilingualism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the NLP community has contributed to the emergence of powerful new deep learning approaches using Transformers [1, 2, 3, 4].</p>
      <p>Thanks to these recent advancements, the NLP community is currently engaged in a paradigm shift with the production and exploitation of large, pre-trained transformer-based language models [1, 3]. As a result, many in the industry have started deploying large pre-trained neural language models in production. For instance, Google and Microsoft have integrated them in their search engines, their flagship product. Compared to previous work, results are improving so much that systems claim to obtain human-level performance on laboratory benchmarks when tested on some difficult language understanding tasks.</p>
      <p>Furthermore, recent work has shown that pre-trained language models can robustly perform NLP tasks in a few-shot or even a zero-shot fashion when given an adequate task description in their natural language prompt [2, 5]. Surprisingly, fine-tuning pre-trained language models on a collection of tasks described via instructions (or prompts) substantially boosts zero-shot performance on unseen tasks [6, 4].</p>
      <p>Despite their impressive capabilities, large pre-trained language models do come with severe drawbacks. Currently we have no clear understanding of how they work, when they fail, what emergent properties they may present, or which novel ways of exploiting these models can help to improve the state of the art in NLP. As argued by Bender et al. [7], it is important to understand the limitations of large pre-trained language models, which some have called “stochastic parrots”. To tackle these questions, much critical multidisciplinary collaboration and research is needed.</p>
      <p>DeepKnowledge will extend the state of the art in natural language processing (NLP) and multilingual knowledge-enabling technologies in seven interrelated areas of high potential impact. The main research objective of DeepKnowledge consists in advancing the state of the art towards NLU by (i) generating and exploiting new language models for the official languages of Spain plus English, taking into account a multitask and multimodal objective during pre-training; (ii) exploring novel ways, such as prompting, of exploiting these language models to improve NLP results in zero-shot and few-shot settings (with no or very little training data for the target language or task at hand); (iii) addressing language understanding tasks by text generation; (iv) leveraging pre-trained language models and building knowledge bases from scratch; (v) developing new benchmarks and datasets for evaluating and assessing our progress towards Natural Language Understanding; (vi) applying the newly developed techniques to improve the state of the art in language understanding, especially for settings with few or no training data; and (vii) developing a number of advanced content-based domain applications for the main official languages in Spain (including Spanish, Catalan, Basque, and Galician) and English, in multiple sectors and domains (such as eLearning, eHealth, eHumanities, etc.).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Currently, the NLP field is undergoing a paradigm shift with the rise of Large Language Models (also known as Pre-trained Language Models), which are trained on broad data at scale and are adaptable to a wide range of monolingual and multilingual downstream tasks [1, 2]. Though these models are based on standard self-supervised deep learning and transfer learning, their scale results in new emergent and surprising capabilities.</p>
      <p>In self-supervised learning, the language model is derived automatically from large volumes of unannotated language data. There has been considerable progress in self-supervised learning since word embeddings [8], which associated words with context-independent vectors. Shortly thereafter, self-supervised learning based on autoregressive language modelling (predicting the next word given the previous words) became popular [9]. The next wave of developments in self-supervised learning (BERT [1], GPT-3 [2], RoBERTa [10], T5 [6], among others) quickly followed, embracing the Transformer architecture [11], incorporating more powerful deep bidirectional encoders of sentences, and scaling up to larger models and datasets.</p>
      <p>The idea of transfer learning is to take the knowledge learned from one task (e.g., predicting the next word given the previous words) and apply it to another task (e.g., summarization). With transfer learning, instead of starting the learning process from scratch, you start from patterns that have been learned when solving a different problem. This way you leverage previous learning and avoid starting from scratch. Within deep learning, pre-training is the dominant approach to transfer learning: the objective is to pre-train a deep transformer model on large amounts of data and then reuse this pre-trained language model by fine-tuning it on small amounts of (usually annotated) task-specific data. Thus, transfer learning formalizes a two-phase learning framework: a pre-training phase to capture knowledge from one or more source tasks, and a fine-tuning stage to transfer the captured knowledge to many target tasks.</p>
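      <p>To make the two-phase framework concrete, the following minimal sketch fine-tunes a pre-trained encoder on a small annotated classification set. It is an illustration only, not code from the project: it assumes the Hugging Face transformers and datasets libraries, and the xlm-roberta-base checkpoint and the toy dataset are stand-ins for the project's own models and data.</p>
      <preformat>
# Minimal sketch of the pre-train / fine-tune transfer learning framework:
# a model pre-trained with self-supervision is reused and fine-tuned on a
# small amount of annotated, task-specific data (here, binary classification).
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

checkpoint = "xlm-roberta-base"  # any pre-trained encoder could be used here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A toy annotated dataset standing in for real task-specific training data.
train_data = Dataset.from_dict({
    "text": ["great product", "terrible service", "works as expected", "broke in a day"],
    "label": [1, 0, 1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=train_data,
)
trainer.train()  # fine-tuning phase: transfers the pre-trained knowledge to the target task
      </preformat>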
      <sec id="sec-2-1">
        <title>2.1. Few-shot Learning</title>
        <p>Recent work has shown that pre-trained language models can robustly perform classification tasks in a few-shot or even a zero-shot fashion when given an adequate task description in their natural language prompt [2]. Unlike traditional supervised learning, which trains a model to take in an input and predict an output, prompt-based learning is based on exploiting pre-trained language models to solve a task using text directly [5]. To use these models to perform prediction tasks, the original input is modified using a template into a textual string prompt that has some missing slots, and then the language model is used to probabilistically fill in the missing information to obtain a final string, from which the final output for the task can be derived. This framework looks very promising for a number of reasons: it allows the language model to be pre-trained on massive amounts of raw text, and by defining a new prompting function the model is able to perform few-shot or even zero-shot learning, adapting to new scenarios, languages and domains with few or no labeled data. Thus, some NLP tasks can be solved in a fully unsupervised fashion by providing a pre-trained language model with task descriptions in natural language [6]. Surprisingly, fine-tuning pre-trained language models on a collection of tasks described via instructions (or prompts) substantially boosts zero-shot performance on unseen tasks [6, 2, 4].</p>
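        <p>As an illustration of the prompting templates described above (again a sketch, not project code), a cloze-style prompt can turn a masked language model into a zero-shot classifier; the bert-base-multilingual-cased checkpoint and the verbalizer words good/bad are arbitrary choices made for this example.</p>
        <preformat>
# Zero-shot classification with a cloze prompt: the input is wrapped in a
# template with a missing slot, the masked LM fills the slot, and the chosen
# verbalizer word is mapped back to a task label.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def classify(review):
    prompt = f"{review}. Overall the product was [MASK]."
    candidates = fill_mask(prompt, targets=["good", "bad"])  # restrict to verbalizer words
    best = max(candidates, key=lambda c: c["score"])
    return "positive" if best["token_str"] == "good" else "negative"

print(classify("The battery lasts for days and the screen is gorgeous"))
        </preformat>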
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multilingual Language Models</title>
        <p>Multilingual Language Models (MLLMs) such as mBERT [1], XLM-RoBERTa [12], mT5 [13], etc. have emerged as a viable option for bringing the power of pre-training to a large number of languages. For example, mBERT is pre-trained with the Multilingual Masked Language Modeling (MMLM) task using non-parallel multilingual Wikipedia corpora in 104 languages. mBERT has the ability to generalize cross-lingual knowledge in zero-shot scenarios. This indicates that, even with the same structure as BERT, using multilingual data can enable the model to learn cross-lingual representations. An MLLM is pre-trained using large amounts of unlabeled data from multiple languages, with the hope that low-resource languages may benefit from high-resource languages due to a shared vocabulary and latent language properties. The surprisingly good performance of MLLMs in cross-lingual transfer as well as bilingual tasks motivates the hypothesis that MLLMs are learning universal patterns [14, 15]. Thus, of particular interest is the ability of MLLMs to facilitate zero-shot cross-lingual transfer from a resource-rich language to a resource-deprived language which does not have any task-specific training data, or to fine-tune more robust language models by using annotated training data in multiple languages.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Text Generation</title>
        <p>Natural Language Generation (NLG) has become one of the most common yet challenging tasks in NLP, and it is currently being addressed by the intense development and release of many Large Language Models (LLMs) such as the popular GPT family, Llama and Mistral models [2, 3, 4]. One of the advantages of these neural models is that they enable end-to-end learning of semantic mappings from input to output in text generation. These decoder models [2, 3, 4] are currently the standard architectures for generating high-quality text, which in turn creates a crucial need for the evaluation of the generated text. In DeepKnowledge the progress will be measured by developing new natural language understanding and generation benchmarks and tasks for Basque, Spanish and English, focusing on the truthfulness and reliability of the output generated by the LLMs. Thus, we will provide new benchmarks for popular tasks based on text generation and understanding, such as Long Answer Question Answering, Explanatory Argument Generation and Inferential tasks, for which annotated data for evaluation exists only for English. By doing so we aim at significantly improving the state of the art of AI-based Large Language Models in low-resource scenarios for languages such as Basque and Spanish, thereby contributing to the improvement of Language Technology applications and their deployment in the current digital transformation.</p>
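        <p>A minimal sketch of autoregressive generation with a decoder-only model is shown below; the gpt2 checkpoint and the sampling settings are arbitrary stand-ins for the much larger LLMs cited above, not the project's models.</p>
        <preformat>
# Autoregressive text generation with a decoder-only model: the model
# repeatedly predicts the next token given the previously generated ones.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Large language models for Basque and Spanish will"
outputs = generator(prompt, max_new_tokens=40, do_sample=True, top_p=0.9)
print(outputs[0]["generated_text"])
        </preformat>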
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Applications</title>
        <p>Current NLP technology allows many advanced applications which would have been unthinkable only a few years ago. NLP is present in our daily lives, for example, through search engines, recommendation systems, virtual assistants, chatbots, text editors, text predictors, automatic translation systems, automatic summaries, inclusive technology, etc. [16]. Its rapid development in recent years predicts even more encouraging and also exciting results in the near future [17]. Currently, our society is developing some fears towards the digital world, associated with distrust of the information that is published given the growing amount of false content. Our project aims at alleviating these problems by developing new methods and advancing the state of the art in machine reading comprehension of language and misinformation detection.</p>
        <p>In this project we target five application scenarios, namely eLearning, Question Answering and Machine Comprehension, Misinformation, Biomedical Text Analysis and Conversational Agents. In all these application areas we will apply the latest neural language model technology developed within the project.</p>
        <p>Recent progress in NLP has been driven by advances in both language model architecture and model pre-training. Transformer architectures have facilitated the building of higher-capacity language models for a wide variety of tasks. Open-source libraries such as Transformers [18] may open up these advances to a wider NLP community. The library consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pre-trained models. Unfortunately, the resources necessary to create the best-performing neural language models are found almost exclusively at US and Chinese technology giants. Moreover, this transformative technology poses problems from a research advancement, environmental, and ethical perspective.</p>
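        <p>As a sketch of the unified API mentioned above (not project code; the checkpoints named here are simply publicly available examples), the same pipeline abstraction covers very different tasks:</p>
        <preformat>
# The Transformers library exposes heterogeneous pre-trained models behind a
# single pipeline API; only the task name and checkpoint change.
from transformers import pipeline

# Extractive question answering with a multilingual model.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")
answer = qa(question="Where is the HiTZ Center located?",
            context="The HiTZ Center is located in Donostia-San Sebastian, Spain.")
print(answer["answer"])

# Summarization with a different checkpoint, same calling convention.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
print(summarizer("NLP is present in our daily lives through search engines, "
                 "virtual assistants, chatbots and automatic translation systems.",
                 max_length=20, min_length=5)[0]["summary_text"])
        </preformat>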
        <p>For example, models such as GPT-3 or GPT-4 are private,
anglo-centric, and inaccessible to academic organisations
[19]. There are also worrying shortcomings in the text
corpora used to train these models, ranging from a lack
of representation of populations, to a predominance of
harmful stereotypes, and to the inclusion of personal
information.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology and Work Plan</title>
      <sec id="sec-3-1">
        <title>3.1. Objectives DeepKnowledge-EHU</title>
        <p>DeepKnowledge will build models that are capable of dealing with text generation tasks, as well as models that are trained in a multi-task fashion, which have been shown to generalize better and to yield good results in zero-shot and few-shot scenarios. We will also work towards filling the current gap in language models for these languages in specific domains, such as Health, Education and Social media. Regarding text processing applications, the research team has ample experience developing NLP tools, both basic NLP modules [20] and advanced semantic processing tools in many languages [15, 21, 22]. Following this, we list the specific objectives for DeepKnowledge-EHU:</p>
        <p>4. To explore how large language models can productively interact with existing semantic networks and ontologies (WP4).
5. To leverage the generated language models to develop state-of-the-art, ready-to-use, deep-learning linguistic processors for many NLP tasks, such as lemmatization, NER, SRL, POS tagging or Coreference Resolution, among others (WP2).
6. To improve qualitative and quantitative evaluation of text generation-based tasks such as text simplification or argument generation, and to organize a shared task to motivate work on this topic (WP5).
7. To leverage the generated models and new techniques of exploiting them for eLearning, Question Answering, Medical Text Processing, Misinformation detection and Conversational Agents (WP6).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Objectives DeepInfo-UNED</title>
        <p>DeepInfo-UNED collaborates with two institutions: (i) the Instituto Cervantes and (ii) the President Carter Foundation (USA). One of the goals of the Instituto Cervantes is the certification of human proficiency in the use of the Spanish language. The collaboration between our project and the Instituto Cervantes is focused on: (i) creating a dataset in Spanish for the evaluation of machine reading and comprehension capabilities, which will address the lack of training and evaluation resources for languages other than English, and (ii) developing automatic assisting methods to help evaluators prepare and check the exams.</p>
        <p>The Carter Foundation acts as an international observer in elections all over the world. Traditionally, these observers were a team of people who moved physically to the country and tracked the process. However, nowadays there is also a need to monitor political activity in social networks. Taking into account these two use cases, the specific objectives of DeepInfo-UNED are defined as follows:</p>
        <p>In this context of paradigm shift within the NLP community, DeepKnowledge will aim to develop new language models (i) with a multitask and multimodal training objective and (ii) for specific domains, and (iii) to explore novel methods of exploiting such language models, such as the use of prompts or text generation, which we believe will help these pre-trained models to ground their knowledge, improving their understanding and generalization skills.</p>
        <p>The project will build new language models for the official languages in Spain as well as English. The models will be based on new technologies, architectures and training paradigms that allow a better generalization between domains and languages. We will build generative models that allow the generation of text in these languages, which is needed in tasks such as summarization, simplification or the generation of counter-arguments against misinformation. Besides, the project will also build language models adapted to the specific domains of Health, Education and Social media.</p>
        <p>WP3: Novel paradigms for the exploitation of language models. Develop novel ways to exploit the full potential of large language models, including prompting, generation and multimodal training. The objective of such exploitation paradigms is two-fold: (i) to improve the overall language understanding capabilities of language models, and (ii) to make them usable for a great variety of applications and languages with minimal preparation effort, through zero-shot and few-shot learning.</p>
        <p>WP4: Knowledge Acquisition, Integration and Reasoning. The main objective of this work package is to investigate how large language models can productively interact with existing semantic networks: on the one hand, by helping in the development of broad-coverage lexical knowledge bases such as the Multilingual Central Repository [23] in the languages covered by the project and adapted to specific domains such as medicine; on the other hand, by using these large-scale knowledge bases to generate lexical semantic, world knowledge and common sense probes for testing the abilities of modern large language models.</p>
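        <p>As a purely illustrative sketch of how a knowledge base entry might be turned into a probe for a language model (the triple, template and checkpoint below are invented for this example and are not taken from the Multilingual Central Repository):</p>
        <preformat>
# Turn a simple lexical-semantic triple into a cloze probe and check whether
# a masked language model ranks the correct filler among its top predictions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

triple = ("dog", "hypernym", "animal")          # illustrative triple, not from the MCR
probe = f"A {triple[0]} is a kind of [MASK]."   # hypothetical cloze template

predictions = fill_mask(probe, top_k=5)
hit = any(p["token_str"].strip() == triple[2] for p in predictions)
print("gold filler in top-5:", hit)
        </preformat>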
        <p>WP5: Evaluation. The objective of this work package is to measure the research progress via objective evaluation metrics and relevant open evaluation campaigns. An important component will also be investigating the evaluation of tasks based on text generation (WP3). Datasets for Machine Comprehension and Question Answering in Spanish will be generated. Furthermore, we will organize a workshop on misinformation.</p>
        <p>WP6: Applications and Use Cases. This work package aims at demonstrating the scientific advances of DeepKnowledge in different scenarios. It will include applications in eLearning, recommender systems for education and research, question answering, reading comprehension, and misinformation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Concluding Remarks</title>
      <p>This paper outlines the DeepKnowledge project, which is focused on researching and incorporating the latest insights in deep learning technology, such as large pre-trained language models, transfer learning, few-shot and zero-shot capabilities, multimodal and multi-task processing, prompting, etc. DeepKnowledge will leverage deep learning techniques, large pre-trained language models and carefully designed datasets and knowledge bases to advance the state of the art towards natural language understanding for English, Spanish, Catalan, Basque and Galician in several domains and digital sectors. DeepKnowledge will also investigate new text generation approaches for applications such as argument generation, text simplification or abstractive summarization. Additionally, DeepKnowledge will apply the new language models in novel ways for tasks and applications such as misinformation detection, Question Answering or eLearning.</p>
      <p>Ongoing work can be checked on the project's website: http://ixa2.si.ehu.eus/deepknowledge/. Future work includes further experimentation on training LLMs for low-resource languages and on the evaluation of text generation, a crucial topic for understanding the performance of our models.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of DeepKnowledge (PID2021-127777OB-C21) and DeepInfo (PID2021-127777OB-C22), projects funded by MCIN/AEI/10.13039/501100011033 and by FEDER. Rodrigo Agerri was also funded by the RYC-2017-23647 fellowship (MCIN/AEI/10.13039/501100011033 and by ESF Investing in your future).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: North American Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T. J.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blecher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cucurull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Esiobu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Hartshorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kardas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kerkez</surname>
          </string-name>
          , M. Khabsa, I. M. Kloumann, A. V. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, ArXiv abs/2307.09288 (2023).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, ArXiv abs/2310.06825 (2023).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2021) 1–35.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] E. M. Bender, T. Gebru, A. McMillan-Major, M. Mitchell, On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in pre-training distributed word representations, in: Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan, 2018.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. M. Dai, Q. V. Le, Semi-supervised sequence learning, in: Advances in Neural Information Processing Systems (NeurIPS), 2015.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Annual Meeting of the Association for Computational Linguistics, 2019.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: NAACL, Association for Computational Linguistics, 2021.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 4996–5001.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. Agerri, E. Agirre, Lessons learned from the evaluation of Spanish Language Models, Procesamiento del Lenguaje Natural 70 (2023) 157–170.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, D. Roth, Recent advances in natural language processing via large pre-trained language models: A survey, ACM Computing Surveys 56 (2021) 1–40.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, et al., Pre-trained models: Past, present and future, AI Open 2 (2021) 225–250.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Q. Liu, D. Schlangen (Eds.), Proceedings of EMNLP: System Demonstrations, 2020, pp. 38–45.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. Floridi, M. Chiriatti, GPT-3: Its nature, scope, limits, and consequences, Minds and Machines 30 (2020) 681–694.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] O. Toporkov, R. Agerri, On the Role of Morphological Information for Contextual Lemmatization, Computational Linguistics (2024) 1–35. doi:10.1162/coli_a_00497.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] O. Sainz, I. García-Ferrero, R. Agerri, O. de Lacalle, G. Rigau, E. Agirre, GoLLIE: Annotation guidelines improve zero-shot information extraction, in: Twelfth International Conference on Learning Representations (ICLR 2024), 2024.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] I. García-Ferrero, R. Agerri, A. A. Salazar, E. Cabrio, I. de la Iglesia, A. Lavelli, B. Magnini, B. Molinet, J. Ramirez-Romero, G. Rigau, J. M. Villa-Gonzalez, S. Villata, A. Zaninello, Medical mT5: An Open-Source Multilingual Text-to-Text LLM for the Medical Domain, in: Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), 2024.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. Gonzalez-Agirre, E. Laparra, G. Rigau, Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base, in: GWC 2012: 6th International Global Wordnet Conference, 2012, p. 118.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>