Evaluation of Large Language Models in Multilingual Settings

Maite Heredia Arribas
Universidad del País Vasco / Euskal Herriko Unibertsitatea, Barrio Sarriena, 48940 Leioa, Bizkaia
maite.heredia@ehu.eus, ORCID: 0009-0005-6719-5433
Doctoral Symposium on Natural Language Processing, 26 September 2024, Valladolid, Spain.

Abstract
Evaluation is an essential step in developing NLP models, and it is gaining relevance as the growth of Large Language Models (LLMs) calls for more sophisticated metrics. These models have demonstrated very competitive results on a broad range of tasks and are showing new, broader skills that were previously unthinkable. Our research focuses on evaluating the multilingual capabilities of LLMs, especially through the creation of new benchmarks. LLMs have been reported to transfer tasks across languages and show great potential for adapting available resources to less-resourced languages. We aim to design new benchmarks for these languages to help reduce the gap between the leading languages in NLP (specifically English) and languages that are still lagging behind in resources and research. More concrete objectives include defining essential evaluation metrics for languages lacking benchmarks, comparing benchmark creation methods (automatic translation, human translation, and creation from scratch), and evaluating multilingual LLM performance with fine-tuning and zero-shot techniques. Initial efforts have focused on creating benchmarks for common sense reasoning and code-switched text. Future research will expand dataset creation efforts and explore fine-tuning strategies to enhance multilingual LLM performance.

Keywords
evaluation, Large Language Models, multilingualism, low-resource languages, dataset creation, cross-linguistic evaluation

1. Reason for the proposed research

Most modern benchmarks allow us to broadly categorize the performance of LLMs, but they generally fail in three areas: 1) they do not widely evaluate languages other than English and often overlook low-resource languages, as is the case for Iberian languages like Basque, Catalan and Galician [1]; 2) they are opportunistic, in that they are assembled from tasks that were already available, which are oftentimes not what we would naturally use LLMs for, and models that obtain remarkably good results on popular benchmarks have been shown to fail on simple test cases [2]; and 3) they generally measure a single performance metric, i.e., accuracy, which is ill-suited to measuring generative models' capabilities, since it does not capture semantic meaning [3]. To properly evaluate LLMs on languages other than English, we therefore need to devise new benchmarks and metrics that take into account more than a single performance metric and that are culturally and linguistically diverse.

Therefore, the main objective of this PhD project is to explore novel methods for the creation of relevant benchmarks for LLMs, specifically in and for multilingual settings. Using these benchmarks, we will be able to explore the multilingual capabilities of LLMs across high- and low-resource languages, their ability to transfer tasks to different languages, and their potential for devising new datasets.
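To make the third shortcoming concrete, the toy example below (our own illustration, not part of the proposed experiments; it assumes the bert-score package as one possible embedding-based metric) shows how exact-match accuracy rejects a perfectly acceptable paraphrase that a semantic similarity metric credits.

```python
# Hedged toy example: exact match vs. an embedding-based metric (BERTScore).
# The sentences are invented; any semantic similarity metric would make the same point.
from bert_score import score  # pip install bert-score

references = ["The meeting was moved to Friday."]
candidates = ["They rescheduled the meeting for Friday."]  # valid paraphrase

exact_match = float(candidates[0] == references[0])  # 0.0: a surface mismatch counts as wrong
_, _, f1 = score(candidates, references, lang="en", verbose=False)  # high semantic overlap
print(f"exact match: {exact_match:.2f}  BERTScore F1: {f1.item():.2f}")
```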
2. Background and related work

Large Language Models (LLMs) have recently led to a seemingly unprecedented level of performance in Natural Language Processing (NLP) [4, 5]. These models are trained on vast amounts of text, with the goal of learning representations of language that can then be transferred to new scenarios with a minimal amount of fine-tuning. While proprietary models such as ChatGPT have perhaps had the most media coverage, other models such as PaLM and LLaMA have also made progress.

Due to their black-box nature, LLMs are difficult to interpret and understand. The knowledge acquired by a model during training is distributed among hundreds of millions of parameters, and it is therefore very difficult to interpret the output produced by the model, or the underlying reasons that steer it to generate certain text. Moreover, large-scale models have shown unexpected capabilities that have surprised the research community, as they were not specifically designed to acquire them. These so-called "emergent abilities", that is, the capacity of models to solve tasks for which they have not been previously trained, given only a very small number of examples [6], have turned out to be one of the most important characteristics of large neural models and allow their deployment in many NLP applications and domains.

While the recent progress is undeniable, our understanding of what this progress really means is limited by the evaluation methods, metrics, and benchmarks currently available. Neural models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on [7], and there is a growing need to devise evaluation strategies that measure not only the progress of the models, but that also help us understand the properties of large language models: their capabilities, limitations, and risks. This, in turn, will help overcome various shortcomings of current LLM approaches that are critical for widespread adoption and that have so far not been successfully addressed. We focus on a specific capability of LLMs: multilingualism.

Several new benchmarks have been proposed for evaluating LLMs. To create MMLU, for example, [8] collect over 15,000 multiple-choice questions in English, taken from American GRE and medical licensing exam preparation courses, divided into 57 topics broadly grouped into STEM, Humanities, Social Sciences, and Other. [9] propose BIG-Bench with the purpose of creating a benchmark that is more difficult and longer-lasting than previous benchmarks; they take a crowd-sourcing strategy and collaboratively create 204 tasks, some of which are also available in languages other than English. [10] propose HELM (Holistic Evaluation of Language Models), which instead takes a top-down approach, spelling out the scenarios they evaluate on and those that are currently missing. Like other proposed benchmarks, they limit themselves to collecting already available resources for several tasks (question answering, information retrieval, summarization, sentiment analysis, toxicity, and text classification), but they additionally incorporate a number of evaluation metrics besides accuracy (bias, fairness, efficiency, robustness, toxicity, and calibration). All the datasets described so far are only available for English.
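As a rough illustration of how multiple-choice benchmarks like MMLU are commonly scored, the sketch below (our own hedged example; the model checkpoint and the toy question are placeholders, not part of any of the benchmarks above) compares the log-likelihood a causal LM assigns to each candidate answer and counts an item as correct when the gold option scores highest, which is exactly the single-metric, accuracy-only view discussed above.

```python
# Minimal sketch of log-likelihood scoring for a multiple-choice item.
# "gpt2" is a stand-in checkpoint; any causal LM from the Hugging Face Hub would do.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the question."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probability of each token given its left context
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the positions that belong to the appended option
    # (assumes the question tokens are unchanged when the option is appended,
    # which holds for typical BPE tokenizers; real scorers often length-normalize)
    n_question = q_ids.shape[1]
    return token_lp[0, n_question - 1:].sum().item()

question = "Question: Which planet is closest to the Sun?\nAnswer:"
options = ["Mercury", "Venus", "Earth", "Mars"]
scores = [option_logprob(question, o) for o in options]
prediction = options[scores.index(max(scores))]
print(prediction, scores)
```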
While there exist several LLMs for many languages other than English, the evaluation benchmarks for those languages clearly lag behind. In the Spanish State, the NEL project funded by the national government (https://www.boe.es/diario_boe/txt.php?id=BOE-A-2022-18816) seeks to build a new generation of LLMs for Iberian languages that compete in performance with English LLMs [11, 12]. The project emphasizes the need to research methods for building benchmarks that evaluate the capabilities of LLMs in these languages, an objective that is fully aligned with this thesis work.

Multilingual models are designed to handle multiple languages simultaneously, leveraging information shared across languages to improve performance on individual tasks. These models, such as mBERT, XLM-R, and mT5, are trained on multilingual corpora and have demonstrated significant improvements in understanding and generating text in various languages, including those with limited training data [13]. The capability of multilingual LLMs to transfer knowledge across languages makes them particularly valuable for less-resourced languages, where annotated data is scarce. Our research aims to address these challenges by creating new benchmarks specifically tailored for evaluating the multilingual capabilities of LLMs, thereby contributing to the advancement of NLP for a broader range of languages.

3. Description of the proposed research

3.1. Initial proposal

To enable the evaluation of multilingual LLMs, we first defined the following specific objectives:
• Define a set of variables / metrics that are relevant for evaluating LLMs in languages that do not currently have an evaluation benchmark. These could include (but are not limited to):
  – Grammatical abilities: How well does the model handle complex morpho-syntax, coreference, etc.?
  – Logic abilities: Can the model reason over complex logical structures?
  – Common sense: What common sense knowledge is encoded in the LLM?
  – Code generation abilities: Can the model be used to generate runnable code?
  – Truthfulness: How likely is it that the LLM presents fabricated information as truth?
  – Bias: What kinds of biases does the model have?
  – Toxicity: Under what circumstances does the model produce toxic output?
  – Non-standard language: How well can the model operate with text from social media, non-standard language varieties, or code-switching?
An extensive review of the literature will allow us to define the most relevant variables and metrics more clearly and to identify gaps in current resources and research that we can help fill.
• In order to create a benchmark, compare automatic translation of benchmarks, human translation, and creation from scratch.
• Compare the performance of multilingual / monolingual models using different fine-tuning and prompting techniques, and use this information to improve model weaknesses.

3.2. Current progress

So far we have tackled the creation of benchmarks for common sense, specifically for the NLI task (https://github.com/hitz-zentroa/xnli-eu) [14], and for non-standard language, more specifically social media text that includes code-switching [15]. We have centered our work on the Basque language, but we aim to expand our research to more languages. Our work so far has allowed us to explore the impact of different methods of creating evaluation sets [16] (machine translation, human translation, and creation from scratch), and we have been able to gather some initial conclusions; the sketch below illustrates the cross-lingual evaluation setting these experiments build on.
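The sketch shows zero-shot cross-lingual transfer for NLI in its simplest form: a multilingual encoder fine-tuned only on English NLI data is applied directly to a Basque premise-hypothesis pair. It is a hedged illustration rather than our exact experimental code; the checkpoint name is one publicly available XNLI-style model (any equivalent one could be substituted) and the Basque sentences are invented examples.

```python
# Zero-shot cross-lingual transfer sketch: English-only NLI fine-tuning, Basque inference.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "joeddav/xlm-roberta-large-xnli"  # assumed available on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "Gaur goizean euria egin du Bilbon."      # "It rained in Bilbao this morning."
hypothesis = "Gaur eguraldi ona egin du Bilbon."    # "The weather was good in Bilbao today."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
label = model.config.id2label[logits.argmax(-1).item()]
print(label)  # a contradiction label would be the expected outcome here
```

The same pipeline can be adapted to translate-train (machine-translate the English training data into the target language before fine-tuning) or replaced altogether by zero-shot prompting of a generative LLM.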
More specifically, we have corroborated that machine translation is a useful resource for creating datasets, especially if the goal is model comparison, but, when available, human post-editing and even creation from scratch are preferable and can assess the capabilities of models more accurately, which underscores the importance of approaches that preserve linguistic nuances and ensure benchmark quality. Apart from creating our datasets and making them publicly available, we have also performed a first batch of experiments to test different multi- and cross-lingual strategies for leveraging the existing resources in English and other high-resource languages for languages with fewer resources. In agreement with other researchers' results [13], our experiments show that using multilingual LLMs with strategies like zero-shot cross-lingual transfer, translate-train or zero-shot prompting can be feasible alternatives in scenarios where there are not enough resources for more standard approaches.

3.3. Next steps

We would like to continue our work so far by developing more datasets, using the knowledge we have gathered about the importance of human supervision, and testing whether our findings hold in different tasks and settings. Currently, we are also exploring the possibility of creating datasets with generative autoregressive LLMs, and we would like to continue researching cross-lingual transfer.

4. Methodology and proposed experiments

As an overview, our proposed experiments will mainly consist of researching the most efficient methods for creating new benchmarks, which involves acquiring resources and processing, translating and/or annotating them. To ensure the effectiveness of these resources, it will be essential to perform experiments with them, which may involve fine-tuning and testing LLMs. We will publicly release the created benchmarks, as well as an environment for evaluating LLMs, and share them with the scientific community. In this way we will provide comparable results and gain insight into possible improvements. Although some of the ideas may not yield the desired results, we will pursue all the objectives and learn from the outcomes. Hypotheses that are confirmed will be shared at major peer-reviewed conferences.

4.1. Main Hypotheses

Here we present the four main hypotheses that guide our work. These hypotheses are further elaborated and expanded as we progress in our research, incorporating new insights and developments.

H0: Transferring English benchmarks to other languages. The benchmarks and metrics that have been developed for English are not suitable for measuring performance in many other languages, especially low-resource languages, and have to be adapted and improved.

H1: Human intervention. Human intervention is essential to create reliable benchmarks for languages other than English, especially low-resource ones.

H2: Cultural impact. Benchmarks devised for English are culturally bound to the Anglosphere. Therefore, the development of resources for different languages should not reproduce these cultural biases, but rather adapt to the culture of each language.

H3: Non-standard language. A great deal of well-known and widely used benchmarks and metrics are not representative of real language production, but rather of a standardized form of language.

4.2. Research Tasks (RT) and Questions (RQ)

RT0. Prepare the research scenario.
The initial task involves defining the variables of interest, finding metrics to correctly measure these variables, and collecting relevant datasets that are available in English and other high-resource languages. We will perform initial experiments in English to determine the feasibility of our metrics and tests.
RQ0.A) What current datasets are available in English?
RQ0.B) Which variables (performance, bias, toxicity, etc.) do these datasets measure?
RQ0.C) Do the current metrics correctly capture the most important variables?

RT1. Explore the most appropriate method for creating successful LLM benchmarks for low-resource languages.
In this task, we will compare the strengths and weaknesses of creating new benchmarks through translation-based transfer versus creating the resources from scratch.
RQ1.A) Can automatic translation create robust benchmarks in new languages?
RQ1.B) Can human translation create robust benchmarks in new languages?
RQ1.C) What culturally specific artifacts are lost in translation?
RQ1.D) How does translation-based benchmarking differ from creating new benchmarks from scratch?

RT2. Determine the correlations between the performance of LLMs on standardized tests used as benchmarks and their performance on other tasks.
RQ2.A) How does model performance on available standardized tests correlate with performance on NLP-style tasks?
RQ2.B) Does performance on standardized tests or NLP tasks correlate with the ability of models to perform useful functions, e.g., writing a formal email conditioned on some information?

RT3. Compare monolingual and multilingual models, including English models.
RQ3.A) What relative strengths and weaknesses do multilingual and monolingual models have?
RQ3.B) How does the monolingual performance of these models compare to English models such as GPT-4?
RQ3.C) How do fine-tuning, instruction tuning, and prompt engineering change the performance of these models?

4.3. Yearly research schedule

First year: In the first year we will mainly focus on tasks related to RT0. The goal is to prepare the research scenario, so we will need to define the main variables of interest and collect available English datasets. We foresee the following tasks:
• We will gather the basic resources, such as a collection of the main datasets used to evaluate LLMs in English, as well as any standardized tests available in Iberian languages, which will help answer RQ0.A and RQ0.B.
• We will create a taxonomy of the variables that are measured in these datasets and their corresponding metrics, in order to answer RQ0.B.
• We will design and run experiments with state-of-the-art English LLMs on the datasets from the previous tasks to answer RQ0.C.
• We aim to submit the answers to these RQs to a top journal or conference.

Second year: During this year we will mainly focus on tasks RT1 and RT2. For that, we will first experiment with methods to create a comprehensive benchmark for Iberian LLMs, comparing translation methods and in-language annotation. In the second part of the year, we will start working in parallel on the comparison of standardized tests with NLP tasks. For this year, we foresee the following tasks:
• We will use automatic translation to transfer the available English resources to Iberian languages and perform an analysis of the resulting translated datasets, paying attention to the errors that are introduced, the topics that are included, and LLM performance. This task will help answer RQ1.A.
• From the analysis of the previous step, we will choose a subset of data to be translated by human translators. We will perform a similar analysis of errors, topics, and LLM performance. This task will help answer RQ1.B.
• Finally, we will carry out an annotation project on a subset of the datasets used for human translation. The goal will be to create similar datasets, but localized and annotated by native speakers. We will compare the distribution of topics, as well as LLM performance on this data, with the translated versions. This task will allow us to answer both RQ1.C and RQ1.D.
• To answer RQ2.A, we will compare LLM performance on our benchmark and on available standardized exams in Iberian languages. We will perform a regression analysis of performance improvements on the standardized exams and other tasks to determine what relation exists.
• For RQ2.B, we will need to determine a small set of tasks that users of LLMs would be interested in, which we will gather via a small community survey. Once we have the results of the survey, we can test the models on these tasks and perform an analysis similar to the one used to answer RQ2.A.
• Note that, due to the large number of tasks planned for the second year, we may need to postpone some tasks to the third year.

Third year: During the third year, we will focus on tasks related to RT3. Our objective is to compare the strengths of monolingual Iberian LLMs, multilingual Iberian LLMs, massively multilingual LLMs, and monolingual English LLMs.
• In order to answer RQ3.A, we will first compare available monolingual and multilingual LLMs on our benchmark.
• We will compare these results to the results of English LLMs to answer RQ3.B.
• Given the insights from the previous experiments, we will propose fine-tuning, instruction tuning, and prompting methods to improve the weaknesses of monolingual models (RQ3.C).
• Depending on the answers to the RQ3 questions, we will submit our work to a top conference.

Fourth year: The most interesting conclusions obtained in previous years will be rounded off with new experiments in the first months of the year. The thesis will then be written and the defense prepared. The following tasks are planned for this purpose:
• Finish tasks and experiments from previous years.
• Submit related research to a top journal or conference.
• Write up the PhD thesis.

5. Specific issues of research to be discussed

Our research is mainly focused on the creation and evaluation of reliable benchmarks for low-resource languages, which is undoubtedly a critical bottleneck in the development of NLP applications. Creating new datasets is always a time-consuming task, especially when working with models that need large amounts of training data. Most new datasets are either opportunistic, in the sense that they stem from already existing data that was annotated for different purposes, or are created through crowd-sourcing, which can result in lower-quality annotations (and, sometimes, unethical work practices [17]). These can be valid approaches to dataset creation, but we argue that, for benchmarks to be more useful and to measure the capabilities of models more accurately, they should ideally be carefully designed, linguistically motivated, and annotated by professionals from different areas of expertise, depending on the task. These considerations make our work slower and significantly more challenging at different steps of the workflow: designing the process, collecting data and, most notably, annotating.
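As a small, hedged illustration of the kind of quality control this workflow implies (our own example with made-up labels, not data from our projects), chance-corrected agreement between annotators, such as Cohen's kappa, is one quick signal of whether guidelines and annotations are reliable enough for a benchmark:

```python
# Toy inter-annotator agreement check with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["entailment", "neutral", "contradiction", "neutral", "entailment"]
annotator_b = ["entailment", "neutral", "contradiction", "entailment", "entailment"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above roughly 0.8 are usually read as strong agreement
```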
On the other hand, it is worth mentioning that one possible line of research in this thesis is to evaluate ethical aspects of large language models, including biases and harmfulness. These are sensitive topics that will be approached with the necessary care and consideration.

References

[1] P. Joshi, S. Santy, A. Budhiraja, K. Bali, M. Choudhury, The state and fate of linguistic diversity and inclusion in the NLP world, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 6282–6293. URL: https://aclanthology.org/2020.acl-main.560. doi:10.18653/v1/2020.acl-main.560.
[2] S. R. Bowman, G. Dahl, What will it take to fix benchmarking in natural language understanding?, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4843–4855. URL: https://aclanthology.org/2021.naacl-main.385. doi:10.18653/v1/2021.naacl-main.385.
[3] T. Linzen, How can we accelerate progress towards human-like linguistic generalization?, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5210–5217. URL: https://aclanthology.org/2020.acl-main.465. doi:10.18653/v1/2020.acl-main.465.
[4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[5] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, N. Fiedel, PaLM: Scaling language modeling with pathways, 2022. arXiv:2204.02311.
[6] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, W. Fedus, Emergent abilities of large language models, 2022. arXiv:2206.07682.
[7] S. Gehrmann, E. Clark, T. Sellam, Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text, 2022. arXiv:2202.06935.
[8] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, Proceedings of the International Conference on Learning Representations (ICLR) (2021).
[9] BIG-Bench authors, Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. arXiv:2206.04615.
[10] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha, N. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, Y. Koreeda, Holistic evaluation of language models, 2023. arXiv:2211.09110.
[11] S. Da Dalt, J. Llop, I. Baucells, M. Pamies, Y. Xu, A. Gonzalez-Agirre, M. Villegas, FLOR: On the effectiveness of language adaptation, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 7377–7388. URL: https://aclanthology.org/2024.lrec-main.650.
[12] J. Etxaniz, O. Sainz, N. Perez, I. Aldabe, G. Rigau, E. Agirre, A. Ormazabal, M. Artetxe, A. Soroa, Latxa: An open language model and evaluation suite for Basque, 2024. URL: https://arxiv.org/abs/2403.20266. arXiv:2403.20266.
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, 2020. arXiv:1911.02116.
[14] A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, V. Stoyanov, XNLI: Evaluating cross-lingual sentence representations, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 2475–2485. URL: https://aclanthology.org/D18-1269. doi:10.18653/v1/D18-1269.
[15] G. Winata, A. F. Aji, Z. X. Yong, T. Solorio, The decades progress on code-switching research in NLP: A systematic survey on trends and challenges, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2936–2978. URL: https://aclanthology.org/2023.findings-acl.185. doi:10.18653/v1/2023.findings-acl.185.
[16] M. Artetxe, G. Labaka, E. Agirre, Translation artifacts in cross-lingual transfer learning, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7674–7684. URL: https://aclanthology.org/2020.emnlp-main.618. doi:10.18653/v1/2020.emnlp-main.618.
[17] B. Shmueli, J. Fell, S. Ray, L.-W. Ku, Beyond fair pay: Ethical implications of NLP crowdsourcing, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 3758–3769. URL: https://aclanthology.org/2021.naacl-main.295. doi:10.18653/v1/2021.naacl-main.295.