<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Mult-IT: Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Rinaldi</surname></persName>
							<email>matteo.rinaldi@unito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Turin</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jacopo</forename><surname>Gili</surname></persName>
							<email>jacopo.gili584@edu.unito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Turin</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maria</forename><surname>Francis</surname></persName>
							<email>maria.francis287@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="department">CLCG</orgName>
								<orgName type="institution">University of Groningen</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">University of Trento</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mattia</forename><surname>Goffetti</surname></persName>
							<email>mattia_goffetti@alphatest.it</email>
							<affiliation key="aff3">
								<orgName type="institution">Alpha Test S.r.l.</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
							<email>viviana.patti@unito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Turin</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Malvina</forename><surname>Nissim</surname></persName>
							<email>m.nissim@rug.nl</email>
							<affiliation key="aff1">
								<orgName type="department">CLCG</orgName>
								<orgName type="institution">University of Groningen</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff4">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Mult-IT: Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4BBACDDAF4C2C893624EE36697B8C058</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>CALAMITA Challenge</term>
					<term>Italian</term>
					<term>Benchmarking</term>
					<term>Multiple-Choice Questions</term>
					<term>LLMs</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Multi-choice question answering (MCQA) is a powerful tool for evaluating the factual knowledge and reasoning capacities of Large Language Models (LLMs). However, there is a lack of large-scale MCQA datasets originally written in Italian. Existing Italian MCQA benchmarks are often automatically translated from English, an approach with two key drawbacks: firstly, automatic translations may sound unnatural, contain errors, or use linguistic constructions that do not align with the target language; secondly, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. To address this gap, we present Mult-IT, an MCQA dataset comprising over 110,000 manually written questions across a wide range of topics. All questions are sourced directly from preparation quizzes for Italian university entrance exams, or for exams for public sector employment in Italy. We hope that this contribution enables a more comprehensive evaluation of LLMs' proficiency, not only in the Italian language, but also in their grasp of Italian cultural and contextual knowledge.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Challenge: Introduction and Motivation</head><p>In recent years, multi-choice question answering (MCQA) has established itself as a powerful method to test the factual knowledge and reasoning abilities embedded in large language models (LLMs) as a byproduct of the language modelling objective <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. The evaluation of MCQAs can be easily automated, offering a significant advantage over other benchmarking formats such as open-ended text responses. In addition, with appropriately targeted prompting, the limited number of admissible outputs makes model answers easy to parse. However, existing Italian MCQA benchmarks are typically automatic translations from English, which raises two issues. The first concerns language quality: automatic translations may sound unnatural, disrupt the coherence of discourse, and encourage the presence of linguistic constructions that reflect the source language rather than the target one <ref type="bibr" target="#b8">[9]</ref>. The second issue relates to culture and societal norms: text translated from English to Italian will lack topical biases, preferences, conventions, and ways of expressing ideas that are unique to Italian culture. Thus, while the text may be expressed in the Italian language, its content and underlying norms will continue to represent an Anglo-centric, predominantly American perspective.</p><p>More generally, training data for LLMs is biased towards English content, and as a result, there is often a gap between English and non-English performance <ref type="bibr" target="#b9">[10]</ref>. 
For example, the Common Crawl dataset <ref type="foot" target="#foot_0">2</ref> , often used as a base for more refined datasets to be employed in the pre-training of LLMs, is composed of 45% English content, while the data for languages such as Spanish, French, Italian, and Chinese are all below 5% each, with the only exceptions being Russian (6.2%) and German (5.1%).</p><p>Creating a large-scale multi-choice question answering benchmark using original Italian data will make it possible to investigate the Italian abilities of LLMs in a more natural and transparent way, possibly also leading to a better understanding of how to make multilingual models better at Italian. It will also serve as a core benchmark for assessing the performance of monolingual Italian LLMs. If similar datasets are collected natively for other languages, Mult-IT can be part of a larger MCQA benchmark which is multilingual in the truest sense.</p><p>Mult-IT, presented at CALAMITA <ref type="bibr" target="#b10">[11]</ref>, is the first massive Multi-Choice-Question-Answering dataset specifically designed for the Italian language which draws on Italian culture and Italian-focused knowledge. By providing a comprehensive, culturally relevant benchmark for the Italian language, we aim to set a precedent for the development of similar resources in other languages and cultures, ultimately contributing to a more diverse and inclusive AI landscape.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Challenge: Description</head><p>This challenge involves a multiple-choice question answering task. The model is prompted with a simple instruction (see Box 1), followed by a question and a set of three to five possible answers, depending on the source and topic of the question. Among these answers, only one is correct, and the others are distractors. The model is expected to identify the correct answer and return the letter corresponding to the option deemed correct.</p><p>All questions in the benchmark have been manually crafted for the purpose of training or testing students, job applicants, or learners across a range of topics, including general knowledge and more specialised subjects. These questions make up the Mult-IT dataset: Multiple Choice Questions on Various Topics in Italian, which we are introducing in this contribution. The details of the dataset are described in Section 3. The defining feature of the Mult-IT challenge is that all of the MCQs are natively Italian, both in language and in content. While this is an advantage for gaining a better understanding of model behaviour on Italian data, we do expect a decline in model performance. Considering that even models trained on multilingual data have a heavy bias towards English and American-centric culture, it is expected that the correctness of the answers may be affected by a cultural (and possibly language) gap. Should the set of models tested also include Italian monolingual or bilingual English-Italian models trained on a substantial amount of Italian text, this benchmark will make it possible to underscore differences in performance possibly associated with the language specificity of such models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head><p>Mult-IT contains quizzes designed to assess candidates' knowledge in open competitive exams, whether for admission to national universities or for positions in Italian institutions. This approach offers several advantages. First of all, these public competitions encompass very general topics such as language comprehension, basic history, and common knowledge, but also more specialised ones, focusing on specific laws needed for certain professions or the security measures required for jobs such as police officer or firefighter. Our benchmark therefore contains questions ranging from a low level of difficulty to a very high and specific one, setting high standards for the performance of the models, and it may also be useful for assessing specific knowledge valuable for the adoption of models in Public Administration scenarios. The inclusion of profession-specific questions in Mult-IT tests the ability of LLMs to apply their knowledge in practical, real-world scenarios, a feature that could prove particularly valuable in assessing the potential of AI systems to support specialised fields and decision-making processes in professional and administrative contexts in the Italian landscape. Moreover, the quizzes contained in the dataset also present challenges regarding reasoning, such as logical thinking and mathematical reasoning, as well as quizzes specifically designed to assess knowledge and mastery of the Italian language, for example, text comprehension or detailed understanding of grammatical phenomena.</p><p>Mult-IT consists of two core subsets, which are divided by the origin of the data. Both subsets are made of quizzes that test knowledge of general Italian culture and that are used in public recruitment processes for government-based positions. They are described in more detail below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-A</head><p>Mult-IT-A is a collection of MCQs provided by Alpha Test. It contains a total of 1,692 questions in Italian, spanning 17 categories which correspond to topics featured in entrance exams for Italian universities (see Table <ref type="table" target="#tab_1">1</ref> below for details) or in question answering tests employed in public competitions. The quizzes falling into the categories of law, pedagogy, psychology and criminology originate from public competitions. For each question, four or five possible answers are provided, out of which only one is correct. An example from the topic 'sinonimi' (synonyms) is shown in Figure <ref type="figure" target="#fig_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Indolente e' un sinonimo di:</head><p>(A) tenero, (B) doloroso, (C) pigro, (D) insensibile</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-C</head><p>Mult-IT-C is a large collection of MCQs, organised in groups of questions ("quizzes") around multiple topics, which we have obtained from publicly accessible online platforms through data gathering and web scraping. The quizzes are meant to be used by people who need to prepare for job applications in the public sector. One of the most interesting features of Mult-IT-C is its size: it contains more than 100,000 questions, making it almost six times larger than MMLU. An example from the topic 'geografia' (geography) is shown in Figure <ref type="figure" target="#fig_1">2</ref>.</p><p>In quale nazione si trova il Lago Balaton? </p><formula xml:id="formula_0">(A) Ucraina, (B) Ungheria, (C) Romania, (D) Repubblica Ceca (E) Bulgaria</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Origin of data</head><p>Mult-IT-A All the materials of Mult-IT-A were obtained thanks to the generosity of Alpha Test<ref type="foot" target="#foot_1">3</ref> . Alpha Test S.r.l. is an Italian publishing house and educational training company, founded in Milan in 1987, that specialises in study aid materials and courses for high school, university, professional tests, exams and certifications. Alpha Test is the main reference for high school students preparing for university admission. Each year, Alpha Test gathers new data from the entrance exams of public and private universities and military schools, mainly in the form of multiple choice questions. The publishing house enhances such materials with comments and explanations, and creates variations or completely new versions of the original quizzes. All the materials in Mult-IT-A have been sourced from original, public data, and represent a varied sample of quizzes about general culture, STEM, and juridical disciplines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-C</head><p>All the materials of Mult-IT-C were obtained through a web-scraping process from the website "Concorsi Pubblici"<ref type="foot" target="#foot_2">4</ref> via customised Python scripts. While there exist many websites collecting public competition exams, we found Concorsi Pubblici to be the most complete. Because the same public competition can be listed on several platforms, gathering all the data from a single website avoided the risk of data duplication. The quizzes on Concorsi Pubblici are organised by topic (see Appendix A), and were extracted for the time interval 1997-2024.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data format</head><p>Overall, the data format is consistent across the two Mult-IT subsets, which allows for a single evaluation procedure on Mult-IT. The larger size of the Mult-IT-C dataset allowed us to include additional information, including details about the quiz's administration, presented in the form of quiz blocks and a multi-level topic taxonomy. This feature is absent in Mult-IT-A, as its questions are collected by subject without being grouped into quizzes.</p><p>Data fields common to all of Mult-IT are:</p><p>• origin: Either 'C' or 'A', to indicate whether the question belongs to Mult-IT-A or Mult-IT-C.</p><p>• question: The question text.</p><p>• choices: The list of possible answers.</p><p>• answer: The array index corresponding to the correct answer in the choices array.</p><p>These common fields are crucial to the evaluation task, but each subset includes additional fields.</p></div>
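As a rough illustration of this shared schema, a minimal loader might validate the common fields while reading a JSONL file. This is a sketch: the helper name and path handling are ours, not part of the released data.

```python
import json
from typing import List

def load_multit(path: str) -> List[dict]:
    """Read a Mult-IT JSONL file, checking the common fields.

    Field names follow the shared schema of Mult-IT-A and Mult-IT-C;
    the helper name is illustrative.
    """
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            assert item["origin"] in ("A", "C")       # subset marker
            assert isinstance(item["choices"], list)  # list of options
            # answer is an index into the choices array
            assert 0 <= item["answer"] < len(item["choices"])
            items.append(item)
    return items
```

Because the two subsets share these fields, the same loader (and hence the same evaluation loop) works for both files.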
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-A</head><p>Examples taken from Mult-IT-A are given in Figure <ref type="figure" target="#fig_2">3</ref>.</p><p>{ "origin": "A", "topic": "informatica", "question": "Le dimensioni del monitor si misurano in:", "choices": [ "megahertz", "pixel", "centimetri", "pollici" ], "answer": 3 }, { "origin": "A", "topic": "psicologia e sociologia del disadattamento", "question": "Come viene definito lo stimolo funzionale a provocare un cambiamento?", "choices": [ "Stress", "Output", "Input", "Matrice" ], "answer": 0 }</p><p>The only additional field is topic, indicating the topic of the question. Its distribution, together with statistics on token and character counts, is shown in Figure <ref type="figure" target="#fig_5">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-C</head><p>The dataset consists of two files: quiz.jsonl contains the actual questions, while metadata.jsonl contains additional information about the questions. An example from the quiz.jsonl file is given in Figure <ref type="figure" target="#fig_3">4</ref>.</p><p>Data fields unique to this subset are:</p><p>• quiz_id: The ID of the quiz to which the question pertains.</p><p>• question_id: The unique identifier of the question inside the quiz. In combination with quiz_id, it forms the unique identifier of each question and can be used to retrieve the metadata of the question from the metadata.jsonl file.</p><p>Data fields of the metadata are:</p><p>• id: The unique identifier of the quiz.</p><p>• title: Title of the quiz, sourced from the original website.</p><p>• tags: List of word tags.</p><p>The metadata also include source, difficulty, and the three taxonomy levels class_lev1, class_lev2, and class_lev3 (see Figure 5).</p><p>{ "quiz_id": 2250, "question_id": 20, "question": "La Costituzione riconosce allo Stato una potesta' legislativa esclusiva in materia di:\n\n", "choices": [ "organizzazione della rete scolastica", "norme generali sull'istruzione", "ricerca scientifica e tecnologica", "istruzione professionale" ], "answer": 1 } { "quiz_id": 1253, "question_id": 63, "question": "In un ingranaggio con piu' ruote dentate, una ruota denominata R1 ha 25 denti e fa muovere una seconda ruota denominata R2 da 50 denti, che a sua volta fa muovere una terza ruota R3 da 150 denti. Se la ruota dentata R3 fa un giro e mezzo, quanti ne fa la ruota dentata R1?\n\n", "choices": [ "3", "5", "6", "9", "12" ], "answer": 4 }</p><p>An example of an item from the metadata file is given in Figure 5.</p></div>
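Given this two-file layout, quiz items can be joined with their quiz-level metadata. The sketch below assumes, as the examples in Figures 4 and 5 suggest, that the id field of metadata.jsonl matches quiz_id in quiz.jsonl; the file paths and helper name are illustrative.

```python
import json
from typing import Dict, List

def attach_metadata(quiz_path: str, meta_path: str) -> List[dict]:
    """Attach quiz-level metadata to each Mult-IT-C question.

    Assumption: metadata.jsonl is keyed by `id`, matching `quiz_id`
    in quiz.jsonl. (quiz_id, question_id) identifies a question;
    quiz_id alone retrieves the shared metadata record.
    """
    meta: Dict[int, dict] = {}
    with open(meta_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            meta[record["id"]] = record
    joined = []
    with open(quiz_path, encoding="utf-8") as f:
        for line in f:
            q = json.loads(line)
            q["metadata"] = meta.get(q["quiz_id"], {})
            joined.append(q)
    return joined
```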
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Zero-shot prompting</head><p>We evaluate our models in a zero-shot setting, thereby imitating the conditions of a real use-case scenario. The prompt we chose is designed to encourage the model to output only the letter corresponding to the answer. The original prompt, together with its English translation, is presented in Box 1.</p><p>{ "id": 2250, "title": "Area 3 Giuridico Amministrativa Finanziaria -25 domande concorso dirigente scolastico Miur", "tags": [ "concorsi dirigenti scolastici", "concorso dirigente scolastico", "dirigente scolastico concorso dirigente scolastico 2017", "miur", "miur concorso", "miur concorso dirigente scolastico miur concorsi scuola", "concorso scuola", "bando concorso scuola", "bandi miur" ], "class_lev1": [ "Miur", "dirigente scolastico" ], "source": [ "Fgl Cgil, Miur" ], "difficulty": [ "medio" ], "class_lev2": [ "Istruzione", "Altre" ], "class_lev3": [ "Societa' e Diritto", "Altro" ] } </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prompt for the LLM</head><p>Di seguito è riportata una domanda a scelta multipla e varie possibili risposte, ciascuna indicata da una lettera. Scegli la risposta che meglio risponde alla domanda, e riporta in output soltanto la lettera corrispondente a quella risposta, senza spiegazioni.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Below is a multi-choice question together with possible answers, each indicated by a letter.</head><p>Choose the best answer for the question, and report as output only the letter corresponding to that answer, without any explanation.</p><p>Box 1: Zero-shot prompt and English translation.</p><p>We decided to write the prompt in Italian in order to better represent a multilingual scenario. The prompt does not contain any information about the subject of the question or any other informative cues. In this way, our benchmark not only tests the model on question answering, but also indirectly tests its instruction-following abilities in a language different from English.</p><p>Previous work on evaluating the performance of LLMs on MCQ datasets has identified two aspects which can interfere with the model's answers and therefore its accuracy. One has to do with the order of the possible answers: Wang et al. <ref type="bibr" target="#b11">[12]</ref> show that the first presented option tends to be preferred in the model's answer, making it important to take the order of possible answers into account. The other has to do with the formulation of the prompt (and even of the question): Singhal et al. <ref type="bibr" target="#b12">[13]</ref> experiment with multiple types of prompts and show that prompt formulation affects the model's output.</p><p>Because the position of the correct answer in the original data was already randomly distributed, which we verified with a supplementary analysis (see Appendix B), performing a random permutation of the possible answers was not necessary.</p></div>
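A minimal sketch of how one such prompt could be assembled: the instruction text is reproduced verbatim from Box 1, while the exact layout of the lettered options is our assumption.

```python
import string

# Instruction text reproduced verbatim from Box 1.
INSTRUCTION = (
    "Di seguito è riportata una domanda a scelta multipla e varie possibili "
    "risposte, ciascuna indicata da una lettera. Scegli la risposta che "
    "meglio risponde alla domanda, e riporta in output soltanto la lettera "
    "corrispondente a quella risposta, senza spiegazioni."
)

def build_prompt(question: str, choices: list) -> str:
    """Format one Mult-IT item as a zero-shot prompt: instruction,
    question, then options labelled (A), (B), ... The option layout
    is an assumption, not the released prompt template."""
    options = "\n".join(
        f"({letter}) {choice}"
        for letter, choice in zip(string.ascii_uppercase, choices)
    )
    return f"{INSTRUCTION}\n\n{question}\n{options}"
```

Since the items have three to five choices, the letters range over (A)-(E) at most.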
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Data statistics</head><p>Mult-IT-A The Mult-IT-A dataset is composed of 1,692 questions, spanning 17 topics, all centered around knowledge required for entrance exams at Italian universities. The topics, and some additional information on the dataset composition, are provided in Table <ref type="table" target="#tab_1">1</ref>.</p><p>On average, questions are 83.76 characters long and contain 25 tokens when counted with the tiktoken cl100k_base<ref type="foot" target="#foot_3">5</ref> tokenizer, or 16.5 when counted with the spaCy<ref type="foot" target="#foot_4">6</ref> library using the it_core_news_lg model<ref type="foot" target="#foot_5">7</ref>.</p><p>Further statistics about quiz distribution and answer position are available in Appendix ?? and C.</p><p>It is worth noting that permuting the order of the answers would be recommended to avoid any kind of imbalance, as Mult-IT-A shows an uneven correct-answer distribution, inherited from the source data, that leans heavily towards the first choice.</p></div>
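A permutation of this kind can be sketched as follows; the field names follow Section 3.2, and the helper itself is illustrative, not part of the released pipeline.

```python
import random

def permute_choices(item: dict, rng: random.Random) -> dict:
    """Randomly permute the choices of one question, updating the
    `answer` index accordingly, so that the position of the correct
    option no longer reflects the first-choice skew of the source data.
    A sketch; field names follow Section 3.2."""
    order = list(range(len(item["choices"])))
    rng.shuffle(order)
    return {
        **item,
        "choices": [item["choices"][i] for i in order],
        # the correct option moves to wherever its old index landed
        "answer": order.index(item["answer"]),
    }
```

Passing a seeded random.Random keeps the permutation reproducible across evaluation runs.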
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-C</head><p>The Mult-IT-C dataset is composed of 108,773 questions divided into 4,129 quizzes.</p><p>To avoid confusion, we use unambiguous names for the items: a "quiz" is defined as a set of multiple "questions". Quizzes come from real-world exams, so each is provided with a specific name and a categorisation originating from the original data source.</p><p>Table 1: Mult-IT-A: topics included in the dataset, number of questions per topic, total tokens per topic, and average question length per topic in terms of tokens. The topic "pedagogia" in the Table is short for "pedagogia con particolare riferimento agli interventi relativi all'osservazione e al trattamento dei detenuti e degli internati"; the topic "elementi di diritto costituzionale ed amministrativo" is short for "elementi di diritto costituzionale ed amministrativo con particolare riferimento al rapporto di pubblico impiego".</p><p>A quiz can contain a variable number of questions. The average number of questions per quiz is 26, and the maximum is 250. There are 1,623 quizzes with more than 25 items, 298 with more than 50, and only 22 with more than 100 items. The original categorisation made by the authors of the website "Concorsi Pubblici" was problematic for our purposes: some categories were near-duplicates of each other, differing only in a few words. Moreover, we believed that 186 categories were too many for a meaningful visualisation and management of the data. For this reason, we created a three-level hierarchy in which the first (bottom) level corresponds to the original categorisation of the data, the second level groups the categories into 36 areas, and the third, most abstract level has only 7 categories. The drawback of this approach, as can be seen in the tables and graphs contained in Appendix A, is that in both supplementary categorisations a significant number of quizzes fall into the category "Other".</p><p>Nonetheless, we believe that this abstract categorisation is useful for a general look at the data composition, and thus at the performance of the models, in terms of macro-areas. On the other hand, keeping the original, very detailed categorisation in the data allows for a more in-depth analysis of model performance on specific aspects. In Appendix A, all the statistics of the 186 categories are listed in the form of a table. To appreciate the level of specificity reached by the first level of categorisation, it is interesting to notice categories such as "Verbs", "Diphthongs", or "Word Meanings", which refer to specific language abilities. These categories are grouped at level 2 as "Linguistic Competence" and at level 3 as "Language". As another example, there are categories that refer to specific aspects of the Italian Public Administration: at level 1 we find fields such as "INPS", the National Institute for Social Security, or "ASL", the Local Health Authority.</p><p>We believe that having such a precise categorisation at our disposal is of great help in understanding the abilities and weaknesses of models in very specific aspects, thus being useful, on the one hand, for assessing the possibility of directly employing models in the Italian public administration and, on the other, for improving the scientific understanding of models and how they deal with different kinds of challenges. This last aspect can also be helpful for interpretability studies of LLMs.</p><p>On average, questions are 104 characters long and contain 27.5 tokens when counted with the tiktoken cl100k_base<ref type="foot" target="#foot_6">8</ref> tokenizer, or 19.8 when counted with the spaCy library<ref type="foot" target="#foot_7">9</ref> using the it_core_news_lg model<ref type="foot" target="#foot_8">10</ref>. The longest question is 1,363 tokens long. </p></div>
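Length statistics of this kind can be reproduced in outline. Note that the figures reported in the paper use the tiktoken cl100k_base and spaCy it_core_news_lg tokenizers; the dependency-free sketch below falls back on character and whitespace-token counts, so its numbers will differ from the reported ones.

```python
def question_stats(questions):
    """Average character length and token count of a list of question
    strings. Whitespace splitting is a stand-in for the tiktoken and
    spaCy tokenizers used for the reported statistics."""
    n = len(questions)
    avg_chars = sum(len(q) for q in questions) / n
    avg_tokens = sum(len(q.split()) for q in questions) / n
    return avg_chars, avg_tokens
```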
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation</head><p>We will use accuracy to evaluate the LLMs' performance on Mult-IT. Accuracy is defined as the ratio of correctly answered questions to the total number of questions, and it is a straightforward and easily interpretable measure of performance on MCQ tasks. Accuracy will be reported overall, and also separately for the two subsets Mult-IT-A and Mult-IT-C.</p><p>While accuracy is indeed a straightforward evaluation metric for this task, determining which answer the model has identified as correct is not necessarily as straightforward, for a couple of reasons.</p><p>As mentioned in Section 3.3, the position of the correct answer in the prompt is randomly distributed, reducing the likelihood of bias resulting from its placement, although the models might have a tendency to select the first answer more frequently.</p><p>A related issue is the fact that the model's output, in spite of the specific request in the prompt, might not always consist of just the letter corresponding to the chosen answer. In the case of longer outputs, simple regular expressions will be applied to extract the relevant letter.</p><p>In practice, as for all the CALAMITA challenges, the evaluation of the LLMs on Mult-IT will be carried out with the LM-evaluation-harness framework developed by EleutherAI<ref type="foot" target="#foot_9">11</ref>.</p></div>
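A possible shape for this letter extraction and accuracy computation; the exact regular expression is our assumption, not the pattern used in LM-evaluation-harness.

```python
import re

# One capital letter A-E standing alone as a word; the actual
# post-processing pattern is not specified in the text.
LETTER_RE = re.compile(r"\b([A-E])\b")

def extract_letter(output: str):
    """Pull the chosen option letter out of a possibly verbose model
    output, returning None when no letter is found."""
    match = LETTER_RE.search(output.strip())
    return match.group(1) if match else None

def accuracy(outputs, gold_letters):
    """Fraction of outputs whose extracted letter matches the gold one."""
    correct = sum(
        extract_letter(out) == gold
        for out, gold in zip(outputs, gold_letters)
    )
    return correct / len(gold_letters)
```

Outputs from which no letter can be extracted simply count as wrong, which matches the accuracy definition above.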
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>The vast majority of the data comes from sources linked with Italian public institutions, and can be considered official documents. For this reason, we expect high quality in the formulation of the quizzes and the correctness of the answers. Nonetheless, given the large amount of data, we cannot guarantee the absence of errors in individual questions. Human errors can happen, even in official selections, although they should be rare. This aspect can be improved by analysing the results obtained by the models on the benchmark: the more the benchmark is used, the more it will be possible to isolate, and eventually remove or correct, problematic quizzes with data analytics techniques. Moreover, considering that the quizzes span almost thirty years, it is possible that some of them, particularly the ones regarding laws, may be outdated. Nonetheless, thanks to the availability of metadata, it is possible to further refine this dataset to also account for specific historical knowledge about laws by providing metadata to the model. However, we believe that for the first run of this evaluation this point will not create particular issues, as we expect the potentially outdated questions to be limited in number.</p><p>Given that the data is public, it is possible that the original exams are already present in the models' training data, as they can easily be obtained on the Internet. At the same time, it is likely that some sources, for example complete laws of the Italian legislation, are present in the training data, but we consider this eventuality positive, given that one of the benchmark's aims is to evaluate the knowledge and the capacity of the model to adapt to the Italian landscape.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Data license and copyright issues</head><p>Information about license and copyright issues is mandatory.    </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Appendix A: Detailed Statistics per category (Multi-IT-C)</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of question and possible answers from Mult-IT-A for the topic 'sinonimi' (synonyms). The correct answer is (C).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example of question and possible answers from Mult-IT-C for the topic 'geografia' (geography). The correct answer is (B).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Data format used in Mult-IT-A.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Data format used in the quiz.jsonl file of Mult-IT-C.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Data format used in the metadata.jsonl file of Mult-IT-C.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Mult-IT-A: Topic distribution percentage-wise.</figDesc><graphic coords="7,89.29,84.19,416.69,361.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Distribution of number of items by token count for Mult-IT-A.</figDesc><graphic coords="8,89.29,480.73,203.36,152.52" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Quiz percentage distribution, taxonomy level 2 (top 15 categories, Mult-IT-C)</figDesc><graphic coords="9,89.29,84.19,416.69,250.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figures 10 and 11:</head><label>10, 11</label><figDesc>Figure 10: Quiz percentage distribution, taxonomy level 2 (all the categories, Mult-IT-C)</figDesc><graphic coords="16,89.29,183.34,416.69,361.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figures 12 and 13:</head><label>12, 13</label><figDesc>Figure 12: Distribution of the positions of the correct answers compared with a random distribution. The lower number of items at values 3 and 4 of the x-axis is expected, because only some questions have 4 or 5 possible choices, respectively.</figDesc><graphic coords="17,89.29,415.73,416.69,208.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="18,89.29,117.78,416.69,250.02" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Category #Quizzes Total #Tokens Avg Tokens/Quiz</head><label></label><figDesc></figDesc><table><row><cell>informatica</cell><cell>128</cell><cell>1997</cell><cell>15.602</cell></row><row><cell>sintassi</cell><cell>121</cell><cell>3617</cell><cell>29.893</cell></row><row><cell>grammatica</cell><cell>119</cell><cell>2429</cell><cell>20.412</cell></row><row><cell>completamento frasi</cell><cell>115</cell><cell>3034</cell><cell>26.383</cell></row><row><cell>geografia</cell><cell>114</cell><cell>1247</cell><cell>10.939</cell></row><row><cell>geometria</cell><cell>114</cell><cell>3541</cell><cell>31.061</cell></row><row><cell>ortografia</cell><cell>113</cell><cell>2273</cell><cell>20.115</cell></row><row><cell>biologia</cell><cell>105</cell><cell>4588</cell><cell>43.695</cell></row><row><cell>storia</cell><cell>100</cell><cell>1744</cell><cell>17.440</cell></row><row><cell>psicologia e sociologia del disadattamento</cell><cell>100</cell><cell>2890</cell><cell>28.900</cell></row><row><cell>elementi di criminologia</cell><cell>100</cell><cell>2386</cell><cell>23.860</cell></row><row><cell>pedagogia</cell><cell>100</cell><cell>2271</cell><cell>22.710</cell></row><row><cell>elementi di diritto costituzionale ed amministrativo</cell><cell>100</cell><cell>1813</cell><cell>18.130</cell></row><row><cell>sinonimi</cell><cell>98</cell><cell>1667</cell><cell>17.010</cell></row><row><cell>chimica</cell><cell>83</cell><cell>2983</cell><cell>35.940</cell></row><row><cell>fisica</cell><cell>43</cell><cell>2251</cell><cell>52.349</cell></row><row><cell>deduzione logica</cell><cell>39</cell><cell>1247</cell><cell>31.974</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Category #Quizzes Total #Tokens Avg Tokens/Quiz</head><label></label><figDesc></figDesc><table><row><cell>Altre</cell><cell>31,281</cell><cell>911,975</cell><cell>29.154</cell></row><row><cell>Medicina</cell><cell>28,376</cell><cell>822,541</cell><cell>28.987</cell></row><row><cell>Corpo Pubblico</cell><cell>25,208</cell><cell>604,007</cell><cell>23.961</cell></row><row><cell>Giurisprudenza</cell><cell>15,540</cell><cell>482,403</cell><cell>31.043</cell></row><row><cell>Competenza Linguistica</cell><cell>7,142</cell><cell>196,610</cell><cell>27.529</cell></row><row><cell>Cultura Generale</cell><cell>7,111</cell><cell>162,300</cell><cell>22.824</cell></row><row><cell>Informatica</cell><cell>4,391</cell><cell>80,869</cell><cell>18.417</cell></row><row><cell>Logica</cell><cell>3,374</cell><cell>130,258</cell><cell>38.606</cell></row><row><cell>Farmacia</cell><cell>3,336</cell><cell>65,898</cell><cell>19.754</cell></row><row><cell>Geografia</cell><cell>2,886</cell><cell>44,707</cell><cell>15.491</cell></row><row><cell>Storia</cell><cell>2,150</cell><cell>49,945</cell><cell>23.23</cell></row><row><cell>APES</cell><cell>2,139</cell><cell>70,868</cell><cell>33.131</cell></row><row><cell>Scienze Motorie</cell><cell>2,066</cell><cell>56,278</cell><cell>27.24</cell></row><row><cell>Matematica</cell><cell>1,931</cell><cell>52,858</cell><cell>27.373</cell></row><row><cell>Lingua</cell><cell>1,929</cell><cell>36,439</cell><cell>18.89</cell></row><row><cell>Pubblica Amministrazione</cell><cell>1,565</cell><cell>50,432</cell><cell>32.225</cell></row><row><cell>Educazione civica</cell><cell>1,464</cell><cell>29,510</cell><cell>20.157</cell></row><row><cell>Letteratura</cell><cell>853</cell><cell>17,811</cell><cell>20.88</cell></row><row><cell>Biochimica</cell><cell>786</cell><cell>12,346</cell><cell>15.707</cell></row><row><cell>Chimica</cell><cell>784</cell><cell>14,037</cell><cell>17.904</cell></row><row><cell>Istruzione</cell><cell>745</cell><cell>18,528</cell><cell>24.87</cell></row><row><cell>Architettura</cell><cell>382</cell><cell>23,280</cell><cell>60.942</cell></row><row><cell>Fisica</cell><cell>356</cell><cell>8,969</cell><cell>25.194</cell></row><row><cell>Biologia</cell><cell>336</cell><cell>10,512</cell><cell>31.286</cell></row><row><cell>Economia</cell><cell>309</cell><cell>10,210</cell><cell>33.042</cell></row><row><cell>Scienze</cell><cell>235</cell><cell>3,712</cell><cell>15.796</cell></row><row><cell>Biotecnologie</cell><cell>185</cell><cell>5,741</cell><cell>31.032</cell></row><row><cell>Scienze naturali</cell><cell>185</cell><cell>2,685</cell><cell>14.514</cell></row><row><cell>Arte</cell><cell>180</cell><cell>4,419</cell><cell>24.55</cell></row><row><cell>profilo psicoattitudinale</cell><cell>135</cell><cell>5,020</cell><cell>37.185</cell></row><row><cell>Scienze della Comunicazione</cell><cell>90</cell><cell>1,787</cell><cell>19.856</cell></row><row><cell>Cucina</cell><cell>75</cell><cell>1,130</cell><cell>15.067</cell></row><row><cell>Scienze dei Beni culturali</cell><cell>40</cell><cell>502</cell><cell>12.55</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Level 2 of the taxonomy for Mult-IT-C</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 :</head><label>3</label><figDesc>Level 1 of the taxonomy, Mult-IT-C</figDesc><table><row><cell>Category</cell><cell>Total</cell><cell>Quizzes</cell><cell>Total</cell><cell>Tokens Per-</cell><cell>Avg Token-</cell></row><row><cell></cell><cell>Quizzes</cell><cell>Percentage</cell><cell>Tokens</cell><cell>centage</cell><cell>Quiz</cell></row><row><cell>Operatore Socio Sanitario</cell><cell>12434</cell><cell>7.39%</cell><cell>272403</cell><cell>6.03%</cell><cell>21.908</cell></row><row><cell>Arma dei Carabinieri</cell><cell>8179</cell><cell>4.86%</cell><cell>156358</cell><cell>3.46%</cell><cell>19.117</cell></row><row><cell>Carabiniere</cell><cell>8094</cell><cell>4.81%</cell><cell>154435</cell><cell>3.42%</cell><cell>19.08</cell></row><row><cell>Istruttore Amministrativo</cell><cell>6188</cell><cell>3.68%</cell><cell>199101</cell><cell>4.41%</cell><cell>32.175</cell></row><row><cell>Diritto Amministrativo</cell><cell>6156</cell><cell>3.66%</cell><cell>219261</cell><cell>4.85%</cell><cell>35.617</cell></row><row><cell>Poliziotto Municipale</cell><cell>5834</cell><cell>3.47%</cell><cell>160459</cell><cell>3.55%</cell><cell>27.504</cell></row><row><cell>Infermiere</cell><cell>5531</cell><cell>3.29%</cell><cell>134725</cell><cell>2.98%</cell><cell>24.358</cell></row><row><cell>Informatica</cell><cell>4011</cell><cell>2.38%</cell><cell>74362</cell><cell>1.65%</cell><cell>18.54</cell></row><row><cell>Guardia di Finanza</cell><cell>4000</cell><cell>2.38%</cell><cell>116368</cell><cell>2.58%</cell><cell>29.092</cell></row><row><cell>Formez</cell><cell>3945</cell><cell>2.34%</cell><cell>77864</cell><cell>1.72%</cell><cell>19.737</cell></row><row><cell>Farmacia</cell><cell>3336</cell><cell>1.98%</cell><cell>65898</cell><cell>1.46%</cell><cell>19.754</cell></row><row><cell>Agente di Polizia 
Municipale</cell><cell>3232</cell><cell>1.92%</cell><cell>94988</cell><cell>2.1%</cell><cell>29.39</cell></row><row><cell>Assistente Amministrativo</cell><cell>3147</cell><cell>1.87%</cell><cell>82803</cell><cell>1.83%</cell><cell>26.312</cell></row><row><cell>cultura generale</cell><cell>3139</cell><cell>1.87%</cell><cell>71725</cell><cell>1.59%</cell><cell>22.85</cell></row><row><cell>Polizia Municipale</cell><cell>2615</cell><cell>1.55%</cell><cell>77641</cell><cell>1.72%</cell><cell>29.691</cell></row><row><cell>Medicina e chirurgia</cell><cell>2475</cell><cell>1.47%</cell><cell>135179</cell><cell>2.99%</cell><cell>54.618</cell></row><row><cell>Polizia di Stato</cell><cell>2441</cell><cell>1.45%</cell><cell>54487</cell><cell>1.21%</cell><cell>22.322</cell></row><row><cell>Istruttore Amministrativo Contabile</cell><cell>2296</cell><cell>1.36%</cell><cell>65925</cell><cell>1.46%</cell><cell>28.713</cell></row><row><cell>Professioni Sanitarie</cell><cell>2260</cell><cell>1.34%</cell><cell>80903</cell><cell>1.79%</cell><cell>35.798</cell></row><row><cell>Cultura generale : Prove Concorsuali</cell><cell>2199</cell><cell>1.31%</cell><cell>49629</cell><cell>1.1%</cell><cell>22.569</cell></row><row><cell>Assistente giudiziario</cell><cell>2123</cell><cell>1.26%</cell><cell>77906</cell><cell>1.72%</cell><cell>36.696</cell></row><row><cell>Scienze Motorie e Sportive</cell><cell>2054</cell><cell>1.22%</cell><cell>56111</cell><cell>1.24%</cell><cell>27.318</cell></row><row><cell>Medico</cell><cell>1957</cell><cell>1.16%</cell><cell>43704</cell><cell>0.97%</cell><cell>22.332</cell></row><row><cell>Matematica</cell><cell>1731</cell><cell>1.03%</cell><cell>48665</cell><cell>1.08%</cell><cell>28.114</cell></row><row><cell>Cultura generale : Eserciziario</cell><cell>1724</cell><cell>1.02%</cell><cell>39858</cell><cell>0.88%</cell><cell>23.119</cell></row><row><cell>Grammatica 
generale</cell><cell>1713</cell><cell>1.02%</cell><cell>36572</cell><cell>0.81%</cell><cell>21.35</cell></row><row><cell>Diritto Costituzionale</cell><cell>1649</cell><cell>0.98%</cell><cell>33879</cell><cell>0.75%</cell><cell>20.545</cell></row><row><cell>Scienze infermieristiche ed ostetriche</cell><cell>1570</cell><cell>0.93%</cell><cell>61315</cell><cell>1.36%</cell><cell>39.054</cell></row><row><cell>Educazione civica</cell><cell>1464</cell><cell>0.87%</cell><cell>29510</cell><cell>0.65%</cell><cell>20.157</cell></row><row><cell>Azienda Sanitaria Locale (ASL)</cell><cell>1456</cell><cell>0.87%</cell><cell>41252</cell><cell>0.91%</cell><cell>28.332</cell></row><row><cell>INPS</cell><cell>1440</cell><cell>0.86%</cell><cell>48416</cell><cell>1.07%</cell><cell>33.622</cell></row><row><cell>Collaboratore Amministrativo</cell><cell>1334</cell><cell>0.79%</cell><cell>39673</cell><cell>0.88%</cell><cell>29.74</cell></row><row><cell>Legislazione sanitaria</cell><cell>1300</cell><cell>0.77%</cell><cell>27976</cell><cell>0.62%</cell><cell>21.52</cell></row><row><cell>Diritto del Lavoro</cell><cell>1284</cell><cell>0.76%</cell><cell>55342</cell><cell>1.22%</cell><cell>43.101</cell></row><row><cell>Logica : Ragionamento logico</cell><cell>1280</cell><cell>0.76%</cell><cell>79782</cell><cell>1.77%</cell><cell>62.33</cell></row><row><cell>Inglese</cell><cell>1274</cell><cell>0.76%</cell><cell>24616</cell><cell>0.54%</cell><cell>19.322</cell></row><row><cell>Odontoiatria e protesi dentarie</cell><cell>1188</cell><cell>0.71%</cell><cell>62315</cell><cell>1.38%</cell><cell>52.454</cell></row><row><cell>successioni di numeri e lettere</cell><cell>1170</cell><cell>0.7%</cell><cell>20970</cell><cell>0.46%</cell><cell>17.923</cell></row><row><cell>Contabilità pubblica</cell><cell>1139</cell><cell>0.68%</cell><cell>49706</cell><cell>1.1%</cell><cell>43.64</cell></row><row><cell>Significato 
parole</cell><cell>1120</cell><cell>0.67%</cell><cell>19914</cell><cell>0.44%</cell><cell>17.78</cell></row><row><cell>Geografia</cell><cell>1115</cell><cell>0.66%</cell><cell>16775</cell><cell>0.37%</cell><cell>15.045</cell></row><row><cell>Istruttore direttivo amministrativo</cell><cell>1113</cell><cell>0.66%</cell><cell>30593</cell><cell>0.68%</cell><cell>27.487</cell></row><row><cell>Storia</cell><cell>1032</cell><cell>0.61%</cell><cell>22913</cell><cell>0.51%</cell><cell>22.203</cell></row><row><cell>Comprensione di testi</cell><cell>1030</cell><cell>0.61%</cell><cell>93831</cell><cell>2.08%</cell><cell>91.098</cell></row><row><cell>lingua italiana</cell><cell>972</cell><cell>0.58%</cell><cell>18226</cell><cell>0.4%</cell><cell>18.751</cell></row><row><cell>Attualità</cell><cell>960</cell><cell>0.57%</cell><cell>20477</cell><cell>0.45%</cell><cell>21.33</cell></row><row><cell>Geometra</cell><cell>948</cell><cell>0.56%</cell><cell>32225</cell><cell>0.71%</cell><cell>33.993</cell></row><row><cell>Poliziotto di stato (Agente)</cell><cell>930</cell><cell>0.55%</cell><cell>20902</cell><cell>0.46%</cell><cell>22.475</cell></row></table><note>Figure 9: Quiz percentage distribution, taxonomy level 1 (top 30 categories, Mult-IT-C)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Category Total Quizzes Quizzes Percentage Total Tokens Tokens Percentage Avg Token-Quiz</head><label></label><figDesc></figDesc><table><row><cell>Altre Scienze e Tecniche</cell><cell>41006</cell><cell>29.03%</cell><cell>1092538</cell><cell>28.54%</cell><cell>26.643</cell></row><row><cell>Società e Diritto</cell><cell>39812</cell><cell>28.19%</cell><cell>1054559</cell><cell>27.55%</cell><cell>26.488</cell></row><row><cell>Altro</cell><cell>33615</cell><cell>23.8%</cell><cell>988679</cell><cell>25.83%</cell><cell>29.412</cell></row><row><cell>Cultura</cell><cell>9888</cell><cell>7.0%</cell><cell>223605</cell><cell>5.84%</cell><cell>22.614</cell></row><row><cell>Lingua</cell><cell>9071</cell><cell>6.42%</cell><cell>233049</cell><cell>6.09%</cell><cell>25.692</cell></row><row><cell>Matematica e Logica</cell><cell>5288</cell><cell>3.74%</cell><cell>182652</cell><cell>4.77%</cell><cell>34.541</cell></row><row><cell>Scienze MMFFNN</cell><cell>2559</cell><cell>1.81%</cell><cell>53028</cell><cell>1.39%</cell><cell>20.722</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 4</head><label>4</label><figDesc>Level 3 of the taxonomy, Mult-IT-C</figDesc><table /></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Appendix B: Distribution of position of correct answer (Mult-IT-C)</head></div>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://commoncrawl.github.io/cc-crawl-statistics/plots/ languages.html, accessed on 18/09/24</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://www.alphatest.it/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://www.concorsipubblici.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://github.com/openai/tiktoken</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://github.com/explosion/spaCy</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://github.com/explosion/spacy-models/releases/tag/it_core_ news_lg-3.7.0</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">See footnote 5</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">See footnote 6</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">See footnote 7</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_9">https://github.com/EleutherAI/lm-evaluation-harness</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The authors would like to thank Alpha Test (https://www.alphatest.it), and in particular Martha Fabbri, for their interest in the Mult-IT CALAMITA challenge and for the extremely valuable exchange of ideas and data, which allowed us to shape a task with high potential impact also in the field of educational training and assessment.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>GitHub: https://github.com/mrinaldi97 (M. Rinaldi); https://github.com/Jj-source (J. Gili); https://github.com/rosakun (M. Francis); https://github.com/vivpatti (V. Patti); https://github.com/malvinanissim (M. Nissim). ORCID: 0009-0004-7488-8855 (M. Rinaldi); 0009-0007-1343-3760 (J. Gili); 0009-0007-7638-9963 (M. Francis); 0000-0001-5991-370X (V. Patti); 0000-0001-5289-0971 (M. Nissim).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kleyjo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions on Machine Learning Research</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Benchmarking large language models on cmexam-a comprehensive chinese medical exam dataset</title>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>You</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">TruthfulQA: Measuring how models mimic human falsehoods</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Evans</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.229</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.229" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3214" to="3252" />
		</imprint>
	</monogr>
	<note>: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">PINTO: Faithful language reasoning using prompt-generated rationales</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ilievski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Trustworthy and Socially Responsible Machine Learning</title>
				<meeting><address><addrLine>NeurIPS</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Measuring massive multitask language understanding</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chandra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arulraj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.01574</idno>
		<title level="m">MMLU-Pro: A more robust and challenging multi-task language understanding benchmark</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lopyrev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1606.05250" />
		<title level="m">SQuAD: 100,000+ questions for machine comprehension of text</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Neural learning for question answering in Italian</title>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zelenanska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Basili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AI*IA 2018 - Advances in Artificial Intelligence</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Ghidini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Passerini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Traverso</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="389" to="402" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Melero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Del Pozo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Conde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Reviriego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayor-Rocher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grandury</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.17789</idno>
		<title level="m">Spanish and LLM benchmarks: is MMLU lost in translation?</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning</title>
		<author>
			<persName><forename type="first">V</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pouran Ben Veyseh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Man</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dernoncourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2023</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="13171" to="13189" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-12-06">December 4-6, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Large language models are not fair evaluators</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sui</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.17926</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Singhal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gottweis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sayres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wulczyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pfohl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cole-Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Neal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.09617</idno>
		<ptr target="https://arxiv.org/abs/2305.09617" />
		<title level="m">Towards expert-level medical question answering with large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
