<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Mult-IT: Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Rinaldi</surname></persName>
							<email>matteo.rinaldi@unito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Turin</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jacopo</forename><surname>Gili</surname></persName>
							<email>jacopo.gili584@edu.unito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Turin</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maria</forename><surname>Francis</surname></persName>
							<email>maria.francis287@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="department">CLCG</orgName>
								<orgName type="institution">University of Groningen</orgName>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">University of Trento</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mattia</forename><surname>Goffetti</surname></persName>
							<email>mattia_goffetti@alphatest.it</email>
							<affiliation key="aff3">
								<orgName type="institution">Alpha Test S.r.l.</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
							<email>viviana.patti@unito.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Turin</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Malvina</forename><surname>Nissim</surname></persName>
							<email>m.nissim@rug.nl</email>
							<affiliation key="aff1">
								<orgName type="department">CLCG</orgName>
								<orgName type="institution">University of Groningen</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff4">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Mult-IT: Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4BBACDDAF4C2C893624EE36697B8C058</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>CALAMITA Challenge</term>
					<term>Italian</term>
					<term>Benchmarking</term>
					<term>Multiple-Choice Questions</term>
					<term>LLMs</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Multi-choice question answering (MCQA) is a powerful tool for evaluating the factual knowledge and reasoning capacities of Large Language Models (LLMs). However, there is a lack of large-scale MCQA datasets originally written in Italian. Existing Italian MCQA benchmarks are often automatically translated from English, an approach with two key drawbacks: firstly, automatic translations may sound unnatural, contain errors, or use linguistic constructions that do not align with the target language; secondly, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. To address this gap, we present Mult-IT, an MCQA dataset comprising over 110,000 manually written questions across a wide range of topics. All questions are sourced directly from preparation quizzes for Italian university entrance exams, or for exams for public sector employment in Italy. We hope that this contribution enables a more comprehensive evaluation of LLMs' proficiency, not only in the Italian language, but also in their grasp of Italian cultural and contextual knowledge.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Challenge: Introduction and Motivation</head><p>In recent years, multi-choice question answering (MCQA) has established itself as a powerful method to test the factual knowledge and reasoning abilities embedded in large language models (LLMs) as a byproduct of the language modelling objective <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. The evaluation of MCQAs can be easily automated, offering a significant advantage over other benchmarking formats such as open-ended text responses. In addition, with appropriately targeted prompting, the limited number of admissible outputs makes model answers easy to parse. However, existing Italian MCQA benchmarks are typically automatic translations from English, which raises two issues. The first concerns language quality: automatic translations may sound unnatural, disrupt the coherence of discourse, and encourage the presence of linguistic constructions that reflect the source language rather than the target one <ref type="bibr" target="#b8">[9]</ref>. The second issue relates to culture and societal norms: text translated from English to Italian will lack topical biases, preferences, conventions, and ways of expressing ideas that are unique to Italian culture. Thus, while the text may be expressed in the Italian language, its content and underlying norms will continue to represent an Anglo-centric, predominantly American perspective.</p><p>More generally, training data for LLMs is biased towards English content, and as a result, there is often a gap between English and non-English performance <ref type="bibr" target="#b9">[10]</ref>. 
For example, the Common Crawl dataset <ref type="foot" target="#foot_0">2</ref> , often used as a base for more refined datasets to be employed in the pre-training of LLMs, is composed of 45% English content, while the data for languages such as Spanish, French, Italian, and Chinese are all below 5% each, with the only exceptions being Russian (6.2%) and German (5.1%).</p><p>Creating a large-scale multi-choice question answering benchmark using original Italian data will make it possible to investigate the Italian abilities of LLMs in a more natural and transparent way, possibly also leading to a better understanding of how to make multilingual models better at Italian. It will also serve as a core benchmark for assessing the performance of monolingual Italian LLMs. If similar datasets are collected natively for other languages, Mult-IT can be part of a larger MCQA benchmark which is multilingual in the truest sense.</p><p>Mult-IT, presented at CALAMITA <ref type="bibr" target="#b10">[11]</ref>, is the first massive Multi-Choice-Question-Answering dataset specifically designed for the Italian language which draws on Italian culture and Italian-focused knowledge. By providing a comprehensive, culturally relevant benchmark for the Italian language, we aim to set a precedent for the development of similar resources in other languages and cultures, ultimately contributing to a more diverse and inclusive AI landscape.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Challenge: Description</head><p>This challenge involves a multiple-choice question answering task. The model is prompted with a simple instruction (see Box 1), followed by a question and a set of three to five possible answers, depending on the source and topic of the question. Among these answers, only one is correct, and the others are distractors. The model is expected to identify the correct answer and return the letter corresponding to the option deemed correct.</p><p>All questions in the benchmark have been manually crafted for the purpose of training or testing students, job applicants, or learners across a range of topics, including general knowledge and more specialised subjects. These questions make up the Mult-IT dataset: Multiple Choice Questions on Various Topics in Italian, which we are introducing in this contribution. The details of the dataset are described in Section 3. The defining feature of the Mult-IT challenge is that all of the MCQs are natively Italian, both in language and in content. While this is an advantage for gaining a better understanding of model behaviour on Italian data, we do expect a decline in model performance. Considering that even models trained on multilingual data have a heavy bias towards English and American-centric culture, it is expected that the correctness of the answers may be affected by a cultural (and possibly language) gap. Should the set of models tested also include Italian monolingual or bilingual English-Italian models trained on a substantial amount of Italian text, this benchmark will make it possible to underscore differences in performance possibly associated with the language specificity of such models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head><p>Mult-IT contains quizzes designed to assess candidates' knowledge in open competitive exams, whether for admission to national universities or for positions in Italian institutions. This approach offers several advantages. First of all, these public competitions encompass very general topics such as language comprehension, basic history, and common knowledge, but also more specialised ones, focusing on specific laws needed for certain professions or the security measures required for jobs such as police officer or firefighter. Our benchmark therefore contains questions ranging from a low level of difficulty to a very high and specific one, setting high standards for the performance of the models, and it may also be useful for assessing specific knowledge valuable for the adoption of models in Public Administration scenarios. The inclusion of profession-specific questions in Mult-IT tests the ability of LLMs to apply their knowledge in practical, real-world scenarios, a feature that could prove particularly valuable in assessing the potential of AI systems to support specialised fields and decision-making processes in professional and administrative contexts in the Italian landscape. Moreover, the quizzes contained in the dataset also present challenges regarding reasoning, such as logical thinking and mathematical reasoning, as well as quizzes specifically designed to assess knowledge and mastery of the Italian language, for example, text comprehension or detailed understanding of grammatical phenomena.</p><p>Mult-IT consists of two core subsets, which are divided by the origin of the data. Both subsets are made of quizzes that test knowledge of general Italian culture and that are used in public recruitment processes for government-based positions. They are described in more detail below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-A</head><p>Mult-IT-A is a collection of MCQs provided by Alpha Test. It contains a total of 1,692 questions in Italian, spanning 17 categories which correspond to topics featured in entrance exams for Italian universities (see Table <ref type="table" target="#tab_1">1</ref> below for details) or in question answering tests employed in public competitions. The quizzes falling into the categories of law, pedagogy, psychology and criminology originate from public competitions. For each question, four or five possible answers are provided, out of which only one is correct. An example from the topic 'sinonimi' (synonyms) is shown in Figure <ref type="figure" target="#fig_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Indolente e' un sinonimo di:</head><p>(A) tenero, (B) doloroso, (C) pigro, (D) insensibile</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-C</head><p>Mult-IT-C is a large collection of MCQs, organised in groups of questions ("quizzes") around multiple topics, which we have obtained from publicly accessible online platforms through data gathering and web scraping. The quizzes are meant to be used by people who need to prepare for job applications in the public sector. One of the most interesting features of Mult-IT-C is its size: it contains more than 100,000 questions, making it almost six times larger than MMLU. An example from the topic 'geografia' (geography) is shown in Figure <ref type="figure" target="#fig_1">2</ref>.</p><p>In quale nazione si trova il Lago Balaton? </p><formula xml:id="formula_0">(A) Ucraina, (B) Ungheria, (C) Romania, (D) Repubblica Ceca (E) Bulgaria</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Origin of data</head><p>Mult-IT-A All the materials of Mult-IT-A were obtained thanks to the generosity of Alpha Test<ref type="foot" target="#foot_1">3</ref> . Alpha Test S.r.l. is an Italian publishing house and educational training company, founded in Milan in 1987, that specialises in study aid materials and courses for high school, university, professional tests, exams and certifications. Alpha Test is the main reference for high school students preparing for university admission. Each year, Alpha Test gathers new data from the entrance exams of public and private universities and military schools, mainly in the form of multiple choice questions. The publishing house enhances such materials with comments and explanations, and creates variations or completely new versions of the original quizzes. All the materials in Mult-IT-A have been sourced from original, public data, and represent a varied sample of quizzes about general culture, STEM, and juridical disciplines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-C</head><p>All the materials of Mult-IT-C were obtained through a web-scraping process from the website "Concorsi Pubblici"<ref type="foot" target="#foot_2">4</ref> via customised Python scripts. While there exist many websites collecting public competition exams, we found Concorsi Pubblici to be the most complete. Because the same public competition can be listed on several platforms, gathering all the data from a single website avoided the risk of data duplication. The quizzes on Concorsi Pubblici are organised by topic (see Appendix A), and were extracted for the time interval 1997-2024.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data format</head><p>Overall, the data format is consistent across the two Mult-IT subsets, which allows for a single evaluation procedure on Mult-IT. The larger size of the Mult-IT-C dataset allowed us to include additional information, including details about the quiz's administration, presented in the form of quiz blocks and a multi-level topic taxonomy. This feature is absent in Mult-IT-A, as its questions are collected by subject without being grouped into quizzes.</p><p>Data fields common to all of Mult-IT are:</p><p>• origin: Either 'C' or 'A', to indicate whether the question belongs to Mult-IT-A or Mult-IT-C.</p><p>• question: The question text.</p><p>• choices: The list of possible answers.</p><p>• answer: The array index corresponding to the correct answer in the choices array.</p><p>These common fields are crucial to the evaluation task, but each subset includes additional fields.</p></div>
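As a rough illustration of this shared schema, a minimal loader might validate the common fields while reading a JSONL file. This is a sketch: the helper name and path handling are ours, not part of the released data.

```python
import json
from typing import List

def load_multit(path: str) -> List[dict]:
    """Read a Mult-IT JSONL file, checking the common fields.

    Field names follow the shared schema of Mult-IT-A and Mult-IT-C;
    the helper name is illustrative.
    """
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            assert item["origin"] in ("A", "C")       # subset marker
            assert isinstance(item["choices"], list)  # list of options
            # answer is an index into the choices array
            assert 0 <= item["answer"] < len(item["choices"])
            items.append(item)
    return items
```

Because the two subsets share these fields, the same loader (and hence the same evaluation loop) works for both files.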
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-A</head><p>Examples taken from Mult-IT-A are given in Figure <ref type="figure" target="#fig_2">3</ref>.</p><p>{ "origin": "A", "topic": "informatica", "question": "Le dimensioni del monitor si misurano in:", "choices": [ "megahertz", "pixel", "centimetri", "pollici" ], "answer": 3 }, { "origin": "A", "topic": "psicologia e sociologia del disadattamento", "question": "Come viene definito lo stimolo funzionale a provocare un cambiamento?", "choices": [ "Stress", "Output", "Input", "Matrice" ], "answer": 0 }</p><p>The only additional field is topic, indicating the topic of the question. Its distribution, together with statistics on token and character counts, is shown in Figure <ref type="figure" target="#fig_5">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-C</head><p>The dataset consists of two files: quiz.jsonl contains the actual questions, while metadata.jsonl contains additional information about the questions. An example from the quiz.jsonl file is given in Figure <ref type="figure" target="#fig_3">4</ref>.</p><p>Data fields unique to this subset are:</p><p>• quiz_id: The ID of the quiz to which the question pertains.</p><p>• question_id: The unique identifier of the question inside the quiz. In combination with quiz_id, it forms the unique identifier of each question and can be used to retrieve the metadata of the question from the metadata.jsonl file.</p><p>Data fields of the metadata are:</p><p>• id: The unique identifier of the quiz.</p><p>• title: Title of the quiz, sourced from the original website.</p><p>• tags: List of word tags.</p><p>The metadata also include source, difficulty, and the three taxonomy levels class_lev1, class_lev2, and class_lev3 (see Figure 5).</p><p>{ "quiz_id": 2250, "question_id": 20, "question": "La Costituzione riconosce allo Stato una potesta' legislativa esclusiva in materia di:\n\n", "choices": [ "organizzazione della rete scolastica", "norme generali sull'istruzione", "ricerca scientifica e tecnologica", "istruzione professionale" ], "answer": 1 } { "quiz_id": 1253, "question_id": 63, "question": "In un ingranaggio con piu' ruote dentate, una ruota denominata R1 ha 25 denti e fa muovere una seconda ruota denominata R2 da 50 denti, che a sua volta fa muovere una terza ruota R3 da 150 denti. Se la ruota dentata R3 fa un giro e mezzo, quanti ne fa la ruota dentata R1?\n\n", "choices": [ "3", "5", "6", "9", "12" ], "answer": 4 }</p><p>An example of an item from the metadata file is given in Figure 5.</p></div>
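Given this two-file layout, quiz items can be joined with their quiz-level metadata. The sketch below assumes, as the examples in Figures 4 and 5 suggest, that the id field of metadata.jsonl matches quiz_id in quiz.jsonl; the file paths and helper name are illustrative.

```python
import json
from typing import Dict, List

def attach_metadata(quiz_path: str, meta_path: str) -> List[dict]:
    """Attach quiz-level metadata to each Mult-IT-C question.

    Assumption: metadata.jsonl is keyed by `id`, matching `quiz_id`
    in quiz.jsonl. (quiz_id, question_id) identifies a question;
    quiz_id alone retrieves the shared metadata record.
    """
    meta: Dict[int, dict] = {}
    with open(meta_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            meta[record["id"]] = record
    joined = []
    with open(quiz_path, encoding="utf-8") as f:
        for line in f:
            q = json.loads(line)
            q["metadata"] = meta.get(q["quiz_id"], {})
            joined.append(q)
    return joined
```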
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Zero-shot prompting</head><p>We evaluate our models in a zero-shot setting, thereby imitating the conditions of a real use-case scenario. The prompt we chose is designed to encourage the model to output only the letter corresponding to the answer. The original prompt, together with its English translation, is presented in Box 1.</p><p>{ "id": 2250, "title": "Area 3 Giuridico Amministrativa Finanziaria -25 domande concorso dirigente scolastico Miur", "tags": [ "concorsi dirigenti scolastici", "concorso dirigente scolastico", "dirigente scolastico concorso dirigente scolastico 2017", "miur", "miur concorso", "miur concorso dirigente scolastico miur concorsi scuola", "concorso scuola", "bando concorso scuola", "bandi miur" ], "class_lev1": [ "Miur", "dirigente scolastico" ], "source": [ "Fgl Cgil, Miur" ], "difficulty": [ "medio" ], "class_lev2": [ "Istruzione", "Altre" ], "class_lev3": [ "Societa' e Diritto", "Altro" ] } </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prompt for the LLM</head><p>Di seguito è riportata una domanda a scelta multipla e varie possibili risposte, ciascuna indicata da una lettera. Scegli la risposta che meglio risponde alla domanda, e riporta in output soltanto la lettera corrispondente a quella risposta, senza spiegazioni.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Below is a multi-choice question together with possible answers, each indicated by a letter.</head><p>Choose the best answer for the question, and report as output only the letter corresponding to that answer, without any explanation.</p><p>Box 1: Zero-shot prompt and English translation.</p><p>We decided to write the prompt in Italian in order to better represent a multilingual scenario. The prompt does not contain any information about the subject of the question or any other informative cues. In this way, our benchmark not only tests the model on question answering, but also indirectly tests its instruction-following abilities in a language different from English.</p><p>Previous work on evaluating the performance of LLMs on MCQ datasets has identified two aspects which can interfere with the model's answers and therefore its accuracy. One has to do with the order of the possible answers: Wang et al. <ref type="bibr" target="#b11">[12]</ref> show that the first presented option tends to be preferred in the model's answer, making it important to take the order of possible answers into account. The other has to do with the formulation of the prompt (and even of the question): Singhal et al. <ref type="bibr" target="#b12">[13]</ref> experiment with multiple types of prompts and show that prompt formulation affects the model's output.</p><p>Because the position of the correct answer in the original data was already randomly distributed, which we verified with a supplementary analysis (see Appendix B), performing a random permutation of the possible answers was not necessary.</p></div>
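A minimal sketch of how one such prompt could be assembled: the instruction text is reproduced verbatim from Box 1, while the exact layout of the lettered options is our assumption.

```python
import string

# Instruction text reproduced verbatim from Box 1.
INSTRUCTION = (
    "Di seguito è riportata una domanda a scelta multipla e varie possibili "
    "risposte, ciascuna indicata da una lettera. Scegli la risposta che "
    "meglio risponde alla domanda, e riporta in output soltanto la lettera "
    "corrispondente a quella risposta, senza spiegazioni."
)

def build_prompt(question: str, choices: list) -> str:
    """Format one Mult-IT item as a zero-shot prompt: instruction,
    question, then options labelled (A), (B), ... The option layout
    is an assumption, not the released prompt template."""
    options = "\n".join(
        f"({letter}) {choice}"
        for letter, choice in zip(string.ascii_uppercase, choices)
    )
    return f"{INSTRUCTION}\n\n{question}\n{options}"
```

Since the items have three to five choices, the letters range over (A)-(E) at most.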
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Data statistics</head><p>Mult-IT-A The Mult-IT-A dataset is composed of 1,692 questions, spanning 17 topics, all centered around knowledge required for entrance exams at Italian universities. The topics, and some additional information on the dataset composition, are provided in Table <ref type="table" target="#tab_1">1</ref>.</p><p>On average, questions are 83.76 characters long and contain 25 tokens when counted with the tiktoken cl100k_base<ref type="foot" target="#foot_3">5</ref> tokenizer, or 16.5 when counted with the spaCy<ref type="foot" target="#foot_4">6</ref> library using the it_core_news_lg model<ref type="foot" target="#foot_5">7</ref>.</p><p>Further statistics about quiz distribution and answer position are available in Appendix ?? and C.</p><p>It is worth noting that permuting the order of the answers would be recommended to avoid any kind of imbalance, as Mult-IT-A shows an uneven correct-answer distribution, inherited from the source data, that leans heavily towards the first choice.</p></div>
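A permutation of this kind can be sketched as follows; the field names follow Section 3.2, and the helper itself is illustrative, not part of the released pipeline.

```python
import random

def permute_choices(item: dict, rng: random.Random) -> dict:
    """Randomly permute the choices of one question, updating the
    `answer` index accordingly, so that the position of the correct
    option no longer reflects the first-choice skew of the source data.
    A sketch; field names follow Section 3.2."""
    order = list(range(len(item["choices"])))
    rng.shuffle(order)
    return {
        **item,
        "choices": [item["choices"][i] for i in order],
        # the correct option moves to wherever its old index landed
        "answer": order.index(item["answer"]),
    }
```

Passing a seeded random.Random keeps the permutation reproducible across evaluation runs.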
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mult-IT-C</head><p>The Mult-IT-C dataset is composed of 108,773 questions divided into 4,129 quizzes.</p><p>To avoid confusion, we use unambiguous names for the items: a "quiz" is defined as a set of multiple "questions". Quizzes come from real-world exams, so each is provided with a specific name and a categorisation originating from the original data source.</p><p>Table 1: Mult-IT-A: topics included in the dataset, number of questions per topic, total tokens per topic, and average question length per topic in terms of tokens. The topic "pedagogia" in the Table is short for "pedagogia con particolare riferimento agli interventi relativi all'osservazione e al trattamento dei detenuti e degli internati"; the topic "elementi di diritto costituzionale ed amministrativo" is short for "elementi di diritto costituzionale ed amministrativo con particolare riferimento al rapporto di pubblico impiego".</p><p>A quiz can contain a variable number of questions. The average number of questions per quiz is 26, and the maximum is 250. There are 1,623 quizzes with more than 25 items, 298 with more than 50, and only 22 with more than 100 items. The original categorisation made by the authors of the website "Concorsi Pubblici" was problematic for our purposes: some categories were near-duplicates of each other, differing only in a few words. Moreover, we believed that 186 categories were too many for a meaningful visualisation and management of the data. For this reason, we created a three-level hierarchy in which the first (bottom) level corresponds to the original categorisation of the data, the second level groups the categories into 36 areas, and the third, most abstract level has only 7 categories. The drawback of this approach, as can be seen in the tables and graphs contained in Appendix A, is that in both supplementary categorisations a significant number of quizzes fall into the category "Other".</p><p>Nonetheless, we believe that this abstract categorisation is useful for a general look at the data composition, and thus at the performance of the models, in terms of macro-areas. On the other hand, keeping the original, very detailed categorisation in the data allows for a more in-depth analysis of model performance on specific aspects. In Appendix A, all the statistics of the 186 categories are listed in the form of a table. To appreciate the level of specificity reached by the first level of categorisation, it is interesting to notice categories such as "Verbs", "Diphthongs", or "Word Meanings", which refer to specific language abilities. These categories are grouped at level 2 as "Linguistic Competence" and at level 3 as "Language". As another example, there are categories that refer to specific aspects of the Italian Public Administration: at level 1 we find fields such as "INPS", the National Institute for Social Security, or "ASL", the Local Health Authority.</p><p>We believe that having such a precise categorisation at our disposal is of great help in understanding the abilities and weaknesses of models in very specific aspects, thus being useful, on the one hand, for assessing the possibility of directly employing models in the Italian public administration and, on the other, for improving the scientific understanding of models and how they deal with different kinds of challenges. This last aspect can also be helpful for interpretability studies of LLMs.</p><p>On average, questions are 104 characters long and contain 27.5 tokens when counted with the tiktoken cl100k_base<ref type="foot" target="#foot_6">8</ref> tokenizer, or 19.8 when counted with the spaCy library<ref type="foot" target="#foot_7">9</ref> using the it_core_news_lg model<ref type="foot" target="#foot_8">10</ref>. The longest question is 1,363 tokens long. </p></div>
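Length statistics of this kind can be reproduced in outline. Note that the figures reported in the paper use the tiktoken cl100k_base and spaCy it_core_news_lg tokenizers; the dependency-free sketch below falls back on character and whitespace-token counts, so its numbers will differ from the reported ones.

```python
def question_stats(questions):
    """Average character length and token count of a list of question
    strings. Whitespace splitting is a stand-in for the tiktoken and
    spaCy tokenizers used for the reported statistics."""
    n = len(questions)
    avg_chars = sum(len(q) for q in questions) / n
    avg_tokens = sum(len(q.split()) for q in questions) / n
    return avg_chars, avg_tokens
```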
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation</head><p>We will use accuracy to evaluate the LLMs' performance on Mult-IT. Accuracy is defined as the ratio of correctly answered questions to the total number of questions, and it is a straightforward and easily interpretable measure of performance on MCQ tasks. Accuracy will be reported overall, and also separately for the two subsets Mult-IT-A and Mult-IT-C.</p><p>While accuracy is indeed a straightforward evaluation metric for this task, determining which answer the model has identified as correct is not necessarily as straightforward, for a couple of reasons.</p><p>As mentioned in Section 3.3, the position of the correct answer in the prompt is randomly distributed, reducing the likelihood of bias resulting from its placement, although the models might have a tendency to select the first answer more frequently.</p><p>A related issue is the fact that the model's output, in spite of the specific request in the prompt, might not always consist of just the letter corresponding to the chosen answer. In the case of longer outputs, simple regular expressions will be applied to extract the relevant letter.</p><p>In practice, as for all the CALAMITA challenges, the evaluation of the LLMs on Mult-IT will be carried out with the LM-evaluation-harness framework developed by EleutherAI<ref type="foot" target="#foot_9">11</ref>.</p></div>
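A possible shape for this letter extraction and accuracy computation; the exact regular expression is our assumption, not the pattern used in LM-evaluation-harness.

```python
import re

# One capital letter A-E standing alone as a word; the actual
# post-processing pattern is not specified in the text.
LETTER_RE = re.compile(r"\b([A-E])\b")

def extract_letter(output: str):
    """Pull the chosen option letter out of a possibly verbose model
    output, returning None when no letter is found."""
    match = LETTER_RE.search(output.strip())
    return match.group(1) if match else None

def accuracy(outputs, gold_letters):
    """Fraction of outputs whose extracted letter matches the gold one."""
    correct = sum(
        extract_letter(out) == gold
        for out, gold in zip(outputs, gold_letters)
    )
    return correct / len(gold_letters)
```

Outputs from which no letter can be extracted simply count as wrong, which matches the accuracy definition above.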
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>The vast majority of the data comes from sources linked with Italian public institutions, and can be considered official documents. For this reason, we expect high quality in the formulation of the quizzes and the correctness of the answers. Nonetheless, given the large amount of data, we cannot guarantee the absence of errors in individual questions. Human errors can happen, even in official selections, although they should be rare. This aspect can be improved by analysing the results obtained by the models on the benchmark: the more the benchmark is used, the more it will be possible to isolate, and eventually remove or correct, problematic quizzes with data analytics techniques. Moreover, considering that the quizzes span almost thirty years, it is possible that some of them, particularly the ones regarding laws, may be outdated. Nonetheless, thanks to the availability of metadata, it is possible to further refine this dataset to also account for specific historical knowledge about laws by providing metadata to the model. However, we believe that for the first run of this evaluation this point will not create particular issues, as we expect the potentially outdated questions to be limited in number.</p><p>Given that the data is public, it is possible that the original exams are already present in the models' training data, as they can easily be obtained on the Internet. At the same time, it is likely that some sources, for example complete laws of the Italian legislation, are present in the training data, but we consider this eventuality positive, given that one of the benchmark's aims is to evaluate the knowledge and the capacity of the model to adapt to the Italian landscape.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Data license and copyright issues</head><p>Information about license and copyright issues is mandatory.    </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Appendix A: Detailed Statistics per category (Multi-IT-C)</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of question and possible answers from Mult-IT-A for the topic 'sinonimi' (synonyms). The correct answer is (C).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example of question and possible answers from Mult-IT-C for the topic 'geografia' (geography). The correct answer is (B).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Data format used in Mult-IT-A.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Data format used in the quiz.jsonl file of Mult-IT-C.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Data format used in the metadata.jsonl file of Mult-IT-C.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Mult-IT-A: Topic distribution percentage-wise.</figDesc><graphic coords="7,89.29,84.19,416.69,361.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Distribution of number of items by token count for Mult-IT-A.</figDesc><graphic coords="8,89.29,480.73,203.36,152.52" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Quiz percentage distribution, taxonomy level 2 (top 15 categories, Mult-IT-C)</figDesc><graphic coords="9,89.29,84.19,416.69,250.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figures 10 and 11:</head><label>10, 11</label><figDesc>Figure 10: Quiz percentage distribution, taxonomy level 2 (all the categories, Mult-IT-C)</figDesc><graphic coords="16,89.29,183.34,416.69,361.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figures 12 and 13:</head><label>12, 13</label><figDesc>Figure 12: Distribution of the positions of the correct answers compared with a random distribution. The lower number of items at values 3 and 4 of the x-axis is expected, because only some questions have 4 or 5 possible choices, respectively.</figDesc><graphic coords="17,89.29,415.73,416.69,208.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="18,89.29,117.78,416.69,250.02" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Category #Quizzes Total #Tokens Avg Tokens/Quiz</head><label></label><figDesc></figDesc><table><row><cell>informatica</cell><cell>128</cell><cell>1997</cell><cell>15.602</cell></row><row><cell>sintassi</cell><cell>121</cell><cell>3617</cell><cell>29.893</cell></row><row><cell>grammatica</cell><cell>119</cell><cell>2429</cell><cell>20.412</cell></row><row><cell>completamento frasi</cell><cell>115</cell><cell>3034</cell><cell>26.383</cell></row><row><cell>geografia</cell><cell>114</cell><cell>1247</cell><cell>10.939</cell></row><row><cell>geometria</cell><cell>114</cell><cell>3541</cell><cell>31.061</cell></row><row><cell>ortografia</cell><cell>113</cell><cell>2273</cell><cell>20.115</cell></row><row><cell>biologia</cell><cell>105</cell><cell>4588</cell><cell>43.695</cell></row><row><cell>storia</cell><cell>100</cell><cell>1744</cell><cell>17.440</cell></row><row><cell>psicologia e sociologia del disadattamento</cell><cell>100</cell><cell>2890</cell><cell>28.900</cell></row><row><cell>elementi di criminologia</cell><cell>100</cell><cell>2386</cell><cell>23.860</cell></row><row><cell>pedagogia</cell><cell>100</cell><cell>2271</cell><cell>22.710</cell></row><row><cell>elementi di diritto costituzionale ed amministrativo</cell><cell>100</cell><cell>1813</cell><cell>18.130</cell></row><row><cell>sinonimi</cell><cell>98</cell><cell>1667</cell><cell>17.010</cell></row><row><cell>chimica</cell><cell>83</cell><cell>2983</cell><cell>35.940</cell></row><row><cell>fisica</cell><cell>43</cell><cell>2251</cell><cell>52.349</cell></row><row><cell>deduzione logica</cell><cell>39</cell><cell>1247</cell><cell>31.974</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Category #Quizzes Total #Tokens Avg Tokens/Quiz</head><label></label><figDesc></figDesc><table><row><cell>Altre</cell><cell>31,281</cell><cell>911,975</cell><cell>29.154</cell></row><row><cell>Medicina</cell><cell>28,376</cell><cell>822,541</cell><cell>28.987</cell></row><row><cell>Corpo Pubblico</cell><cell>25,208</cell><cell>604,007</cell><cell>23.961</cell></row><row><cell>Giurisprudenza</cell><cell>15,540</cell><cell>482,403</cell><cell>31.043</cell></row><row><cell>Competenza Linguistica</cell><cell>7,142</cell><cell>196,610</cell><cell>27.529</cell></row><row><cell>Cultura Generale</cell><cell>7,111</cell><cell>162,300</cell><cell>22.824</cell></row><row><cell>Informatica</cell><cell>4,391</cell><cell>80,869</cell><cell>18.417</cell></row><row><cell>Logica</cell><cell>3,374</cell><cell>130,258</cell><cell>38.606</cell></row><row><cell>Farmacia</cell><cell>3,336</cell><cell>65,898</cell><cell>19.754</cell></row><row><cell>Geografia</cell><cell>2,886</cell><cell>44,707</cell><cell>15.491</cell></row><row><cell>Storia</cell><cell>2,150</cell><cell>49,945</cell><cell>23.23</cell></row><row><cell>APES</cell><cell>2,139</cell><cell>70,868</cell><cell>33.131</cell></row><row><cell>Scienze Motorie</cell><cell>2,066</cell><cell>56,278</cell><cell>27.24</cell></row><row><cell>Matematica</cell><cell>1,931</cell><cell>52,858</cell><cell>27.373</cell></row><row><cell>Lingua</cell><cell>1,929</cell><cell>36,439</cell><cell>18.89</cell></row><row><cell>Pubblica Amministrazione</cell><cell>1,565</cell><cell>50,432</cell><cell>32.225</cell></row><row><cell>Educazione civica</cell><cell>1,464</cell><cell>29,510</cell><cell>20.157</cell></row><row><cell>Letteratura</cell><cell>853</cell><cell>17,811</cell><cell>20.88</cell></row><row><cell>Biochimica</cell><cell>786</cell><cell>12,346</cell><cell>15.707</cell></row><row><cell>Chimica</cell><cell>784</cell><cell>14,037</cell><cell>17.904</cell></row><row><cell>Istruzione</cell><cell>745</cell><cell>18,528</cell><cell>24.87</cell></row><row><cell>Architettura</cell><cell>382</cell><cell>23,280</cell><cell>60.942</cell></row><row><cell>Fisica</cell><cell>356</cell><cell>8,969</cell><cell>25.194</cell></row><row><cell>Biologia</cell><cell>336</cell><cell>10,512</cell><cell>31.286</cell></row><row><cell>Economia</cell><cell>309</cell><cell>10,210</cell><cell>33.042</cell></row><row><cell>Scienze</cell><cell>235</cell><cell>3,712</cell><cell>15.796</cell></row><row><cell>Biotecnologie</cell><cell>185</cell><cell>5,741</cell><cell>31.032</cell></row><row><cell>Scienze naturali</cell><cell>185</cell><cell>2,685</cell><cell>14.514</cell></row><row><cell>Arte</cell><cell>180</cell><cell>4,419</cell><cell>24.55</cell></row><row><cell>profilo psicoattitudinale</cell><cell>135</cell><cell>5,020</cell><cell>37.185</cell></row><row><cell>Scienze della Comunicazione</cell><cell>90</cell><cell>1,787</cell><cell>19.856</cell></row><row><cell>Cucina</cell><cell>75</cell><cell>1,130</cell><cell>15.067</cell></row><row><cell>Scienze dei Beni culturali</cell><cell>40</cell><cell>502</cell><cell>12.55</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Level 2 of the taxonomy for Mult-IT-C</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 :</head><label>3</label><figDesc>Level 1 of the taxonomy, Mult-IT-C</figDesc><table><row><cell>Category</cell><cell>Total</cell><cell>Quizzes</cell><cell>Total</cell><cell>Tokens Per-</cell><cell>Avg Token-</cell></row><row><cell></cell><cell>Quizzes</cell><cell>Percentage</cell><cell>Tokens</cell><cell>centage</cell><cell>Quiz</cell></row><row><cell>Operatore Socio Sanitario</cell><cell>12434</cell><cell>7.39%</cell><cell>272403</cell><cell>6.03%</cell><cell>21.908</cell></row><row><cell>Arma dei Carabinieri</cell><cell>8179</cell><cell>4.86%</cell><cell>156358</cell><cell>3.46%</cell><cell>19.117</cell></row><row><cell>Carabiniere</cell><cell>8094</cell><cell>4.81%</cell><cell>154435</cell><cell>3.42%</cell><cell>19.08</cell></row><row><cell>Istruttore Amministrativo</cell><cell>6188</cell><cell>3.68%</cell><cell>199101</cell><cell>4.41%</cell><cell>32.175</cell></row><row><cell>Diritto Amministrativo</cell><cell>6156</cell><cell>3.66%</cell><cell>219261</cell><cell>4.85%</cell><cell>35.617</cell></row><row><cell>Poliziotto Municipale</cell><cell>5834</cell><cell>3.47%</cell><cell>160459</cell><cell>3.55%</cell><cell>27.504</cell></row><row><cell>Infermiere</cell><cell>5531</cell><cell>3.29%</cell><cell>134725</cell><cell>2.98%</cell><cell>24.358</cell></row><row><cell>Informatica</cell><cell>4011</cell><cell>2.38%</cell><cell>74362</cell><cell>1.65%</cell><cell>18.54</cell></row><row><cell>Guardia di Finanza</cell><cell>4000</cell><cell>2.38%</cell><cell>116368</cell><cell>2.58%</cell><cell>29.092</cell></row><row><cell>Formez</cell><cell>3945</cell><cell>2.34%</cell><cell>77864</cell><cell>1.72%</cell><cell>19.737</cell></row><row><cell>Farmacia</cell><cell>3336</cell><cell>1.98%</cell><cell>65898</cell><cell>1.46%</cell><cell>19.754</cell></row><row><cell>Agente di Polizia 
Municipale</cell><cell>3232</cell><cell>1.92%</cell><cell>94988</cell><cell>2.1%</cell><cell>29.39</cell></row><row><cell>Assistente Amministrativo</cell><cell>3147</cell><cell>1.87%</cell><cell>82803</cell><cell>1.83%</cell><cell>26.312</cell></row><row><cell>cultura generale</cell><cell>3139</cell><cell>1.87%</cell><cell>71725</cell><cell>1.59%</cell><cell>22.85</cell></row><row><cell>Polizia Municipale</cell><cell>2615</cell><cell>1.55%</cell><cell>77641</cell><cell>1.72%</cell><cell>29.691</cell></row><row><cell>Medicina e chirurgia</cell><cell>2475</cell><cell>1.47%</cell><cell>135179</cell><cell>2.99%</cell><cell>54.618</cell></row><row><cell>Polizia di Stato</cell><cell>2441</cell><cell>1.45%</cell><cell>54487</cell><cell>1.21%</cell><cell>22.322</cell></row><row><cell>Istruttore Amministrativo Contabile</cell><cell>2296</cell><cell>1.36%</cell><cell>65925</cell><cell>1.46%</cell><cell>28.713</cell></row><row><cell>Professioni Sanitarie</cell><cell>2260</cell><cell>1.34%</cell><cell>80903</cell><cell>1.79%</cell><cell>35.798</cell></row><row><cell>Cultura generale : Prove Concorsuali</cell><cell>2199</cell><cell>1.31%</cell><cell>49629</cell><cell>1.1%</cell><cell>22.569</cell></row><row><cell>Assistente giudiziario</cell><cell>2123</cell><cell>1.26%</cell><cell>77906</cell><cell>1.72%</cell><cell>36.696</cell></row><row><cell>Scienze Motorie e Sportive</cell><cell>2054</cell><cell>1.22%</cell><cell>56111</cell><cell>1.24%</cell><cell>27.318</cell></row><row><cell>Medico</cell><cell>1957</cell><cell>1.16%</cell><cell>43704</cell><cell>0.97%</cell><cell>22.332</cell></row><row><cell>Matematica</cell><cell>1731</cell><cell>1.03%</cell><cell>48665</cell><cell>1.08%</cell><cell>28.114</cell></row><row><cell>Cultura generale : Eserciziario</cell><cell>1724</cell><cell>1.02%</cell><cell>39858</cell><cell>0.88%</cell><cell>23.119</cell></row><row><cell>Grammatica 
generale</cell><cell>1713</cell><cell>1.02%</cell><cell>36572</cell><cell>0.81%</cell><cell>21.35</cell></row><row><cell>Diritto Costituzionale</cell><cell>1649</cell><cell>0.98%</cell><cell>33879</cell><cell>0.75%</cell><cell>20.545</cell></row><row><cell>Scienze infermieristiche ed ostetriche</cell><cell>1570</cell><cell>0.93%</cell><cell>61315</cell><cell>1.36%</cell><cell>39.054</cell></row><row><cell>Educazione civica</cell><cell>1464</cell><cell>0.87%</cell><cell>29510</cell><cell>0.65%</cell><cell>20.157</cell></row><row><cell>Azienda Sanitaria Locale (ASL)</cell><cell>1456</cell><cell>0.87%</cell><cell>41252</cell><cell>0.91%</cell><cell>28.332</cell></row><row><cell>INPS</cell><cell>1440</cell><cell>0.86%</cell><cell>48416</cell><cell>1.07%</cell><cell>33.622</cell></row><row><cell>Collaboratore Amministrativo</cell><cell>1334</cell><cell>0.79%</cell><cell>39673</cell><cell>0.88%</cell><cell>29.74</cell></row><row><cell>Legislazione sanitaria</cell><cell>1300</cell><cell>0.77%</cell><cell>27976</cell><cell>0.62%</cell><cell>21.52</cell></row><row><cell>Diritto del Lavoro</cell><cell>1284</cell><cell>0.76%</cell><cell>55342</cell><cell>1.22%</cell><cell>43.101</cell></row><row><cell>Logica : Ragionamento logico</cell><cell>1280</cell><cell>0.76%</cell><cell>79782</cell><cell>1.77%</cell><cell>62.33</cell></row><row><cell>Inglese</cell><cell>1274</cell><cell>0.76%</cell><cell>24616</cell><cell>0.54%</cell><cell>19.322</cell></row><row><cell>Odontoiatria e protesi dentarie</cell><cell>1188</cell><cell>0.71%</cell><cell>62315</cell><cell>1.38%</cell><cell>52.454</cell></row><row><cell>successioni di numeri e lettere</cell><cell>1170</cell><cell>0.7%</cell><cell>20970</cell><cell>0.46%</cell><cell>17.923</cell></row><row><cell>Contabilità pubblica</cell><cell>1139</cell><cell>0.68%</cell><cell>49706</cell><cell>1.1%</cell><cell>43.64</cell></row><row><cell>Significato 
parole</cell><cell>1120</cell><cell>0.67%</cell><cell>19914</cell><cell>0.44%</cell><cell>17.78</cell></row><row><cell>Geografia</cell><cell>1115</cell><cell>0.66%</cell><cell>16775</cell><cell>0.37%</cell><cell>15.045</cell></row><row><cell>Istruttore direttivo amministrativo</cell><cell>1113</cell><cell>0.66%</cell><cell>30593</cell><cell>0.68%</cell><cell>27.487</cell></row><row><cell>Storia</cell><cell>1032</cell><cell>0.61%</cell><cell>22913</cell><cell>0.51%</cell><cell>22.203</cell></row><row><cell>Comprensione di testi</cell><cell>1030</cell><cell>0.61%</cell><cell>93831</cell><cell>2.08%</cell><cell>91.098</cell></row><row><cell>lingua italiana</cell><cell>972</cell><cell>0.58%</cell><cell>18226</cell><cell>0.4%</cell><cell>18.751</cell></row><row><cell>Attualità</cell><cell>960</cell><cell>0.57%</cell><cell>20477</cell><cell>0.45%</cell><cell>21.33</cell></row><row><cell>Geometra</cell><cell>948</cell><cell>0.56%</cell><cell>32225</cell><cell>0.71%</cell><cell>33.993</cell></row><row><cell>Poliziotto di stato (Agente)</cell><cell>930</cell><cell>0.55%</cell><cell>20902</cell><cell>0.46%</cell><cell>22.475</cell></row></table><note>Figure 9: Quiz percentage distribution, taxonomy level 1 (top 30 categories, Mult-IT-C)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Category Total Quizzes Quizzes Percentage Total Tokens Tokens Percentage Avg Token-Quiz</head><label></label><figDesc></figDesc><table><row><cell>Altre Scienze e Tecniche</cell><cell>41006</cell><cell>29.03%</cell><cell>1092538</cell><cell>28.54%</cell><cell>26.643</cell></row><row><cell>Società e Diritto</cell><cell>39812</cell><cell>28.19%</cell><cell>1054559</cell><cell>27.55%</cell><cell>26.488</cell></row><row><cell>Altro</cell><cell>33615</cell><cell>23.8%</cell><cell>988679</cell><cell>25.83%</cell><cell>29.412</cell></row><row><cell>Cultura</cell><cell>9888</cell><cell>7.0%</cell><cell>223605</cell><cell>5.84%</cell><cell>22.614</cell></row><row><cell>Lingua</cell><cell>9071</cell><cell>6.42%</cell><cell>233049</cell><cell>6.09%</cell><cell>25.692</cell></row><row><cell>Matematica e Logica</cell><cell>5288</cell><cell>3.74%</cell><cell>182652</cell><cell>4.77%</cell><cell>34.541</cell></row><row><cell>Scienze MMFFNN</cell><cell>2559</cell><cell>1.81%</cell><cell>53028</cell><cell>1.39%</cell><cell>20.722</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 4</head><label>4</label><figDesc>Level 3 of the taxonomy, Mult-IT-C</figDesc><table /></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Appendix B: Distribution of position of correct answer (Mult-IT-C)</head></div>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://commoncrawl.github.io/cc-crawl-statistics/plots/ languages.html, accessed on 18/09/24</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://www.alphatest.it/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://www.concorsipubblici.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://github.com/openai/tiktoken</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://github.com/explosion/spaCy</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://github.com/explosion/spacy-models/releases/tag/it_core_ news_lg-3.7.0</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">See footnote 5</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">See footnote 6</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">See footnote 7</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_9">https://github.com/EleutherAI/lm-evaluation-harness</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The authors would like to thank Alpha Test (https://www.alphatest.it), and in particular Martha Fabbri, for their interest in the Mult-IT CALAMITA challenge and for the extremely valuable exchange of ideas and data, which allowed us to shape a task with high potential impact also in the field of educational training and assessment.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>GitHub: https://github.com/mrinaldi97 (M. Rinaldi); https://github.com/Jj-source (J. Gili); https://github.com/rosakun (M. Francis); https://github.com/vivpatti (V. Patti); https://github.com/malvinanissim (M. Nissim). ORCID: 0009-0004-7488-8855 (M. Rinaldi); 0009-0007-1343-3760 (J. Gili); 0009-0007-7638-9963 (M. Francis); 0000-0001-5991-370X (V. Patti); 0000-0001-5289-0971 (M. Nissim).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kleyjo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions on Machine Learning Research</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Benchmarking large language models on cmexam-a comprehensive chinese medical exam dataset</title>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>You</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">TruthfulQA: Measuring how models mimic human falsehoods</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Evans</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.229</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.229" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3214" to="3252" />
		</imprint>
	</monogr>
	<note>: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">PINTO: Faithful language reasoning using prompt-generated rationales</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ilievski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Trustworthy and Socially Responsible Machine Learning</title>
				<meeting><address><addrLine>NeurIPS</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Measuring massive multitask language understanding</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chandra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arulraj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.01574</idno>
		<title level="m">MMLU-Pro: A more robust and challenging multi-task language understanding benchmark</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lopyrev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1606.05250" />
		<title level="m">SQuAD: 100,000+ questions for machine comprehension of text</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Neural learning for question answering in Italian</title>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zelenanska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Basili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AI*IA 2018 - Advances in Artificial Intelligence</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Ghidini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Passerini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Traverso</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="389" to="402" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Melero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Del Pozo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Conde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Reviriego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayor-Rocher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grandury</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.17789</idno>
		<title level="m">Spanish and LLM benchmarks: is MMLU lost in translation?</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning</title>
		<author>
			<persName><forename type="first">V</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pouran Ben Veyseh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Man</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dernoncourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2023</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="13171" to="13189" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-12-06">December 4-6, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Large language models are not fair evaluators</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sui</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.17926</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Singhal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gottweis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sayres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wulczyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pfohl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cole-Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Neal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.09617</idno>
		<ptr target="https://arxiv.org/abs/2305.09617" />
		<title level="m">Towards expert-level medical question answering with large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
