<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Mult-IT Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Rinaldi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo Gili</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Francis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Mattia Goffetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malvina Nissim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Alpha Test S.r.l.</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CLCG, University of Groningen</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Trento</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Turin</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>Multiple-choice question answering (MCQA) is a powerful tool for evaluating the factual knowledge and reasoning capacities of Large Language Models (LLMs). However, there is a lack of large-scale MCQA datasets originally written in Italian. Existing Italian MCQA benchmarks are often automatically translated from English, an approach with two key drawbacks: firstly, automatic translations may sound unnatural, contain errors, or use linguistic constructions that do not align with the target language; secondly, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. To address this gap, we present Mult-IT, an MCQA dataset comprising over 110,000 manually written questions across a wide range of topics. All questions are sourced directly from preparation quizzes for Italian university entrance exams, or from exams for public sector employment in Italy. We are hopeful that this contribution enables a more comprehensive evaluation of LLMs' proficiency, not only in the Italian language, but also in their grasp of Italian cultural and contextual knowledge.</p>
      </abstract>
      <kwd-group>
<kwd>CALAMITA Challenge</kwd>
        <kwd>Italian</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Multiple-Choice Questions</kwd>
        <kwd>LLMs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Challenge: Introduction and Motivation</title>
      <p>Automatic translations may sound unnatural, contain errors, disrupt the coherence of discourse, and encourage the presence of linguistic constructions that reflect the source language rather than the target one [9]. The second issue relates to culture and societal norms: text translated from English to Italian will lack topical biases, preferences, conventions, and ways of expressing ideas that are unique to Italian culture. Thus, while the text may be expressed in the Italian language, its content and underlying norms will continue to represent an Anglo-centric, predominantly American perspective.</p>
      <p>More generally, training data for LLMs is biased towards English content, and as a result, there is often a gap between English and non-English performance [10]. For example, the Common Crawl dataset (https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html, accessed on 18/09/24), often used as a base for more refined datasets to be employed in the pre-training of LLMs, is composed of 45% English content, while the data for languages such as Spanish, French, Italian, and Chinese are all below 5% each, with the only exceptions being Russian (6.2%) and German (5.1%).</p>
      <p>Creating a large-scale multiple-choice question answering benchmark using original Italian data will make it possible to investigate the Italian abilities of LLMs in a more natural and transparent way, possibly also leading to a better understanding of how to make multilingual models better at Italian. It will also serve as a core benchmark for assessing the performance of monolingual Italian LLMs. If similar datasets are collected natively for other languages, Mult-IT can be part of a larger MCQA benchmark which is multilingual in the truest sense.</p>
      <p>Mult-IT, presented at CALAMITA [11], is the first
massive Multi-Choice-Question-Answering dataset
specifically designed for the Italian language which draws on
Italian culture and Italian-focused knowledge. By
providing a comprehensive, culturally relevant benchmark for
the Italian language, we aim to set a precedent for the
development of similar resources in other languages and
cultures, ultimately contributing to a more diverse and
inclusive AI landscape.</p>
<p>2. Challenge: Description</p>
      <p>This challenge involves a multiple-choice question answering task. The model is prompted with a simple instruction (see Box 1), followed by a question and a set of three to five possible answers, depending on the source and topic of the question. Among these answers, only one is correct, and the others are distractors. The model is expected to identify the correct answer and return the letter corresponding to the option deemed correct.</p>
      <p>All questions in the benchmark have been manually crafted for the purpose of training or testing students, job applicants, or learners across a range of topics, including general knowledge and more specialised subjects. These questions make up the Mult-IT dataset: Multiple Choice Questions on Various Topics in Italian, which we are introducing in this contribution. The details of the dataset are described in Section 3. The defining feature of the Mult-IT challenge is that all of the MCQs are natively Italian, both in language and in content. While this is an advantage to gain a better understanding of model behaviour on Italian data, we do expect a decline in model performance. Considering that even models which have been trained on multilingual data have a heavy bias towards English and American-centric culture, it is expected that the correctness of the answers may be affected by a cultural (and possibly language) gap. Should the battery of the models tested also include Italian monolingual or bilingual English-Italian models trained on a substantial amount of Italian text, this benchmark will make it possible to underscore differences in performance possibly associated with the language specificity of such models.</p>
      <p>3. Data description</p>
      <p>Mult-IT contains quizzes designed to assess candidates' knowledge in open competitive exams, whether for admission to national universities or for positions in Italian institutions. This approach offers several advantages.</p>
      <p>First of all, these public competitions encompass very general topics such as language comprehension, basic history, and common knowledge, but also more specialised ones, focusing on specific laws needed for certain professions or the security measures required for jobs such as policemen or firefighters. Our benchmark, therefore, contains questions that range from a low level of difficulty to a very high and specific level, setting high standards for the performance of the models, and it may also be useful to assess specific knowledge valuable for the adoption of models in Public Administration scenarios. The inclusion of profession-specific questions in Mult-IT tests the ability of LLMs to apply their knowledge in practical, real-world scenarios, a feature that could prove particularly valuable in assessing the potential of AI systems to support specialised fields and decision-making processes in professional and administrative contexts in the Italian landscape. Moreover, the quizzes contained in the dataset also present challenges regarding reasoning, such as logical thinking and mathematical reasoning, as well as quizzes specifically designed to assess knowledge and mastery of the Italian language, for example, text comprehension or detailed understanding of grammatical phenomena.</p>
<p>Mult-IT consists of two core subsets, which are divided by the origin of the data. Both subsets are made of quizzes that test knowledge of general Italian culture and that are used in the public recruitment processes for government-based positions. They are described in more detail below.</p>
<p>Mult-IT-A All the materials of Mult-IT-A were obtained thanks to the generosity of Alpha Test (https://www.alphatest.it/). Alpha Test S.r.l. is an Italian publishing house and educational training company, founded in Milan in 1987, that specialises in study aid materials and courses for high school, university, professional tests, exams and certifications. Alpha Test is the main reference for high school students preparing for university admission. Each year, Alpha Test gathers new data from the entrance exams of public and private universities and military schools, mainly in the form of multiple choice questions. The publishing house enhances such materials with comments and explanations, and creates variations or completely new versions of the original quizzes. All the materials in Mult-IT-A have been sourced from original, public data, and represent a varied sample of quizzes about general culture, STEM, and juridical disciplines.</p>
      <p>Mult-IT-A Mult-IT-A is a collection of MCQs provided by Alpha Test. It contains a total of 1,692 questions in Italian, spanning over 17 categories which correspond to topics featuring in entry exams for Italian universities (see Table 1 below for details) or question answering tests employed in public competitions. The quizzes in the dataset falling into the categories of law, pedagogy, psychology and criminology originate from public competitions. For each question, four or five possible answers are provided, out of which only one is correct. An example from the topic 'sinonimi' (synonyms) is shown in Figure 1.</p>
      <p>Figure 1: Indolente e' un sinonimo di: (A) tenero, (B) doloroso, (C) pigro, (D) insensibile. (English: "Indolente" is a synonym of: (A) tender, (B) painful, (C) lazy, (D) insensitive.)</p>
<p>Both subsets share the following data fields:
• origin: It can be either 'C' or 'A', to discern whether the question belongs to Mult-IT-A or Mult-IT-C.
• question: The question.
• choices: The list of possible answers.
• answer: The array index corresponding to the correct answer in the choices array.</p>
      <p>These common fields are crucial to the evaluation task, but each of the sub-portions has additional information added.</p>
    </sec>
    <sec id="sec-2">
<title>3.2. Data format</title>
<p>Example question: In quale nazione si trova il Lago Balaton? (A) Ucraina, (B) Ungheria, (C) Romania, (D) Repubblica Ceca, (E) Bulgaria. (English: In which country is Lake Balaton located? (A) Ukraine, (B) Hungary, (C) Romania, (D) Czech Republic, (E) Bulgaria.)</p>
      <p>Mult-IT-C The dataset consists of two files: quiz.jsonl contains the actual questions, while metadata.jsonl contains additional information about the questions. An example from the quiz.jsonl file is given in Figure 4.</p>
<p>Data fields unique to this subportion are:
• quiz_id: The ID of the quiz to which the question pertains.
• question_id: The unique identifier of the question inside the quiz. In combination with the quiz_id, it forms the unique identifier of each question and can be used to retrieve the metadata of the question from the metadata.jsonl file.</p>
<p>Data fields of the metadata are:
• id: The unique identifier of the question.
• title: Title of the quiz sourced from the original website.
• tags: List of word tags.</p>
      <p>Figure 4: Two example records from the quiz.jsonl file.</p>
      <preformat>
{
  "quiz_id": 2250,
  "question_id": 20,
  "question": "La Costituzione riconosce allo Stato una potesta' legislativa esclusiva in materia di:\n\n",
  "choices": [
    "organizzazione della rete scolastica",
    "norme generali sull'istruzione",
    "ricerca scientifica e tecnologica",
    "istruzione professionale"
  ],
  "answer": 1
}
{
  "quiz_id": 1253,
  "question_id": 63,
  "question": "In un ingranaggio con piu' ruote dentate, una ruota denominata R1 ha 25 denti e fa muovere una seconda ruota denominata R2 da 50 denti, che a sua volta fa muovere una terza ruota R3 da 150 denti. Se la ruota dentata R3 fa un giro e mezzo, quanti ne fa la ruota dentata R1?\n\n",
  "choices": ["3", "5", "6", "9", "12"],
  "answer": 4
}
      </preformat>
      <p>(English gloss: the first question asks in which matters the Constitution grants the State exclusive legislative power; the second asks how many turns the 25-tooth gear R1 makes when, through the 50-tooth gear R2, it drives the 150-tooth gear R3 and R3 makes one and a half turns.)</p>
      <sec id="sec-2-1">
        <title>3.3. Zero-shot prompting</title>
        <p>We evaluate our models in a zero-shot setting, thereby
imitating the conditions of a real use-case scenario. The
prompt we chose is designed to encourage the model to
output only the letter corresponding to the answer. The
original prompt, together with its English translation, is
presented in Box 1.</p>
        <p>Box 1: Zero-shot prompt and English translation.</p>
<p>We decided to write the prompt in Italian in order to better represent a multilingual scenario. The prompt does not contain any information about the subject of the question or any other informative cues. In this way, our benchmark not only tests the model in question answering, but also indirectly tests the instruction-following abilities of the model in a language different from English.</p>
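<p>For illustration, a prompt of the shape described above can be assembled as in the sketch below; the instruction string is a placeholder, as the actual wording is the one given in Box 1.</p>
        <preformat>
# The instruction below is a placeholder; the challenge uses the exact
# Italian instruction shown in Box 1, which may differ from this wording.
INSTRUCTION = "Rispondi solo con la lettera dell'opzione corretta."

def build_prompt(question, choices):
    """Assemble instruction, question, and lettered options into one prompt."""
    letters = "ABCDE"
    options = "\n".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return f"{INSTRUCTION}\n\n{question.strip()}\n{options}"
        </preformat>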
<p>Previous work on evaluating the performance of LLMs on MCQ datasets has identified two aspects which can interfere with the model's answers and therefore with accuracy. One has to do with the order of the possible answers: Wang et al. [12] show that the first presented option tends to be preferred in the model's answer, making it quite important to take the order of possible answers into account. The other has to do with the formulation of the prompt (and even of the question): Singhal et al. [13] experiment with multiple types of prompts and show that prompt formulation affects the model's output.</p>
        <p>Because the position of the correct answer in the
original data was already randomly distributed, which we
verified with a supplementary analysis on the data (see
Appendix B), performing a random permutation of the
possible answers was not necessary.</p>
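<p>A verification of this kind amounts to counting where the gold answer sits across the dataset; a minimal sketch:</p>
        <preformat>
from collections import Counter

def answer_position_distribution(records):
    """Share of questions whose gold answer sits at each position index."""
    counts = Counter(r["answer"] for r in records)
    total = sum(counts.values())
    return {pos: n / total for pos, n in sorted(counts.items())}
        </preformat>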
      </sec>
      <sec id="sec-2-2">
        <title>3.4. Data statistics</title>
        <p>Mult-IT-A The Mult-IT-A dataset is composed of 1,692
questions, spanning over 17 topics, all centered around
knowledge required for entry exams at Italian
Universities. The topics, and some additional information on the
dataset composition, are provided in Table 1.</p>
<p>On average, questions are 83.76 characters long and contain 25 tokens counted with the tiktoken (https://github.com/openai/tiktoken) cl100k_base tokenizer, or 16.5 if counted with the spaCy (https://github.com/explosion/spaCy) library using the it_core_news_lg model (https://github.com/explosion/spacy-models/releases/tag/it_core_news_lg-3.7.0).</p>
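<p>Both token counts can be reproduced with a few lines of Python, assuming tiktoken and the spaCy Italian model are installed:</p>
        <preformat>
import tiktoken  # pip install tiktoken
import spacy     # pip install spacy; python -m spacy download it_core_news_lg

enc = tiktoken.get_encoding("cl100k_base")
nlp = spacy.load("it_core_news_lg")

def token_counts(text):
    """Return (tiktoken cl100k_base count, spaCy token count) for one question."""
    return len(enc.encode(text)), len(nlp(text))
        </preformat>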
<p>Further statistics about quiz distribution and answer position are available in Appendices B and C.</p>
<p>It is worth noting that permuting the order of the answers would be recommended to avoid any kind of imbalance, as Mult-IT-A shows an uneven correct answer distribution leaning heavily on the first choice, inherited from the source data.</p>
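<p>One way to remove such an inherited bias is to permute each question's options and remap the gold index accordingly; a sketch, seeded per question for reproducibility:</p>
        <preformat>
import random

def shuffle_choices(record, seed=0):
    """Copy a record with permuted choices and a remapped gold index."""
    rng = random.Random(f"{seed}-{record['question']}")  # reproducible per question
    order = list(range(len(record["choices"])))
    rng.shuffle(order)
    shuffled = [record["choices"][i] for i in order]
    new_answer = order.index(record["answer"])  # new position of the old gold answer
    return {**record, "choices": shuffled, "answer": new_answer}
        </preformat>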
<p>Mult-IT-C The Mult-IT-C dataset is composed of 108,773 questions divided into 4,129 quizzes.</p>
        <p>To avoid confusion, we decided to give unequivocal names to the items of the subsets. A "quiz" is defined as a set of multiple "questions". Quizzes come from real-world examples, so they are provided with a specific name and a categorisation originating from the original data source.</p>
      </sec>
    </sec>
    <sec id="sec-3">
<title>Table 1: Composition of Mult-IT-A by topic</title>
      <p>Topic | #Questions | #Tokens | Avg Tokens/Question
informatica | 128 | 1997 | 15.6
sintassi | 121 | 3617 | 29.9
grammatica | 119 | 2429 | 20.4
completamento frasi | 115 | 3034 | 26.4
geografia | 114 | 1247 | 10.9
geometria | 114 | 3541 | 31.1
ortografia | 113 | 2273 | 20.1
biologia | 105 | 4588 | 43.7
storia | 100 | 1744 | 17.4
psicologia e sociologia del disadattamento | 100 | 2890 | 28.9
elementi di criminologia | 100 | 2386 | 23.9
pedagogia | 100 | 2271 | 22.7
elementi di diritto costituzionale ed amministrativo | 100 | 1813 | 18.1
sinonimi | 98 | 1667 | 17.0
chimica | 83 | 2983 | 35.9
fisica | 43 | 2251 | 52.3
deduzione logica | 39 | 1247 | 32.0
Total | 1692 | 41978 | 24.8</p>
      <p>A quiz can contain a variable number of questions. The average number of questions per single quiz is 26, and the maximum is 250. There are 1623 quizzes with more than 25 items, 298 with more than 50, and only 22 with more than 100 items.</p>
      <p>The original categorisation made by the authors of the website "Concorsi Pubblici" (https://www.concorsipubblici.com/) was problematic for our purposes: some categories were near-duplicates of each other, containing only slightly different words. Moreover, we believed that 186 categories were too many for a meaningful visualisation and management of the data. For this reason, we created a hierarchy of three levels, in which the first (bottom) level corresponds to the original categorisation of the data, the second level groups the categorisation into 36 areas, and the third level, the most abstract, has only 7 categories. The drawback of this approach, as can be seen in the tables and graphs contained in Appendix A, is that in both the supplementary categorisations a significant amount of quizzes falls into the category "Other".</p>
      <p>Nonetheless, we believe that this abstract categorisation can be good for having a general look at the data composition, and thus at the performance of the models in terms of macro-areas. On the other hand, keeping the original, very detailed categorisation in the data allows for more in-depth analysis of model performance in specific aspects. In Appendix A, all the statistics of the 186 categories are listed in the form of a table. To appreciate the level of specificity reached by the first level of categorisation, it is interesting to notice, as examples, categories such as "Verbs", "Diphthongs", or "Word Meanings", referring to specific language abilities. These categories are then grouped in level 2 as "Linguistic Competence" and in level 3 as "Language". As another example, we can see categories that refer to specific aspects of the Italian Public Administration: level 1 contains fields such as "INPS", that is, the National Institute for Social Security, or "ASL", that is, the Local Health Authority.</p>
      <p>We believe that having such a precise categorisation at our disposal is of great help in understanding the abilities and weaknesses of models in very specific aspects, thus being helpful, on the one hand, for assessing the possibility of direct practical employment of models in the Italian public administration and, on the other hand, for improving the scientific understanding of models and how they deal with different kinds of challenges. This last aspect can also be helpful for interpretability studies of LLMs.</p>
      <p>On average, questions are 104 characters long; they contain 27.5 tokens counted with the tiktoken cl100k_base tokenizer, or 19.8 if counted with the spaCy library using the it_core_news_lg model. The longest question is 1363 tokens long.</p>
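<p>With such a hierarchy, per-macro-area scores reduce to a dictionary lookup before aggregation. The sketch below uses only the mapping examples named in the text (with the Italian level-2 and level-3 labels from the category list); the full mapping covers all 186 level-1 categories.</p>
      <preformat>
from collections import defaultdict

# Illustrative fragment of the hierarchy; examples named in the text only.
LEVEL2 = {
    "Verbs": "Competenza Linguistica",
    "Diphthongs": "Competenza Linguistica",
    "Word Meanings": "Competenza Linguistica",
}
LEVEL3 = {"Competenza Linguistica": "Lingua"}

def accuracy_by_area(results, level2=LEVEL2):
    """Aggregate (level-1 category, correct?) pairs into level-2 macro-areas."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, correct in results:
        area = level2.get(category, "Other")  # unmapped categories fall into "Other"
        totals[area] += 1
        hits[area] += int(correct)
    return {area: hits[area] / totals[area] for area in totals}
      </preformat>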
      <sec id="sec-3-1">
        <title>4. Evaluation</title>
        <p>We will use accuracy to evaluate the LLMs’ performance
on Mult-IT. Accuracy is defined as the ratio of correctly
answered questions to the total number of questions, and
it is a straightforward and easily interpretable measure
of performance on MCQ tasks. Accuracy will be reported
overall, and also separately for the two subsets Mult-IT-A
and Mult-IT-C.</p>
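<p>The metric itself is simple to compute; below is a sketch that also reports the two subsets separately, assuming each record carries the origin field described in Section 3:</p>
        <preformat>
def accuracy(preds, golds):
    """Correctly answered questions divided by total questions."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_report(records, preds):
    """Overall accuracy plus Mult-IT-A / Mult-IT-C splits via `origin`."""
    golds = [r["answer"] for r in records]
    report = {"overall": accuracy(preds, golds)}
    for tag in ("A", "C"):
        idx = [i for i, r in enumerate(records) if r.get("origin") == tag]
        if idx:
            report[f"Mult-IT-{tag}"] = accuracy([preds[i] for i in idx],
                                                [golds[i] for i in idx])
    return report
        </preformat>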
<p>While accuracy is indeed a straightforward evaluation metric for this task, deciding which answer the model has identified as correct is not necessarily as straightforward, for a couple of reasons.</p>
<p>As mentioned in Section 3.3, the position of the correct answer in the prompt is randomly distributed, reducing the likelihood of bias resulting from its placement, although the models might have a tendency to select the first answer more frequently.</p>
        <p>A related issue is the fact that the model’s output, in
spite of the specific request in the prompt, might not
always just be the letter corresponding to the chosen
answer. In the case of longer outputs, simple regular
expressions will be applied to extract the relevant letter.</p>
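<p>The exact expressions are not fixed here; a minimal version could simply take the first standalone option letter in the output:</p>
        <preformat>
import re

def extract_letter(output):
    """Pull the first standalone option letter A-E from a model response."""
    match = re.search(r"\b([A-E])\b", output.strip())
    return match.group(1) if match else None
        </preformat>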
<p>In practice, as for all the CALAMITA challenges, the evaluation of the LLMs on Mult-IT will be carried out with the LM-evaluation-harness framework developed by EleutherAI (https://github.com/EleutherAI/lm-evaluation-harness).</p>
        <p>(Level-2 taxonomy categories of Mult-IT-C: Altre, Medicina, Corpo Pubblico, Giurisprudenza, Competenza Linguistica, Cultura Generale, Informatica, Logica, Farmacia, Geografia, Storia, APES, Scienze Motorie, Matematica, Lingua, Pubblica Amministrazione, Educazione civica, Letteratura, Biochimica, Chimica, Istruzione, Architettura, Fisica, Biologia, Economia, Scienze, Biotecnologie, Scienze naturali, Arte, profilo psicoattitudinale, Scienze della Comunicazione, Cucina, Scienze dei Beni culturali.)</p>
      </sec>
      <sec id="sec-3-2">
        <title>5. Limitations</title>
<p>The vast majority of the data comes from sources linked with Italian public institutions, and can be considered official documents. For this reason, we expect a high quality in the formulation of the quizzes and the correctness of the answers. Nonetheless, given the large amount of data, we cannot guarantee the absence of errors in individual questions. Human errors can happen, even in official selections, although this should be a rare occurrence. This aspect can be improved by analysing the results obtained by the models on the benchmark: the more the benchmark is used, the more it will be possible to isolate and eventually remove or correct problematic quizzes with data analytics techniques.</p>
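<p>One simple instance of such an analysis, sketched below, flags questions that none of the tested models answers correctly as candidates for manual review:</p>
        <preformat>
def flag_suspect_questions(per_model_correct):
    """per_model_correct maps a question id to booleans, one per tested model.
    Questions no model answers correctly may be mislabelled, ambiguous,
    or outdated, and deserve manual review."""
    return [qid for qid, outcomes in per_model_correct.items()
            if not any(outcomes)]
        </preformat>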
<p>Moreover, considering that the quizzes span almost thirty years, it is possible that some quizzes, particularly the ones regarding laws, may be outdated. Nonetheless, thanks to the availability of metadata, it is possible to further refine this dataset to also account for specific historical knowledge about laws by providing metadata to the model. However, we believe that for the first run of this evaluation this point will not create particular issues, as we expect the potentially outdated questions to be limited.</p>
        <p>Given that the data is public, it is possible that the original exams are already present in the models' training data, as they can be easily obtained on the Internet. At the same time, it is likely that some sources, for example complete laws of the Italian legislation, are present in the training data, but we consider this eventuality positive, given that one of the benchmark's aims is to evaluate the knowledge and capacity of the model to adapt to the Italian landscape.</p>
        <p>6. Data license and copyright issues</p>
        <p>Information about license and copyright issues is mandatory.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Acknowledgments</title>
<p>The authors would like to thank Alpha Test (https://www.alphatest.it), and in particular Martha Fabbri, for their interest in the Mult-IT CALAMITA challenge and for the extremely valuable exchange of ideas and data, which allowed us to shape a task of high potential impact also in the field of educational training and assessment.</p>
        <p>7. Online Resources</p>
        <p>The sources for the ceur-art style are available via
• GitHub,
• Overleaf template.</p>
        <p>[4] P. Wang, A. Chan, F. Ilievski, M. Chen, X. Ren, PINTO: Faithful language reasoning using prompt-generated rationales, in: Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022, 2022.</p>
        <p>[5] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, in: International Conference on Learning Representations, 2021.</p>
        <p>[6] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al., MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, arXiv preprint arXiv:2406.01574 (2024).</p>
        <p>[7] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, 2016. URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.</p>
        <p>[8] D. Croce, A. Zelenanska, R. Basili, Neural learning for question answering in Italian, in: C. Ghidini, B. Magnini, A. Passerini, P. Traverso (Eds.), AI*IA 2018 – Advances in Artificial Intelligence, Springer International Publishing, Cham, 2018, pp. 389–402.</p>
        <p>[9] I. Plaza, N. Melero, C. del Pozo, J. Conde, P. Reviriego, M. Mayor-Rocher, M. Grandury, Spanish and LLM benchmarks: is MMLU lost in translation?, arXiv preprint arXiv:2406.17789 (2024).</p>
        <p>[10] V. Lai, N. Ngo, A. Pouran Ben Veyseh, H. Man, F. Dernoncourt, T. Bui, T. Nguyen, ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 13171–13189.</p>
        <p>[11] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.</p>
        <p>[12] P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, Z. Sui, Large language models are not fair evaluators, 2023. arXiv:2305.17926.</p>
        <p>[13] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal, et al., Towards expert-level medical question answering with large language models, arXiv preprint arXiv:2305.09617 (2023).</p>
<p>A. Appendix A: Detailed statistics per category (Mult-IT-C)</p>
        <p>(Table columns: Total Quizzes, Quizzes Percentage, Total Tokens, Tokens Percentage.)</p>
        <p>Figure 10: Quiz percentage distribution, taxonomy level 2 (all the categories, Mult-IT-C)</p>
        <p>B. Appendix B: Distribution of position of correct answer (Mult-IT-C)</p>
        <p>C. Appendix C: Distribution of position of correct answer (Mult-IT-A)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kleyjo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tian</surname>
          </string-name>
          , A. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , et al.,
<article-title>Benchmarking large language models on CMExam - a comprehensive Chinese medical exam dataset</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>TruthfulQA: Measuring how models mimic human falsehoods</article-title>
          , in: S. Muresan, P. Nakov, A. Villavicencio (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp. 3214–3252. URL: https://aclanthology.org/2022.acl-long.229. doi:10.18653/v1/2022.acl-long.229.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>