Mult-IT
                                Multiple Choice Questions on Multiple Topics in Italian:
                                A CALAMITA Challenge
Matteo Rinaldi¹,†, Jacopo Gili¹,†, Maria Francis²,³,†, Mattia Goffetti⁴, Viviana Patti¹,‡ and Malvina Nissim²,*,‡

¹ University of Turin
² CLCG, University of Groningen
³ University of Trento
⁴ Alpha Test, S.R.L.


                                               Abstract
                                               Multi-choice question answering (MCQA) is a powerful tool for evaluating the factual knowledge and reasoning capacities of
                                               Large Language Models (LLMs). However, there is a lack of large-scale MCQA datasets originally written in Italian. Existing
                                               Italian MCQA benchmarks are often automatically translated from English, an approach with two key drawbacks: Firstly,
                                               automatic translations may sound unnatural, contain errors, or use linguistic constructions that do not align with the target
                                               language. Secondly, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. To address
                                               this gap, we present Mult-IT, an MCQA dataset comprising over 110,000 manually written questions across a wide range of
                                               topics. All questions are sourced directly from preparation quizzes for Italian university entrance exams, or for exams for
                                               public sector employment in Italy. We are hopeful that this contribution enables a more comprehensive evaluation of LLMs’
                                               proficiency, not only in the Italian language, but also in their grasp of Italian cultural and contextual knowledge.

                                               Keywords
                                               CALAMITA Challenge, Italian, Benchmarking, Multiple-Choice Questions, LLMs


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
* Corresponding author.
† Shared first authorship.
‡ Shared supervision.
Email: matteo.rinaldi@unito.it (M. Rinaldi); jacopo.gili584@edu.unito.it (J. Gili); maria.francis287@gmail.com (M. Francis); mattia_goffetti@alphatest.it (M. Goffetti); viviana.patti@unito.it (V. Patti); m.nissim@rug.nl (M. Nissim)
Web: https://github.com/mrinaldi97 (M. Rinaldi); https://github.com/Jj-source (J. Gili); https://github.com/rosakun (M. Francis); https://github.com/vivpatti (V. Patti); https://github.com/malvinanissim (M. Nissim)
ORCID: 0009-0004-7488-8855 (M. Rinaldi); 0009-0007-1343-3760 (J. Gili); 0009-0007-7638-9963 (M. Francis); 0000-0001-5991-370X (V. Patti); 0000-0001-5289-0971 (M. Nissim)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Challenge: Introduction and Motivation

In recent years, multi-choice question answering (MCQA) has established itself as a powerful method to test the factual knowledge and reasoning abilities embedded in large language models (LLMs) as a byproduct of the language modelling objective [1, 2, 3, 4].

The evaluation of MCQAs can be easily automated, offering a significant advantage over other benchmarking formats such as open-ended text responses. In addition, with appropriately targeted prompting, the limited number of possible choices leaves less room for ambiguities in the model’s answers.

It is no surprise then that the Massive Multitask Language Understanding (MMLU¹) benchmark [5] has become the standard for the evaluation of factual knowledge and reasoning abilities of LLMs. Containing 15,908 English quizzes, this benchmark spans diverse disciplines, including humanities, law, STEM, and ethics. To keep up with the development of models which are rapidly improving at answering MMLU questions, Wang et al. [6] have developed MMLU-Pro, an extended version of MMLU that includes more reasoning-focused questions and more distractors per question (from four to ten), while removing questions that are too simple or noisy.

¹ https://github.com/hendrycks/test

Although MMLU has proven to be a useful testbed for LLMs, it is currently centered around the English language. Multiple choice question datasets in other languages tend to be translations of originally English data, rather than being developed natively in the target language. This also holds for Italian, for which a translation of the Squad dataset [7], namely Squad-IT [8], has been the reference for evaluating models on QA-tasks. There are at least two problems with using translated data: First, translations are often generated automatically, resulting in data that sounds unnatural or is even incorrect: automatic translation can easily introduce artifacts, break the
coherence of discourse, and encourage the presence of linguistic constructions that reflect the source language rather than the target one [9]. The second issue relates to culture and societal norms: text translated from English to Italian will lack topical biases, preferences, conventions, and ways of expressing ideas that are unique to Italian culture. Thus, while the text may be expressed in the Italian language, its content and underlying norms will continue to represent an Anglo-centric, predominantly American perspective.

More generally, training data for LLMs is biased towards English content, and as a result, there is often a gap between English and non-English performance [10]. For example, the Common Crawl dataset², often used as a base for more refined datasets to be employed in the pre-training of LLMs, is composed of 45% English content, while languages such as Spanish, French, Italian, and Chinese each account for less than 5%; the only languages above that threshold are Russian (6.2%) and German (5.1%).

² https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html, accessed on 18/09/24

Creating a large-scale multi-choice question answering benchmark using original Italian data will make it possible to investigate the Italian abilities of LLMs in a more natural and transparent way, possibly also leading to a better understanding of how to make multilingual models better at Italian. It will also serve as a core benchmark for assessing the performance of monolingual Italian LLMs. If similar datasets are collected natively for other languages, Mult-IT can be part of a larger MCQA benchmark which is multilingual in the truest sense.

Mult-IT, presented at CALAMITA [11], is the first massive Multi-Choice-Question-Answering dataset specifically designed for the Italian language which draws on Italian culture and Italian-focused knowledge. By providing a comprehensive, culturally relevant benchmark for the Italian language, we aim to set a precedent for the development of similar resources in other languages and cultures, ultimately contributing to a more diverse and inclusive AI landscape.


2. Challenge: Description

This challenge involves a multiple-choice question answering task. The model is prompted with a simple instruction (see Box 1), followed by a question and a set of three to five possible answers, depending on the source and topic of the question. Among these answers, only one is correct, and the others are distractors. The model is expected to identify the correct answer and return the letter corresponding to the option deemed correct.

All questions in the benchmark have been manually crafted for the purpose of training or testing students, job applicants, or learners across a range of topics, including general knowledge and more specialised subjects. These questions make up the Mult-IT dataset: Multiple Choice Questions on Various Topics in Italian, which we are introducing in this contribution. The details of the dataset are described in Section 3. The defining feature of the Mult-IT challenge is that all of the MCQs are natively Italian, both in language and in content. While this is an advantage for gaining a better understanding of model behaviour on Italian data, we do expect a decline in model performance. Considering that even models which have been trained on multilingual data have a heavy bias towards English and American-centric culture, it is expected that the correctness of the answer may be affected by a cultural (and possibly language) gap. Should the battery of the models tested also include Italian monolingual or bilingual English-Italian models trained on a substantial amount of Italian text, this benchmark will make it possible to underscore differences in performance possibly associated with the language specificity of such models.


3. Data description

Mult-IT contains quizzes designed to assess candidates’ knowledge in open competitive exams, whether for admission to national universities or for positions in Italian institutions. This approach offers several advantages. First of all, these public competitions encompass very general topics such as language comprehension, basic history, and common knowledge, but also more specialised ones, focusing on specific laws needed for certain professions or the security measures required for jobs such as police officer or firefighter. Our benchmark, therefore, contains questions that range from a low level of difficulty to a very high and specific level, setting high standards for the performance of the models, and it may also be useful to assess specific knowledge valuable for the adoption of models in Public Administration scenarios. The inclusion of profession-specific questions in Mult-IT tests the ability of LLMs to apply their knowledge in practical, real-world scenarios, a feature that could prove particularly valuable in assessing the potential of AI systems to support specialised fields and decision-making processes in professional and administrative contexts in the Italian landscape. Moreover, the quizzes contained in the dataset also present challenges regarding reasoning, such as logical thinking and mathematical reasoning, as well as quizzes specifically designed to assess knowledge and mastery of the Italian language, for example, text comprehension or detailed understanding of grammatical phenomena.

Mult-IT consists of two core subsets, which are divided by the origin of the data. Both subsets are made of quizzes
that test knowledge of general Italian culture and that are used in the public recruitment processes for government-based positions. They are described in more detail below.

Mult-IT-A. Mult-IT-A is a collection of MCQs provided by Alpha Test. It contains a total of 1,692 questions in Italian, spanning 17 categories which correspond to topics featuring in entry exams for Italian universities (see Table 1 below for details) or question answering tests employed in public competitions. The quizzes in the dataset falling into the categories of law, pedagogy, psychology and criminology originate from public competitions. For each question, four or five possible answers are provided, out of which only one is correct. An example from topic ‘sinonimi’ (synonyms) is shown in Figure 1.

    Indolente e’ un sinonimo di:
      (A) tenero,
      (B) doloroso,
      (C) pigro,
      (D) insensibile

Figure 1: Example of question and possible answers from Mult-IT-A for the topic ‘sinonimi’ (synonyms). The correct answer is (C).

Mult-IT-C. Mult-IT-C is a large collection of MCQs, organised in groups of questions ("quizzes") around multiple topics, which we have obtained from publicly accessible online platforms through data-gathering and web-scraping. The quizzes are meant to be used by people who need to prepare to apply for job positions in the public sector. One of the most interesting features of Mult-IT-C is its size: it contains more than 100,000 questions, making it more than six times the size of MMLU. An example from topic ‘geografia’ (geography) is shown in Figure 2.

    In quale nazione si trova il Lago Balaton?
      (A) Ucraina,
      (B) Ungheria,
      (C) Romania,
      (D) Repubblica Ceca
      (E) Bulgaria

Figure 2: Example of question and possible answers from Mult-IT-C for the topic ‘geografia’ (geography). The correct answer is (B).


3.1. Origin of data

Mult-IT-A. All the materials of Mult-IT-A were obtained thanks to the generosity of Alpha Test³. Alpha Test S.r.l. is an Italian publishing house and educational training company, founded in Milan in 1987, that specialises in study aid materials and courses for high school, university, professional tests, exams and certifications. Alpha Test is the main reference for high school students preparing for university admission. Each year, Alpha Test gathers new data from the entrance exams of public and private universities and military schools, mainly in the form of multiple choice questions. The publishing house enhances such materials with comments and explanations, and creates variations or completely new versions of the original quizzes. All the materials in Mult-IT-A have been sourced from original, public data, and represent a varied sample of quizzes about general culture, STEM, and juridical disciplines.

³ https://www.alphatest.it/

Mult-IT-C. All the materials of Mult-IT-C were obtained using a web-scraping process from the website "Concorsi Pubblici"⁴ via customised Python scripts. While there exist many websites collecting public competition exams, we found Concorsi Pubblici to be the most complete. Because the same public competition can be listed on several platforms, gathering all the data from a single website avoided the risk of data duplication. The quizzes on Concorsi Pubblici are organised by topic (see Appendix A), and were extracted for the time interval 1997-2024.

⁴ https://www.concorsipubblici.com/


3.2. Data format

Overall, the data format is consistent across the two Mult-IT subsets, which allows for a single evaluation procedure on Mult-IT. The larger size of the Mult-IT-C dataset allowed us to include additional information, including details about the quiz’s administration presented in the form of quiz blocks and a multi-level topic taxonomy. This feature is absent in Mult-IT-A as questions are collected by subject without being grouped in quizzes.

Common data fields in all Mult-IT are:

     • origin: Either ’A’ or ’C’, indicating whether the question belongs to Mult-IT-A or Mult-IT-C.
     • question: The question.
     • choices: The list of possible answers.
     • answer: The (zero-based) array index corresponding to the correct answer in the choices array.

These common fields are crucial to the evaluation task, but each of the sub-portions has additional information added.
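As a minimal sketch of how the common fields fit together, the snippet below parses one record and turns its answer index into the letter a model is expected to return. The record is the Mult-IT-A example reported in Figure 3; the helper names (`answer_letter`, `answer_text`) are hypothetical and not part of the dataset or of any released code, and the index is assumed to be zero-based, as the published examples suggest.

```python
import json
import string

# Record copied from the Mult-IT-A example in Figure 3.
RECORD = json.loads("""
{
  "origin": "A",
  "topic": "informatica",
  "question": "Le dimensioni del monitor si misurano in:",
  "choices": ["megahertz", "pixel", "centimetri", "pollici"],
  "answer": 3
}
""")


def answer_letter(record: dict) -> str:
    """Map the (assumed zero-based) `answer` index to its option letter."""
    index = record["answer"]
    if not 0 <= index < len(record["choices"]):
        raise ValueError(f"answer index {index} is out of range")
    return string.ascii_uppercase[index]


def answer_text(record: dict) -> str:
    """Return the text of the correct choice."""
    return record["choices"][record["answer"]]


print(answer_letter(RECORD), answer_text(RECORD))  # D pollici
```

Keeping the gold answer as an index rather than a letter means the same record can be rendered with any labelling scheme at prompt time.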
Mult-IT-A. Examples taken from Mult-IT-A are given in Figure 3.

    {
      "origin": "A",
      "topic": "informatica",
      "question": "Le dimensioni del monitor si misurano in:",
      "choices": [
        "megahertz",
        "pixel",
        "centimetri",
        "pollici"
      ],
      "answer": 3
    },
    {
      "origin": "A",
      "topic": "psicologia e sociologia del disadattamento",
      "question": "Come viene definito lo stimolo funzionale a provocare un cambiamento?",
      "choices": [
        "Stress",
        "Output",
        "Input",
        "Matrice"
      ],
      "answer": 0
    },

Figure 3: Data format used in Mult-IT-A.

The only additional field is topic, pointing to the topic of the question. Its distribution and statistics about token count and char count are available in Figure 6.

Mult-IT-C. The dataset consists of two files: quiz.jsonl contains the actual questions, while metadata.jsonl contains additional information about the questions. An example from the quiz.jsonl file is given in Figure 4.

    {
      "quiz_id": 2250,
      "question_id": 20,
      "question": "La Costituzione riconosce allo Stato una potesta’ legislativa esclusiva in materia di:\n\n",
      "choices": [
        "organizzazione della rete scolastica",
        "norme generali sull’istruzione",
        "ricerca scientifica e tecnologica",
        "istruzione professionale"
      ],
      "answer": 1
    }
    {
      "quiz_id": 1253,
      "question_id": 63,
      "question": "In un ingranaggio con piu’ ruote dentate, una ruota denominata R1 ha 25 denti e fa muovere una seconda ruota denominata R2 da 50 denti, che a sua volta fa muovere una terza ruota R3 da 150 denti. Se la ruota dentata R3 fa un giro e mezzo, quanti ne fa la ruota dentata R1?\n\n",
      "choices": [
        "3",
        "5",
        "6",
        "9",
        "12"
      ],
      "answer": 4
    }

Figure 4: Data format used in the quiz.jsonl file of Mult-IT-C.

Data fields unique of this subportion are:

     • quiz_id: The ID of the quiz to which the question pertains.
     • question_id: The unique identifier of the question inside the quiz. In combination with the quiz_id it forms the unique identifier of each question and can be used to retrieve the metadata of the question from the metadata.jsonl file.

Data fields of the metadata are:

     • id: The unique identifier of the question.
     • title: Title of the quiz sourced from the original website.
     • tags: List of word tags.
     • class_level1: The first level of the topic taxonomy.
     • class_level2: The second level of the topic taxonomy.
     • class_level3: The third level of the topic taxonomy.
     • difficulty: The difficulty level as estimated in the original website.
     • source: The source of the question.

An example of an item from the metadata file is given in Figure 5.


3.3. Zero-shot prompting

We evaluate our models in a zero-shot setting, thereby imitating the conditions of a real use-case scenario. The prompt we chose is designed to encourage the model to output only the letter corresponding to the answer. The original prompt, together with its English translation, is presented in Box 1.
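To make the zero-shot setup concrete, the following sketch assembles a prompt from the Italian instruction of Box 1 and a question, then extracts a letter from a model's raw output. It is only an illustration under stated assumptions: the "(A) ..." option layout mirrors Figures 1 and 2 but the exact layout used in the evaluation is not specified here, and `build_prompt`, `parse_letter` and `is_correct` are hypothetical helpers, not the official evaluation code. The parsing rule (first standalone valid uppercase letter) is a simple heuristic that can misfire on verbose answers.

```python
import re
import string
from typing import List, Optional

# Italian instruction copied from Box 1.
INSTRUCTION = (
    "Di seguito è riportata una domanda a scelta multipla e varie possibili "
    "risposte, ciascuna indicata da una lettera. Scegli la risposta che meglio "
    "risponde alla domanda, e riporta in output soltanto la lettera "
    "corrispondente a quella risposta, senza spiegazioni."
)


def build_prompt(question: str, choices: List[str]) -> str:
    """Instruction, blank line, question, then one lettered option per line."""
    lines = [INSTRUCTION, "", question.strip()]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"({letter}) {choice}")  # layout is an assumption
    return "\n".join(lines)


def parse_letter(output: str, n_choices: int) -> Optional[str]:
    """Return the first standalone valid option letter, or None if absent."""
    valid = string.ascii_uppercase[:n_choices]
    match = re.search(rf"(?<![A-Za-z])([{valid}])(?![A-Za-z])", output)
    return match.group(1) if match else None


def is_correct(output: str, record: dict) -> bool:
    """Compare the parsed letter with the zero-based gold index."""
    letter = parse_letter(output, len(record["choices"]))
    if letter is None:
        return False
    return string.ascii_uppercase.index(letter) == record["answer"]


# Question from Figure 2; the gold answer (B) corresponds to index 1.
record = {
    "question": "In quale nazione si trova il Lago Balaton?",
    "choices": ["Ucraina", "Ungheria", "Romania", "Repubblica Ceca", "Bulgaria"],
    "answer": 1,
}
print(build_prompt(record["question"], record["choices"]))
print(is_correct("B", record))
```

Outputs from which no letter can be extracted are counted as wrong in this sketch; a stricter or more lenient policy is equally possible and would change the measured accuracy.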
    {
        "id": 2250,
        "title": "Area 3 Giuridico Amministrativa Finanziaria - 25 domande concorso dirigente scolastico Miur",
        "tags": [
          "concorsi dirigenti scolastici",
          "concorso dirigente scolastico",
          "dirigente scolastico concorso dirigente scolastico 2017",
          "miur",
          "miur concorso",
          "miur concorso dirigente scolastico miur concorsi scuola",
          "concorso scuola",
          "bando concorso scuola",
          "bandi miur"
        ],
        "class_lev1": [
          "Miur",
          "dirigente scolastico"
        ],
        "source": [
          "Fgl Cgil, Miur"
        ],
        "difficulty": [
          "medio"
        ],
        "class_lev2": [
          "Istruzione",
          "Altre"
        ],
        "class_lev3": [
          "Societa’ e Diritto",
          "Altro"
        ]
    }

Figure 5: Data format used in the metadata.jsonl file of Mult-IT-C.

   Prompt for the LLM

   Di seguito è riportata una domanda a scelta multipla e varie possibili risposte, ciascuna indicata da una lettera. Scegli la risposta che meglio risponde alla domanda, e riporta in output soltanto la lettera corrispondente a quella risposta, senza spiegazioni.

   Below is a multi-choice question together with possible answers, each indicated by a letter. Choose the best answer for the question, and report as output only the letter corresponding to that answer, without any explanation.

Box 1: Zero-shot prompt and English translation.

We decided to write the prompt in Italian in order to better represent a multilingual scenario. The prompt does not contain any information about the subject of the question or any other informative cues. In this way, our benchmark not only tests the model in question answering, but also indirectly tests the instruction-following abilities of the model in a language other than English.

Previous work on evaluating the performance of LLMs on MCQ datasets has identified two aspects which can interfere with the model’s answers and therefore accuracy. One has to do with the order of possible answers: Wang et al. [12] show that the first presented option out of the possible choices tends to be preferred in the model’s answer, making it quite important to take the order of possible answers into account. The other has to do with the prompt’s (and even the question’s) formulation: Singhal et al. [13] experiment with multiple types of prompts and also show that prompt formulation affects the model’s output.

Because the position of the correct answer in the original data was already randomly distributed, which we verified with a supplementary analysis on the data (see Appendix B), performing a random permutation of the possible answers was not necessary.


3.4. Data statistics

Mult-IT-A. The Mult-IT-A dataset is composed of 1,692 questions, spanning 17 topics, all centered around knowledge required for entry exams at Italian Universities. The topics, and some additional information on the dataset composition, are provided in Table 1.

On average, questions are 83.76 characters long and contain 25 tokens when counted with the tiktoken cl100k_base⁵ tokenizer, or 16.5 when counted with the spaCy⁶ library using the it_core_news_lg model⁷.

⁵ https://github.com/openai/tiktoken
⁶ https://github.com/explosion/spaCy
⁷ https://github.com/explosion/spacy-models/releases/tag/it_core_news_lg-3.7.0

Further statistics about quiz distribution and answer position are available in Appendix ?? and C.

It’s worth noting that permuting the order of the answers would be recommended to avoid any kind of imbalance, as Mult-IT-A shows an uneven correct answer distribution leaning heavily on the first choice, acquired from the source data.

Mult-IT-C. The Mult-IT-C dataset is composed of 108,773 questions divided into 4,129 quizzes.

To avoid confusion, we decided to give unequivocal names to the items of the subjects. A "quiz" is defined as a set of multiple "questions". Quizzes come from real-world examples, so they are provided with a specific name and
               Category                                                 Questions      #Tokens   Avg tokens/question
               informatica                                                     128          1997             15.602
               sintassi                                                        121          3617             29.893
               grammatica                                                      119          2429             20.412
               completamento frasi                                             115          3034             26.383
               geografia                                                       114          1247             10.939
               geometria                                                       114          3541             31.061
               ortografia                                                      113          2273             20.115
               biologia                                                        105          4588             43.695
               storia                                                          100          1744             17.440
               psicologia e sociologia del disadattamento                      100          2890             28.900
               elementi di criminologia                                        100          2386             23.860
               pedagogia                                                       100          2271             22.710
               elementi di diritto costituzionale ed amministrativo            100          1813             18.130
               sinonimi                                                         98          1667             17.010
               chimica                                                          83          2983             35.940
               fisica                                                           43          2251             52.349
               deduzione logica                                                 39          1247             31.974

Table 1
Mult-IT-A: Topics included in the dataset, number of questions per topic, total tokens per topic, and average length of question
per topic in terms of tokens. The topic “pedagogia" in the Table is short for “pedagogia con particolare riferimento agli
interventi relativi all’osservazione e al trattamento dei detenuti e degli internati"; the topic “elementi di diritto costituzionale
ed amministrativo" is short for “elementi di diritto costituzionale ed amministrativo con particolare riferimento al rapporto di
pubblico impiego".


a categorisation originating from the original data source. A quiz can contain a variable number of questions. The average number of questions per quiz is 26, and the maximum is 250. There are 1,623 quizzes with more than 25 items, 298 with more than 50, and only 22 with more than 100 items.
   The original categorisation made by the authors of the website "Concorsi Pubblici" was problematic for our purposes: some categories were near-duplicates of each other, differing only by a few words. Moreover, we believed that 186 categories were too many for a meaningful visualisation and management of the data. For this reason, we created a hierarchy of three levels in which the first (bottom) level corresponds to the original categorisation of the data, the second level groups these categories into 36 areas, and the third and most abstract level has only 7 categories. The drawback of this approach, as can be seen in the tables and graphs contained in Appendix A, is that in both supplementary categorisations a significant number of quizzes fall into the category "Other".
   Nonetheless, we believe that this abstract categorisation is useful for getting a general look at the data composition, and thus at the performance of the models in terms of macro-areas. On the other hand, keeping the original, very detailed categorisation in the data allows for more in-depth analysis of model performance on specific aspects. In Appendix A, all the statistics of the 186 categories are listed in the form of a table. To appreciate the level of specificity reached by the first level of categorisation, it is interesting to note, as examples, categories such as "Verbs", "Diphthongs", or "Word Meanings", which refer to specific language abilities. These categories are then grouped in level 2 as "Linguistic Competence" and in level 3 as "Language". As another example, some categories refer to specific aspects of the Italian public administration: at level 1 we find fields such as "INPS", that is, the National Institute for Social Security, or "ASL", that is, the Local Health Authority.
   We believe that having such a precise categorisation at our disposal is of great help in understanding the abilities and weaknesses of models in very specific aspects. It is thus helpful, on the one hand, for assessing the possibility of direct practical employment of models in Italian public administration and, on the other hand, for improving the scientific understanding of models and how they deal with different kinds of challenges. This last aspect can also be helpful for interpretability studies of LLMs.
   On average, questions are 104 characters long and contain 27.5 tokens when counted with the tiktoken cl100k_base 8 tokenizer, or 19.8 when counted with the spaCy library 9 using the it_core_news_lg model 10 . The longest question is 1,363 tokens long.

8 See footnote 5
9 See footnote 6
10 See footnote 7
Figure 6: Mult-IT-A: Topic distribution percentage-wise.
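The recommendation in Section 3.4 to permute the order of the answers for Mult-IT-A can be illustrated with a minimal sketch. It assumes that each question is available as a list of answer strings plus the index of the correct one; the function name `shuffle_choices` and this representation are hypothetical and are not taken from the dataset files.

```python
import random


def shuffle_choices(choices, correct_index, rng):
    """Shuffle the answer options and return them with the new index of the correct one.

    `choices` and `correct_index` are a hypothetical representation of a question;
    the real Mult-IT files may store answers differently.
    """
    order = list(range(len(choices)))
    rng.shuffle(order)                       # order[k] = original position of the option now at k
    shuffled = [choices[i] for i in order]
    new_correct = order.index(correct_index)  # where the correct option ended up
    return shuffled, new_correct


# Example: the correct answer is originally in first position (index 0).
rng = random.Random(0)  # fixed seed so the permutation is reproducible
choices = ["Roma", "Milano", "Torino", "Napoli"]
shuffled, new_idx = shuffle_choices(choices, 0, rng)
```

Using a seeded `random.Random` instance keeps the permutation reproducible across evaluation runs, which matters when results from different models are compared on the same prompts.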


4. Evaluation

We will use accuracy to evaluate the LLMs’ performance on Mult-IT. Accuracy is defined as the ratio of correctly answered questions to the total number of questions, and it is a straightforward and easily interpretable measure of performance on MCQ tasks. Accuracy will be reported overall, and also separately for the two subsets Mult-IT-A and Mult-IT-C.
    While accuracy is a straightforward evaluation metric for this task, determining which answer the model has selected is less so, for two reasons.
    As mentioned in Section 3.3, the position of the correct answer in the prompt is randomly distributed, reducing the likelihood of bias resulting from its placement, although the models might have a tendency to select the first answer more frequently.
    A related issue is that the model’s output, in spite of the specific request in the prompt, might not always be just the letter corresponding to the chosen answer. In the case of longer outputs, simple regular expressions will be applied to extract the relevant letter.
    In practice, as for all the CALAMITA challenges, the evaluation of the LLMs on Mult-IT will be carried out on the LM-evaluation-harness framework developed by EleutherAI 11 .

11 https://github.com/EleutherAI/lm-evaluation-harness
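The two steps described in this section, extracting the chosen letter from a possibly verbose output and computing accuracy, can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the set of valid letters, and the helper names `extract_letter` and `accuracy` are hypothetical, and the sketch does not reproduce the actual LM-evaluation-harness configuration used for the challenge.

```python
import re


def extract_letter(output, letters="ABCDE"):
    """Return the answer letter found in a model output, or None if there is none.

    Assumption: answers are labelled with the uppercase letters in `letters`.
    """
    text = output.strip()
    # Case 1: the whole output is just a letter, e.g. "B", "b.", "(C)".
    m = re.fullmatch(rf"\(?([{letters}])\)?[.):]?", text, flags=re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Case 2: longer output; take the last standalone uppercase letter.
    # Lowercase is ignored here because "a" and "e" are common Italian words.
    found = re.findall(rf"\b([{letters}])\b", text)
    return found[-1] if found else None


def accuracy(predictions, gold):
    """Ratio of correctly answered questions to the total number of questions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


outputs = ["B", "La risposta corretta è C.", "Non lo so", "(a)"]
gold = ["B", "C", "D", "A"]
predictions = [extract_letter(o) for o in outputs]
score = accuracy(predictions, gold)
```

In this sketch an output from which no letter can be extracted is mapped to `None` and therefore counted as a wrong answer, which is one possible design choice among several.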
                         Category                    Questions     #Tokens   Avg tokens/question
                         Altre                           31,281      911,975               29.154
                         Medicina                        28,376      822,541               28.987
                         Corpo Pubblico                  25,208      604,007               23.961
                         Giurisprudenza                  15,540      482,403               31.043
                         Competenza Linguistica           7,142      196,610               27.529
                         Cultura Generale                 7,111      162,300               22.824
                         Informatica                      4,391       80,869               18.417
                         Logica                           3,374      130,258               38.606
                         Farmacia                         3,336       65,898               19.754
                         Geografia                        2,886       44,707               15.491
                         Storia                           2,150       49,945                23.23
                         APES                             2,139       70,868               33.131
                         Scienze Motorie                  2,066       56,278                27.24
                         Matematica                       1,931       52,858               27.373
                         Lingua                           1,929       36,439                18.89
                         Pubblica Amministrazione         1,565       50,432               32.225
                         Educazione civica                1,464       29,510               20.157
                         Letteratura                        853       17,811                20.88
                         Biochimica                         786       12,346               15.707
                         Chimica                            784       14,037               17.904
                         Istruzione                         745       18,528                24.87
                         Architettura                       382       23,280               60.942
                         Fisica                             356        8,969               25.194
                         Biologia                           336       10,512               31.286
                         Economia                           309       10,210               33.042
                         Scienze                            235        3,712               15.796
                         Biotecnologie                      185        5,741               31.032
                         Scienze naturali                   185        2,685               14.514
                         Arte                               180        4,419                24.55
                         profilo psicoattitudinale          135        5,020               37.185
                         Scienze della Comunicazione         90        1,787               19.856
                         Cucina                              75        1,130               15.067
                         Scienze dei Beni culturali          40          502                12.55

Table 2
Level 2 of the taxonomy for Mult-IT-C
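Since accuracy is meant to be inspected per macro-area as well as overall, a per-category breakdown over one taxonomy level could be computed along these lines. The sketch assumes that each evaluated question has been joined with the label list of its quiz; the `class_lev3` key appears in the metadata format of Figure 5, whereas the `class_lev2` key, the `correct` flag, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_category(records, level_key="class_lev2"):
    """Compute accuracy separately for every category of one taxonomy level.

    Each record is a dict with a list of labels under `level_key` and a boolean
    `correct` flag (both hypothetical). A question carrying several labels
    contributes to each of them.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        for category in rec[level_key]:
            totals[category] += 1
            hits[category] += int(rec["correct"])
    return {category: hits[category] / totals[category] for category in totals}


# Toy records, not real Mult-IT results.
records = [
    {"class_lev2": ["Medicina"], "correct": True},
    {"class_lev2": ["Medicina", "Farmacia"], "correct": False},
    {"class_lev2": ["Logica"], "correct": True},
]
per_category = accuracy_by_category(records)
```

Passing `level_key="class_lev3"` would give the same breakdown at the most abstract level of the hierarchy, provided the records carry that field.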


5. Limitations

The vast majority of the data comes from sources linked to Italian public institutions and can be considered official documents. For this reason, we expect high quality in both the formulation of the quizzes and the correctness of the answers. Nonetheless, given the large amount of data, we cannot guarantee the absence of errors in individual questions. Human errors can happen, even in official selections, although this should be a rare occurrence. This aspect can be improved by analysing the results obtained by models on the benchmark: the more the benchmark is used, the easier it will be to isolate, and eventually remove or correct, problematic quizzes with data analytics techniques.
   Moreover, considering that the quizzes span almost thirty years, it is possible that some quizzes, particularly those regarding laws, are outdated. Nonetheless, thanks to the availability of metadata, it is possible to further refine this dataset to also account for specific historical knowledge about laws by providing metadata to the model. However, we believe that for the first run of this evaluation this point will not create particular issues, as we expect the potentially outdated questions to be limited.
   Given that the data is public, it is possible that the original exams are already present in the models’ training data, as they can easily be obtained on the Internet. At the same time, it is likely that some sources, for example complete laws of the Italian legislation, are present in the training data, but we consider this eventuality positive, given that one of the benchmark’s aims is to evaluate the knowledge and capacity of the model to adapt to the Italian landscape.

Figure 7: Distribution of number of items by token count for Mult-IT-A.

Figure 8: Quiz percentage distribution, taxonomy level 2 (top 15 categories, Mult-IT-C)

6. Data license and copyright issues

Information about license and copyright issues is mandatory.

Acknowledgments

The authors would like to thank Alpha Test (https://www.alphatest.it), and in particular Martha Fabbri, for their interest in the Mult-IT CALAMITA challenge and for the extremely valuable exchange of ideas and data, which allowed us to shape a task of high potential impact also in the field of educational training and assessment.

A. Appendix A: Detailed Statistics per category (Mult-IT-C)

 Category                                 Total Quizzes   Quizzes Percentage   Total Tokens   Tokens Percentage   Avg Token-Quiz
 Operatore Socio Sanitario                12434     7.39%        272403   6.03%         21.908
 Arma dei Carabinieri                     8179      4.86%        156358   3.46%         19.117
 Carabiniere                              8094      4.81%        154435   3.42%         19.08
 Istruttore Amministrativo                6188      3.68%        199101   4.41%         32.175
 Diritto Amministrativo                   6156      3.66%        219261   4.85%         35.617
 Poliziotto Municipale                    5834      3.47%        160459   3.55%         27.504
 Infermiere                               5531      3.29%        134725   2.98%         24.358
 Informatica                              4011      2.38%        74362    1.65%         18.54
 Guardia di Finanza                       4000      2.38%        116368   2.58%         29.092
 Formez                                   3945      2.34%        77864    1.72%         19.737
 Farmacia                                 3336      1.98%        65898    1.46%         19.754
 Agente di Polizia Municipale             3232      1.92%        94988    2.1%          29.39
 Assistente Amministrativo                3147      1.87%        82803    1.83%         26.312
 cultura generale                         3139      1.87%        71725    1.59%         22.85
 Polizia Municipale                       2615      1.55%        77641    1.72%         29.691
 Medicina e chirurgia                     2475      1.47%        135179   2.99%         54.618
 Polizia di Stato                         2441      1.45%        54487    1.21%         22.322
 Istruttore Amministrativo Contabile      2296      1.36%        65925    1.46%         28.713
 Professioni Sanitarie                    2260      1.34%        80903    1.79%         35.798
 Cultura generale : Prove Concorsuali     2199      1.31%        49629    1.1%          22.569
 Assistente giudiziario                   2123      1.26%        77906    1.72%         36.696
 Scienze Motorie e Sportive               2054      1.22%        56111    1.24%         27.318
 Medico                                   1957      1.16%        43704    0.97%         22.332
 Matematica                               1731      1.03%        48665    1.08%         28.114
 Cultura generale : Eserciziario          1724      1.02%        39858    0.88%         23.119
 Grammatica generale                      1713      1.02%        36572    0.81%         21.35
 Diritto Costituzionale                   1649      0.98%        33879    0.75%         20.545
 Scienze infermieristiche ed ostetriche   1570      0.93%        61315    1.36%         39.054
 Educazione civica                        1464      0.87%        29510    0.65%         20.157
 Azienda Sanitaria Locale (ASL)           1456      0.87%        41252    0.91%         28.332
 INPS                                     1440      0.86%        48416    1.07%         33.622
 Collaboratore Amministrativo             1334      0.79%        39673    0.88%         29.74
 Legislazione sanitaria                   1300      0.77%        27976    0.62%         21.52
 Diritto del Lavoro                       1284      0.76%        55342    1.22%         43.101
 Logica : Ragionamento logico             1280      0.76%        79782    1.77%         62.33
 Inglese                                  1274      0.76%        24616    0.54%         19.322
 Odontoiatria e protesi dentarie          1188      0.71%        62315    1.38%         52.454
 successioni di numeri e lettere          1170      0.7%         20970    0.46%         17.923
 Contabilità pubblica                     1139      0.68%        49706    1.1%          43.64
 Significato parole                       1120      0.67%        19914    0.44%         17.78
 Geografia                                1115      0.66%        16775    0.37%         15.045
 Istruttore direttivo amministrativo      1113      0.66%        30593    0.68%         27.487
 Storia                                   1032      0.61%        22913    0.51%         22.203
 Comprensione di testi                    1030      0.61%        93831    2.08%         91.098
 lingua italiana                          972       0.58%        18226    0.4%          18.751
 Attualità                                960       0.57%        20477    0.45%         21.33
 Geometra                                 948       0.56%        32225    0.71%         33.993
 Poliziotto di stato (Agente)             930       0.55%        20902    0.46%         22.475
dirigente scolastico                       875   0.52%   21699   0.48%   24.799
Corpo Forestale dello Stato                838   0.5%    24954   0.55%   29.778
Sinonimi                                   825   0.49%   6784    0.15%   8.223
Diritto Penale                             815   0.48%   19632   0.43%   24.088
Istruttore informatico                     787   0.47%   17876   0.4%    22.714
Biochimica                                 786   0.47%   12346   0.27%   15.707
Professioni sanitarie                      771   0.46%   19700   0.44%   25.551
Miur                                       745   0.44%   18528   0.41%   24.87
Geografia Astronomica                      742   0.44%   12354   0.27%   16.65
Diritto Amministrativo (Forze dell’ordine, Poliziotto municipale)   739   0.44%   22624   0.5%    30.614
Istruttore tecnico                         724   0.43%   23792   0.53%   32.862
Funzionario Amministrativo                 710   0.42%   21007   0.46%   29.587
Università                                 670   0.4%    19715   0.44%   29.425
Contrari                                   645   0.38%   7018    0.16%   10.881
Chimica                                    644   0.38%   10589   0.23%   16.443
Legislazione sociale                       640   0.38%   23527   0.52%   36.761
bibliotecario                              634   0.38%   14869   0.33%   23.453
Amministrativo                             618   0.37%   19132   0.42%   30.958
vigile del fuoco                           610   0.36%   14755   0.33%   24.189
Corpo Nazionale dei Vigili del Fuoco       610   0.36%   14755   0.33%   24.189
Azienda Ospedaliera                        600   0.36%   22730   0.5%    37.883
Diritto Comunitario                        595   0.35%   12462   0.28%   20.945
Verbi                                      560   0.33%   9853    0.22%   17.595
Dirigente Amministrativo                   545   0.32%   21679   0.48%   39.778
Medicina veterinaria                       537   0.32%   29629   0.66%   55.175
Scienze dell’educazione                    510   0.3%    18234   0.4%    35.753
Istruttore direttivo tecnico               503   0.3%    14508   0.32%   28.843
storia d’Italia                            500   0.3%    13339   0.3%    26.678
geografia Italia                           499   0.3%    6993    0.15%   14.014
Personale ATA                              495   0.29%   8169    0.18%   16.503
Sostantivi                                 490   0.29%   7439    0.16%   15.182
Operatore ecologico                        470   0.28%   15913   0.35%   33.857
Letteratura Generale                       445   0.26%   8715    0.19%   19.584
Diritto privato                            435   0.26%   9850    0.22%   22.644
Coadiutore amministrativo                  431   0.26%   11162   0.25%   25.898
Diritto Commerciale                        410   0.24%   14765   0.33%   36.012
Operaio qualificato                        405   0.24%   8368    0.19%   20.662
Scienze della Riabilitazione               395   0.23%   11059   0.24%   27.997
Letteratura italiana                       393   0.23%   8626    0.19%   21.949
Diritto Civile                             391   0.23%   10556   0.23%   26.997
Storia contemporanea                       370   0.22%   7807    0.17%   21.1
Fisica                                     356   0.21%   8969    0.2%    25.194
Scienze della formazione                   350   0.21%   20550   0.45%   58.714
Ragionamento numerico                      350   0.21%   10376   0.23%   29.646
francese                                   345   0.21%   6646    0.15%   19.264
Logica : Completa la frase                 345   0.21%   10406   0.23%   30.162
Biologia                                   336   0.2%    10512   0.23%   31.286
Logica (Miscellanea)                       335   0.2%    12985   0.29%   38.761
istruttore direttivo amministrativo contabile   325   0.19%   8917    0.2%    27.437
Architettura                               322   0.19%   17576   0.39%   54.584
educatore asilo nido                       310   0.18%   7536    0.17%   24.31
Istruttore contabile                       300   0.18%   7257    0.16%   24.19
Scienze dello sport e della prestazione fisica   291   0.17%   9354    0.21%   32.144
Testo Unico Enti Locali                     280   0.17%   9419    0.21%   33.639
Medicina e Chirurgia in lingua Inglese      277   0.16%   16414   0.36%   59.256
Scienze del servizio sociale                260   0.15%   7503    0.17%   28.858
Contabilità aziendale                       256   0.15%   6342    0.14%   24.773
Mediatore Marittimo                         244   0.14%   5302    0.12%   21.73
Magistrato                                  241   0.14%   10015   0.22%   41.556
Assistente sociale, Psicologo, Educatore, Sociologo   240   0.14%   5370    0.12%   22.375
Geografia fisica                            240   0.14%   4458    0.1%    18.575
Coordinatore amministrativo                 240   0.14%   10142   0.22%   42.258
Logica :Test delle serie                    239   0.14%   6145    0.14%   25.711
Scienze                                     235   0.14%   3712    0.08%   15.796
Esecutore amministrativo                    230   0.14%   6441    0.14%   28.004
Demografia                                  228   0.14%   4969    0.11%   21.794
Capacità verbale                            220   0.13%   4866    0.11%   22.118
Professioni Sanitarie tecniche diagnostiche   210   0.12%   4847    0.11%   23.081
storia d’Europa                             208   0.12%   5217    0.12%   25.082
Economia                                    200   0.12%   2972    0.07%   14.86
geografia Europa                            190   0.11%   2535    0.06%   13.342
Software                                    190   0.11%   3557    0.08%   18.721
Unione Europea                              185   0.11%   3547    0.08%   19.173
Biotecnologie                               185   0.11%   5741    0.13%   31.032
Scienze e Tecnologie Viticole ed Enologiche   185   0.11%   2685    0.06%   14.514
Assistente sociale                          185   0.11%   5814    0.13%   31.427
Diritto pubblico                            180   0.11%   4807    0.11%   26.706
Internet                                    170   0.1%    2648    0.06%   15.576
Facoltà di Medicina e Chirurgia             170   0.1%    12928   0.29%   76.047
Arte                                        165   0.1%    4065    0.09%   24.636
Diritto internazionale                      165   0.1%    2874    0.06%   17.418
Aggettivi                                   150   0.09%   2329    0.05%   15.527
Diritto Tributario                          150   0.09%   1815    0.04%   12.1
Legislazione fiscale                        148   0.09%   2432    0.05%   16.432
diritti                                     140   0.08%   5233    0.12%   37.379
Laurea in Chimica                           140   0.08%   3448    0.08%   24.629
profilo psicoattitudinale                   135   0.08%   5020    0.11%   37.185
Esperto Amministrativo                      135   0.08%   4368    0.1%    32.356
spagnolo                                    130   0.08%   2332    0.05%   17.938
tedesco                                     130   0.08%   2083    0.05%   16.023
Geometria                                   130   0.08%   3002    0.07%   23.092
Azienda Pubblica Servizi alla Persona (ASP)   125   0.07%   3114    0.07%   24.912
Management Pubblico                         125   0.07%   2016    0.04%   16.128
Assistente educativo                        120   0.07%   2837    0.06%   23.642
Assistente familiare                        119   0.07%   2503    0.06%   21.034
Camera di Commercio                         102   0.06%   2643    0.06%   25.912
geografia Mondiale                          100   0.06%   1592    0.04%   15.92
Scienze della Comunicazione                 90    0.05%   1787    0.04%   19.856
Ortografia                                  90    0.05%   1141    0.03%   12.678
Addetto Amministrativo                      90    0.05%   1609    0.04%   17.878
Collaboratore Tecnico Professionale         90    0.05%   2483    0.05%   27.589
Pronomi                                     85    0.05%   1640    0.04%   19.294
Hardware                                   80           0.05%          1232     0.03%   15.4
Assistente contabile                       80           0.05%          1833     0.04%   22.913
Economia aziendale                         77           0.05%          3993     0.09%   51.857
Cuoco                                      75           0.04%          1130     0.03%   15.067
Banca d’Italia                             75           0.04%          3362     0.07%   44.827
Poliziotto di stato (Commissario)          70           0.04%          1699     0.04%   24.271
Statistica                                 70           0.04%          1191     0.03%   17.014
Esperto Amministrativo Contabile           69           0.04%          1531     0.03%   22.188
operatore sociale                          65           0.04%          965      0.02%   14.846
Facoltà di Economia                        62           0.04%          3599     0.08%   58.048
Nomi                                       60           0.04%          868      0.02%   14.467
Facoltà di Architettura                    60           0.04%          5704     0.13%   95.067
Esperto Tecnico                            60           0.04%          3251     0.07%   54.183
Ostetricia                                 60           0.04%          1758     0.04%   29.3
operatore tecnico                          60           0.04%          1300     0.03%   21.667
autista di ambulanza                       60           0.04%          1300     0.03%   21.667
istruttore direttivo socio culturale       54           0.03%          1341     0.03%   24.833
lingue straniere                           50           0.03%          762      0.02%   15.24
Amministrativo giuridico                   50           0.03%          1899     0.04%   37.98
tuel                                       50           0.03%          1363     0.03%   27.26
Curiosi,strani,imprevedibili               49           0.03%          1088     0.02%   22.204
storia Antichità                           40           0.02%          669      0.01%   16.725
Testo Unico imposte sui redditi            40           0.02%          1204     0.03%   30.1
Scienze dei Beni culturali                 40           0.02%          502      0.01%   12.55
Diritto regionale                          40           0.02%          1235     0.03%   30.875
Poliziotto di stato                        40           0.02%          828      0.02%   20.7
Sillabe                                    40           0.02%          730      0.02%   18.25
Avvocato                                   40           0.02%          1052     0.02%   26.3
Letteratura Europea                        35           0.02%          775      0.02%   22.143
istruttore direttivo contabile             25           0.01%          1096     0.02%   43.84
Lauree triennali delle professioni sanitarie   20           0.01%          411      0.01%   20.55
ammissione all’università                  20           0.01%          538      0.01%   26.9
Cinema e Teatro                            15           0.01%          354      0.01%   23.6
Dittonghi                                  15           0.01%          195      0.0%    13.0
Accenti                                    12           0.01%          135      0.0%    11.25
Facoltà di Scienze Motorie                 12           0.01%          167      0.0%    13.917
Congiunzioni                               10           0.01%          130      0.0%    13.0

                                  Table 3: Level 1 of the taxonomy, Mult-IT-C
Figure 9: Quiz percentage distribution, taxonomy level 1 (top 30 categories, Mult-IT-C)


  Category                   Total Quizzes   Quizzes Percentage   Total Tokens   Tokens Percentage   Avg Token-Quiz
  Altre Scienze e Tecniche           41006               29.03%        1092538              28.54%            26.643
  Società e Diritto                  39812               28.19%        1054559              27.55%            26.488
  Altro                              33615                23.8%         988679              25.83%            29.412
  Cultura                             9888                 7.0%         223605               5.84%            22.614
  Lingua                              9071                6.42%         233049               6.09%            25.692
  Matematica e Logica                 5288                3.74%         182652               4.77%            34.541
  Scienze MMFFNN                      2559                1.81%          53028               1.39%            20.722

Table 4: Level 3 of the taxonomy, Mult-IT-C
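The derived columns of Table 4 can be recomputed from the two raw counts per category. The following is a minimal sketch, assuming that both percentage columns are taken over the totals of the seven rows listed and that the last column is tokens divided by quizzes; the names `TABLE_4` and `derived_columns` are illustrative and not part of the released dataset or code.

```python
# (total quizzes, total tokens) per category, copied from Table 4.
TABLE_4 = {
    "Altre Scienze e Tecniche": (41006, 1092538),
    "Società e Diritto": (39812, 1054559),
    "Altro": (33615, 988679),
    "Cultura": (9888, 223605),
    "Lingua": (9071, 233049),
    "Matematica e Logica": (5288, 182652),
    "Scienze MMFFNN": (2559, 53028),
}


def derived_columns(table):
    """Return {category: (quiz %, token %, average tokens per quiz)}.

    Assumption: percentages are relative to the sum over all rows of `table`.
    """
    total_quizzes = sum(q for q, _ in table.values())
    total_tokens = sum(t for _, t in table.values())
    return {
        name: (
            round(100 * q / total_quizzes, 2),
            round(100 * t / total_tokens, 2),
            round(t / q, 3),
        )
        for name, (q, t) in table.items()
    }


for name, (quiz_pct, token_pct, avg) in derived_columns(TABLE_4).items():
    print(f"{name:<26} {quiz_pct:>6}% {token_pct:>6}% {avg:>8}")
```

Under these assumptions the rounded values agree with the percentages and averages reported in the table.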


Figure 10: Quiz percentage distribution, taxonomy level 2 (all the categories, Mult-IT-C)
Figure 11: Quiz percentage distribution, taxonomy level 3 (top 15 categories, Mult-IT-C)

B. Appendix B: Distribution of position of correct answer (Mult-IT-C)


Figure 12: Distribution of the correct answer’s position compared with a random distribution. Fewer items fall on x-axis
values 3 and 4, as expected, because only some questions have 4 or 5 possible choices, respectively.
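To make the random baseline concrete: when questions have different numbers of choices, a uniformly random correct position does not yield a flat histogram, because later positions exist only for the longer questions. The sketch below computes the expected count per zero-indexed position analytically. It is one possible way to obtain such a baseline, not necessarily the procedure used to produce the figure; the function name `expected_position_counts` and the toy input are hypothetical.

```python
from collections import Counter


def expected_position_counts(num_choices):
    """Expected number of correct answers at each zero-indexed position.

    `num_choices` holds the number of answer options of every question.
    Assumption: for a question with k options, the correct answer is
    equally likely to sit at any of the positions 0..k-1.
    """
    size_counts = Counter(num_choices)
    expected = [0.0] * max(size_counts)
    for k, n_questions in size_counts.items():
        for position in range(k):
            # Each of the n_questions contributes probability 1/k here.
            expected[position] += n_questions / k
    return expected


# Toy example: three 3-choice questions, one 4-choice, one 5-choice.
baseline = expected_position_counts([3, 3, 3, 4, 5])
print(baseline)
```

In the toy example, positions 3 and 4 receive far less expected mass than positions 0 to 2, which mirrors the drop the caption describes for questions with 4 or 5 choices.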
C. Appendix C: Distribution of position of correct answer (Mult-IT-A)


Figure 13: Mult-IT-A: Distribution of the correct answer’s position compared with a random distribution. Fewer items fall on
x-axis value 4, as expected, because only 13.65% of the questions have 5 possible choices.