<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Mult-IT Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Rinaldi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo Gili</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Francis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Mattia Goffetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malvina Nissim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Alpha Test S.r.l.</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CLCG, University of Groningen</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Trento</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Turin</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>Multiple-choice question answering (MCQA) is a powerful tool for evaluating the factual knowledge and reasoning capacities of Large Language Models (LLMs). However, there is a lack of large-scale MCQA datasets originally written in Italian. Existing Italian MCQA benchmarks are often automatically translated from English, an approach with two key drawbacks: firstly, automatic translations may sound unnatural, contain errors, or use linguistic constructions that do not align with the target language; secondly, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. To address this gap, we present Mult-IT, an MCQA dataset comprising over 110,000 manually written questions across a wide range of topics. All questions are sourced directly from preparation quizzes for Italian university entrance exams, or from exams for public sector employment in Italy. We are hopeful that this contribution enables a more comprehensive evaluation of LLMs' proficiency, not only in the Italian language, but also in their grasp of Italian cultural and contextual knowledge.</p>
      </abstract>
      <kwd-group>
<kwd>CALAMITA Challenge</kwd>
        <kwd>Italian</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Multiple-Choice Questions</kwd>
        <kwd>LLMs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Challenge: Introduction and Motivation</title>
      <p>Automatic translations may sound unnatural, contain errors, disrupt the coherence of discourse, and encourage the presence of linguistic constructions that reflect the source language rather than the target one [9]. The second issue relates to culture and societal norms: text translated from English to Italian will lack topical biases, preferences, conventions, and ways of expressing ideas that are unique to Italian culture. Thus, while the text may be expressed in the Italian language, its content and underlying norms will continue to represent an Anglo-centric, predominantly American perspective.</p>
      <p>More generally, training data for LLMs is biased towards English content, and as a result, there is often a gap between English and non-English performance [10]. For example, the Common Crawl dataset (https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html, accessed on 18/09/24), often used as a base for more refined datasets to be employed in the pre-training of LLMs, is composed of 45% English content, while the data for languages such as Spanish, French, Italian, and Chinese are all below 5% each, with the only exceptions being Russian (6.2%) and German (5.1%).</p>
      <p>Creating a large-scale multiple-choice question answering benchmark using original Italian data will make it possible to investigate the Italian abilities of LLMs in a more natural and transparent way, possibly also leading to a better understanding of how to make multilingual models better at Italian. It will also serve as a core benchmark for assessing the performance of monolingual Italian LLMs. If similar datasets are collected natively for other languages, Mult-IT can be part of a larger MCQA benchmark which is multilingual in the truest sense.</p>
      <p>Mult-IT, presented at CALAMITA [11], is the first
massive Multi-Choice-Question-Answering dataset
specifically designed for the Italian language which draws on
Italian culture and Italian-focused knowledge. By
providing a comprehensive, culturally relevant benchmark for
the Italian language, we aim to set a precedent for the
development of similar resources in other languages and
cultures, ultimately contributing to a more diverse and
inclusive AI landscape.</p>
<p>2. Challenge: Description</p>
      <p>This challenge involves a multiple-choice question answering task. The model is prompted with a simple instruction (see Box 1), followed by a question and a set of three to five possible answers, depending on the source and topic of the question. Among these answers, only one is correct, and the others are distractors. The model is expected to identify the correct answer and return the letter corresponding to the option deemed correct.</p>
      <p>All questions in the benchmark have been manually crafted for the purpose of training or testing students, job applicants, or learners across a range of topics, including general knowledge and more specialised subjects. These questions make up the Mult-IT dataset: Multiple Choice Questions on Various Topics in Italian, which we are introducing in this contribution. The details of the dataset are described in Section 3. The defining feature of the Mult-IT challenge is that all of the MCQs are natively Italian, both in language and in content. While this is an advantage to gain a better understanding of model behaviour on Italian data, we do expect a decline in model performance. Considering that even models which have been trained on multilingual data have a heavy bias towards English and American-centric culture, it is expected that the correctness of the answers may be affected by a cultural (and possibly language) gap. Should the battery of the models tested also include Italian monolingual or bilingual English-Italian models trained on a substantial amount of Italian text, this benchmark will make it possible to underscore differences in performance possibly associated with the language specificity of such models.</p>
      <p>3. Data description</p>
      <p>Mult-IT contains quizzes designed to assess candidates' knowledge in open competitive exams, whether for admission to national universities or for positions in Italian institutions. This approach offers several advantages.</p>
      <p>First of all, these public competitions encompass very general topics such as language comprehension, basic history, and common knowledge, but also more specialised ones, focusing on specific laws needed for certain professions or the security measures required for jobs such as policemen or firefighters. Our benchmark, therefore, contains questions that range from a low level of difficulty to a very high and specific level, setting high standards for the performance of the models, and it may also be useful to assess specific knowledge valuable for the adoption of models in Public Administration scenarios. The inclusion of profession-specific questions in Mult-IT tests the ability of LLMs to apply their knowledge in practical, real-world scenarios, a feature that could prove particularly valuable in assessing the potential of AI systems to support specialised fields and decision-making processes in professional and administrative contexts in the Italian landscape. Moreover, the quizzes contained in the dataset also present challenges regarding reasoning, such as logical thinking and mathematical reasoning, as well as quizzes specifically designed to assess knowledge and mastery of the Italian language, for example, text comprehension or detailed understanding of grammatical phenomena.</p>
<p>Mult-IT consists of two core subsets, which are divided by the origin of the data. Both subsets are made of quizzes that test knowledge of general Italian culture and that are used in the public recruitment processes for government-based positions. They are described in more detail below.</p>
<p>Mult-IT-A All the materials of Mult-IT-A were obtained thanks to the generosity of Alpha Test (https://www.alphatest.it/). Alpha Test S.r.l. is an Italian publishing house and educational training company, founded in Milan in 1987, that specialises in study aid materials and courses for high school, university, professional tests, exams and certifications. Alpha Test is the main reference for high school students preparing for university admission. Each year, Alpha Test gathers new data from the entrance exams of public and private universities and military schools, mainly in the form of multiple choice questions. The publishing house enhances such materials with comments and explanations, and creates variations or completely new versions of the original quizzes. All the materials in Mult-IT-A have been sourced from original, public data, and represent a varied sample of quizzes about general culture, STEM, and juridical disciplines.</p>
      <p>Mult-IT-A Mult-IT-A is a collection of MCQs provided by Alpha Test. It contains a total of 1,692 questions in Italian, spanning over 17 categories which correspond to topics featuring in entry exams for Italian universities (see Table 1 below for details) or question answering tests employed in public competitions. The quizzes in the dataset falling into the categories of law, pedagogy, psychology and criminology originate from public competitions. For each question, four or five possible answers are provided, out of which only one is correct. An example from the topic 'sinonimi' (synonyms) is shown in Figure 1.</p>
      <p>Figure 1: Indolente e' un sinonimo di: (A) tenero, (B) doloroso, (C) pigro, (D) insensibile. (English: "Indolente" is a synonym of: (A) tender, (B) painful, (C) lazy, (D) insensitive.)</p>
<p>Both subsets share the following data fields:
• origin: It can be either 'C' or 'A', to discern whether the question belongs to Mult-IT-A or Mult-IT-C.
• question: The question.
• choices: The list of possible answers.
• answer: The array index corresponding to the correct answer in the choices array.</p>
      <p>These common fields are crucial to the evaluation task, but each of the sub-portions has additional information added.</p>
    </sec>
    <sec id="sec-2">
<title>3.2. Data format</title>
<p>Example question: In quale nazione si trova il Lago Balaton? (A) Ucraina, (B) Ungheria, (C) Romania, (D) Repubblica Ceca, (E) Bulgaria. (English: In which country is Lake Balaton located? (A) Ukraine, (B) Hungary, (C) Romania, (D) Czech Republic, (E) Bulgaria.)</p>
      <p>Mult-IT-C The dataset consists of two files: quiz.jsonl contains the actual questions, while metadata.jsonl contains additional information about the questions. An example from the quiz.jsonl file is given in Figure 4.</p>
<p>Data fields unique to this subportion are:
• quiz_id: The ID of the quiz to which the question pertains.
• question_id: The unique identifier of the question inside the quiz. In combination with the quiz_id, it forms the unique identifier of each question and can be used to retrieve the metadata of the question from the metadata.jsonl file.</p>
<p>Data fields of the metadata are:
• id: The unique identifier of the question.
• title: Title of the quiz sourced from the original website.
• tags: List of word tags.</p>
      <p>Figure 4: Two example records from the quiz.jsonl file.</p>
      <preformat>
{
  "quiz_id": 2250,
  "question_id": 20,
  "question": "La Costituzione riconosce allo Stato una potesta' legislativa esclusiva in materia di:\n\n",
  "choices": [
    "organizzazione della rete scolastica",
    "norme generali sull'istruzione",
    "ricerca scientifica e tecnologica",
    "istruzione professionale"
  ],
  "answer": 1
}
{
  "quiz_id": 1253,
  "question_id": 63,
  "question": "In un ingranaggio con piu' ruote dentate, una ruota denominata R1 ha 25 denti e fa muovere una seconda ruota denominata R2 da 50 denti, che a sua volta fa muovere una terza ruota R3 da 150 denti. Se la ruota dentata R3 fa un giro e mezzo, quanti ne fa la ruota dentata R1?\n\n",
  "choices": ["3", "5", "6", "9", "12"],
  "answer": 4
}
      </preformat>
      <p>(English gloss: the first question asks in which matters the Constitution grants the State exclusive legislative power; the second asks how many turns the 25-tooth gear R1 makes when, through the 50-tooth gear R2, it drives the 150-tooth gear R3 and R3 makes one and a half turns.)</p>
      <sec id="sec-2-1">
        <title>3.3. Zero-shot prompting</title>
        <p>We evaluate our models in a zero-shot setting, thereby
imitating the conditions of a real use-case scenario. The
prompt we chose is designed to encourage the model to
output only the letter corresponding to the answer. The
original prompt, together with its English translation, is
presented in Box 1.</p>
        <p>Box 1: Zero-shot prompt and English translation.</p>
<p>We decided to write the prompt in Italian in order to better represent a multilingual scenario. The prompt does not contain any information about the subject of the question or any other informative cues. In this way, our benchmark not only tests the model in question answering, but also indirectly tests the instruction-following abilities of the model in a language different from English.</p>
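<p>For illustration, a prompt of the shape described above can be assembled as in the sketch below; the instruction string is a placeholder, as the actual wording is the one given in Box 1.</p>
        <preformat>
# The instruction below is a placeholder; the challenge uses the exact
# Italian instruction shown in Box 1, which may differ from this wording.
INSTRUCTION = "Rispondi solo con la lettera dell'opzione corretta."

def build_prompt(question, choices):
    """Assemble instruction, question, and lettered options into one prompt."""
    letters = "ABCDE"
    options = "\n".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return f"{INSTRUCTION}\n\n{question.strip()}\n{options}"
        </preformat>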
<p>Previous work on evaluating the performance of LLMs on MCQ datasets has identified two aspects which can interfere with the model's answers and therefore with accuracy. One has to do with the order of the possible answers: Wang et al. [12] show that the first presented option tends to be preferred in the model's answer, making it quite important to take the order of possible answers into account. The other has to do with the formulation of the prompt (and even of the question): Singhal et al. [13] experiment with multiple types of prompts and show that prompt formulation affects the model's output.</p>
        <p>Because the position of the correct answer in the
original data was already randomly distributed, which we
verified with a supplementary analysis on the data (see
Appendix B), performing a random permutation of the
possible answers was not necessary.</p>
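<p>A verification of this kind amounts to counting where the gold answer sits across the dataset; a minimal sketch:</p>
        <preformat>
from collections import Counter

def answer_position_distribution(records):
    """Share of questions whose gold answer sits at each position index."""
    counts = Counter(r["answer"] for r in records)
    total = sum(counts.values())
    return {pos: n / total for pos, n in sorted(counts.items())}
        </preformat>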
      </sec>
      <sec id="sec-2-2">
        <title>3.4. Data statistics</title>
        <p>Mult-IT-A The Mult-IT-A dataset is composed of 1,692
questions, spanning over 17 topics, all centered around
knowledge required for entry exams at Italian
Universities. The topics, and some additional information on the
dataset composition, are provided in Table 1.</p>
<p>On average, questions are 83.76 characters long and contain 25 tokens counted with the tiktoken (https://github.com/openai/tiktoken) cl100k_base tokenizer, or 16.5 if counted with the spaCy (https://github.com/explosion/spaCy) library using the it_core_news_lg model (https://github.com/explosion/spacy-models/releases/tag/it_core_news_lg-3.7.0).</p>
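<p>Both token counts can be reproduced with a few lines of Python, assuming tiktoken and the spaCy Italian model are installed:</p>
        <preformat>
import tiktoken  # pip install tiktoken
import spacy     # pip install spacy; python -m spacy download it_core_news_lg

enc = tiktoken.get_encoding("cl100k_base")
nlp = spacy.load("it_core_news_lg")

def token_counts(text):
    """Return (tiktoken cl100k_base count, spaCy token count) for one question."""
    return len(enc.encode(text)), len(nlp(text))
        </preformat>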
<p>Further statistics about quiz distribution and answer position are available in Appendices B and C.</p>
<p>It is worth noting that permuting the order of the answers would be recommended to avoid any kind of imbalance, as Mult-IT-A shows an uneven correct answer distribution leaning heavily on the first choice, inherited from the source data.</p>
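<p>One way to remove such an inherited bias is to permute each question's options and remap the gold index accordingly; a sketch, seeded per question for reproducibility:</p>
        <preformat>
import random

def shuffle_choices(record, seed=0):
    """Copy a record with permuted choices and a remapped gold index."""
    rng = random.Random(f"{seed}-{record['question']}")  # reproducible per question
    order = list(range(len(record["choices"])))
    rng.shuffle(order)
    shuffled = [record["choices"][i] for i in order]
    new_answer = order.index(record["answer"])  # new position of the old gold answer
    return {**record, "choices": shuffled, "answer": new_answer}
        </preformat>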
<p>Mult-IT-C The Mult-IT-C dataset is composed of 108,773 questions divided into 4,129 quizzes.</p>
        <p>To avoid confusion, we decided to give unequivocal names to the items of the subsets. A "quiz" is defined as a set of multiple "questions". Quizzes come from real-world examples, so they are provided with a specific name and a categorisation originating from the original data source.</p>
      </sec>
    </sec>
    <sec id="sec-3">
<title>Table 1: Composition of Mult-IT-A by topic</title>
      <p>Topic | #Questions | #Tokens | Avg Tokens/Question
informatica | 128 | 1997 | 15.6
sintassi | 121 | 3617 | 29.9
grammatica | 119 | 2429 | 20.4
completamento frasi | 115 | 3034 | 26.4
geografia | 114 | 1247 | 10.9
geometria | 114 | 3541 | 31.1
ortografia | 113 | 2273 | 20.1
biologia | 105 | 4588 | 43.7
storia | 100 | 1744 | 17.4
psicologia e sociologia del disadattamento | 100 | 2890 | 28.9
elementi di criminologia | 100 | 2386 | 23.9
pedagogia | 100 | 2271 | 22.7
elementi di diritto costituzionale ed amministrativo | 100 | 1813 | 18.1
sinonimi | 98 | 1667 | 17.0
chimica | 83 | 2983 | 35.9
fisica | 43 | 2251 | 52.3
deduzione logica | 39 | 1247 | 32.0
Total | 1692 | 41978 | 24.8</p>
      <p>A quiz can contain a variable number of questions. The average number of questions per single quiz is 26, and the maximum is 250. There are 1623 quizzes with more than 25 items, 298 with more than 50, and only 22 with more than 100 items.</p>
      <p>The original categorisation made by the authors of the website "Concorsi Pubblici" (https://www.concorsipubblici.com/) was problematic for our purposes: some categories were near-duplicates of each other, containing only slightly different words. Moreover, we believed that 186 categories were too many for a meaningful visualisation and management of the data. For this reason, we created a hierarchy of three levels, in which the first (bottom) level corresponds to the original categorisation of the data, the second level groups the categorisation into 36 areas, and the third level, the most abstract, has only 7 categories. The drawback of this approach, as can be seen in the tables and graphs contained in Appendix A, is that in both the supplementary categorisations a significant amount of quizzes falls into the category "Other".</p>
      <p>Nonetheless, we believe that this abstract categorisation can be good for having a general look at the data composition, and thus at the performance of the models in terms of macro-areas. On the other hand, keeping the original, very detailed categorisation in the data allows for more in-depth analysis of model performance in specific aspects. In Appendix A, all the statistics of the 186 categories are listed in the form of a table. To appreciate the level of specificity reached by the first level of categorisation, it is interesting to notice, as examples, categories such as "Verbs", "Diphthongs", or "Word Meanings", referring to specific language abilities. These categories are then grouped in level 2 as "Linguistic Competence" and in level 3 as "Language". As another example, we can see categories that refer to specific aspects of the Italian Public Administration: level 1 contains fields such as "INPS", that is, the National Institute for Social Security, or "ASL", that is, the Local Health Authority.</p>
      <p>We believe that having such a precise categorisation at our disposal is of great help in understanding the abilities and weaknesses of models in very specific aspects, thus being helpful, on the one hand, for assessing the possibility of direct practical employment of models in the Italian public administration and, on the other hand, for improving the scientific understanding of models and how they deal with different kinds of challenges. This last aspect can also be helpful for interpretability studies of LLMs.</p>
      <p>On average, questions are 104 characters long; they contain 27.5 tokens counted with the tiktoken cl100k_base tokenizer, or 19.8 if counted with the spaCy library using the it_core_news_lg model. The longest question is 1363 tokens long.</p>
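<p>With such a hierarchy, per-macro-area scores reduce to a dictionary lookup before aggregation. The sketch below uses only the mapping examples named in the text (with the Italian level-2 and level-3 labels from the category list); the full mapping covers all 186 level-1 categories.</p>
      <preformat>
from collections import defaultdict

# Illustrative fragment of the hierarchy; examples named in the text only.
LEVEL2 = {
    "Verbs": "Competenza Linguistica",
    "Diphthongs": "Competenza Linguistica",
    "Word Meanings": "Competenza Linguistica",
}
LEVEL3 = {"Competenza Linguistica": "Lingua"}

def accuracy_by_area(results, level2=LEVEL2):
    """Aggregate (level-1 category, correct?) pairs into level-2 macro-areas."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, correct in results:
        area = level2.get(category, "Other")  # unmapped categories fall into "Other"
        totals[area] += 1
        hits[area] += int(correct)
    return {area: hits[area] / totals[area] for area in totals}
      </preformat>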
      <sec id="sec-3-1">
        <title>4. Evaluation</title>
        <p>We will use accuracy to evaluate the LLMs’ performance
on Mult-IT. Accuracy is defined as the ratio of correctly
answered questions to the total number of questions, and
it is a straightforward and easily interpretable measure
of performance on MCQ tasks. Accuracy will be reported
overall, and also separately for the two subsets Mult-IT-A
and Mult-IT-C.</p>
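<p>The metric itself is simple to compute; below is a sketch that also reports the two subsets separately, assuming each record carries the origin field described in Section 3:</p>
        <preformat>
def accuracy(preds, golds):
    """Correctly answered questions divided by total questions."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_report(records, preds):
    """Overall accuracy plus Mult-IT-A / Mult-IT-C splits via `origin`."""
    golds = [r["answer"] for r in records]
    report = {"overall": accuracy(preds, golds)}
    for tag in ("A", "C"):
        idx = [i for i, r in enumerate(records) if r.get("origin") == tag]
        if idx:
            report[f"Mult-IT-{tag}"] = accuracy([preds[i] for i in idx],
                                                [golds[i] for i in idx])
    return report
        </preformat>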
<p>While accuracy is indeed a straightforward evaluation metric for this task, deciding which answer the model has identified as correct is not necessarily as straightforward, for a couple of reasons.</p>
<p>As mentioned in Section 3.3, the position of the correct answer in the prompt is randomly distributed, reducing the likelihood of bias resulting from its placement, although the models might have a tendency to select the first answer more frequently.</p>
        <p>A related issue is the fact that the model’s output, in
spite of the specific request in the prompt, might not
always just be the letter corresponding to the chosen
answer. In the case of longer outputs, simple regular
expressions will be applied to extract the relevant letter.</p>
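<p>The exact expressions are not fixed here; a minimal version could simply take the first standalone option letter in the output:</p>
        <preformat>
import re

def extract_letter(output):
    """Pull the first standalone option letter A-E from a model response."""
    match = re.search(r"\b([A-E])\b", output.strip())
    return match.group(1) if match else None
        </preformat>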
<p>In practice, as for all the CALAMITA challenges, the evaluation of the LLMs on Mult-IT will be carried out with the LM-evaluation-harness framework developed by EleutherAI (https://github.com/EleutherAI/lm-evaluation-harness).</p>
        <p>(Level-2 taxonomy categories of Mult-IT-C: Altre, Medicina, Corpo Pubblico, Giurisprudenza, Competenza Linguistica, Cultura Generale, Informatica, Logica, Farmacia, Geografia, Storia, APES, Scienze Motorie, Matematica, Lingua, Pubblica Amministrazione, Educazione civica, Letteratura, Biochimica, Chimica, Istruzione, Architettura, Fisica, Biologia, Economia, Scienze, Biotecnologie, Scienze naturali, Arte, profilo psicoattitudinale, Scienze della Comunicazione, Cucina, Scienze dei Beni culturali.)</p>
      </sec>
      <sec id="sec-3-2">
        <title>5. Limitations</title>
<p>The vast majority of the data comes from sources linked with Italian public institutions, and can be considered official documents. For this reason, we expect a high quality in the formulation of the quizzes and the correctness of the answers. Nonetheless, given the large amount of data, we cannot guarantee the absence of errors in individual questions. Human errors can happen, even in official selections, although this should be a rare occurrence. This aspect can be improved by analysing the results obtained by the models on the benchmark: the more the benchmark is used, the more it will be possible to isolate and eventually remove or correct problematic quizzes with data analytics techniques.</p>
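<p>One simple instance of such an analysis, sketched below, flags questions that none of the tested models answers correctly as candidates for manual review:</p>
        <preformat>
def flag_suspect_questions(per_model_correct):
    """per_model_correct maps a question id to booleans, one per tested model.
    Questions no model answers correctly may be mislabelled, ambiguous,
    or outdated, and deserve manual review."""
    return [qid for qid, outcomes in per_model_correct.items()
            if not any(outcomes)]
        </preformat>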
<p>Moreover, considering that the quizzes span almost thirty years, it is possible that some quizzes, particularly the ones regarding laws, may be outdated. Nonetheless, thanks to the availability of metadata, it is possible to further refine this dataset to also account for specific historical knowledge about laws by providing metadata to the model. However, we believe that for the first run of this evaluation this point will not create particular issues, as we expect the potentially outdated questions to be limited.</p>
        <p>Given that the data is public, it is possible that the original exams are already present in the models' training data, as they can be easily obtained on the Internet. At the same time, it is likely that some sources, for example complete laws of the Italian legislation, are present in the training data, but we consider this eventuality positive, given that one of the benchmark's aims is to evaluate the knowledge and capacity of the model to adapt to the Italian landscape.</p>
        <p>6. Data license and copyright issues</p>
        <p>Information about license and copyright issues is mandatory.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Acknowledgments</title>
<p>The authors would like to thank Alpha Test (https://www.alphatest.it), and in particular Martha Fabbri, for their interest in the Mult-IT CALAMITA challenge and for the extremely valuable exchange of ideas and data, which allowed us to shape a task of high potential impact also in the field of educational training and assessment.</p>
        <p>7. Online Resources</p>
        <p>The sources for the ceur-art style are available via
• GitHub,
• Overleaf template.</p>
        <p>[4] P. Wang, A. Chan, F. Ilievski, M. Chen, X. Ren, PINTO: Faithful language reasoning using prompt-generated rationales, in: Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022, 2022.</p>
        <p>[5] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, in: International Conference on Learning Representations, 2021.</p>
        <p>[6] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al., MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, arXiv preprint arXiv:2406.01574 (2024).</p>
        <p>[7] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, 2016. URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.</p>
        <p>[8] D. Croce, A. Zelenanska, R. Basili, Neural learning for question answering in Italian, in: C. Ghidini, B. Magnini, A. Passerini, P. Traverso (Eds.), AI*IA 2018 – Advances in Artificial Intelligence, Springer International Publishing, Cham, 2018, pp. 389–402.</p>
        <p>[9] I. Plaza, N. Melero, C. del Pozo, J. Conde, P. Reviriego, M. Mayor-Rocher, M. Grandury, Spanish and LLM benchmarks: is MMLU lost in translation?, arXiv preprint arXiv:2406.17789 (2024).</p>
        <p>[10] V. Lai, N. Ngo, A. Pouran Ben Veyseh, H. Man, F. Dernoncourt, T. Bui, T. Nguyen, ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 13171–13189.</p>
        <p>[11] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.</p>
        <p>[12] P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, Z. Sui, Large language models are not fair evaluators, 2023. arXiv:2305.17926.</p>
        <p>[13] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal, et al., Towards expert-level medical question answering with large language models, arXiv preprint arXiv:2305.09617 (2023).</p>
<p>A. Appendix A: Detailed statistics per category (Mult-IT-C)</p>
        <p>(Table columns: Total Quizzes, Quizzes Percentage, Total Tokens, Tokens Percentage.)</p>
        <p>Figure 10: Quiz percentage distribution, taxonomy level 2 (all the categories, Mult-IT-C)</p>
        <p>B. Appendix B: Distribution of position of correct answer (Mult-IT-C)</p>
        <p>C. Appendix C: Distribution of position of correct answer (Mult-IT-A)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kleyjo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tian</surname>
          </string-name>
          , A. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , et al.,
<article-title>Benchmarking large language models on CMExam - a comprehensive Chinese medical exam dataset</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>TruthfulQA: Measuring how models mimic human falsehoods</article-title>
          , in: S. Muresan, P. Nakov, A. Villavicencio (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp. 3214–3252. URL: https://aclanthology.org/2022.acl-long.229. doi:10.18653/v1/2022.acl-long.229.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>