=Paper=
{{Paper
|id=Vol-3878/129_calamita_long
|storemode=property
|title=INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge
|pdfUrl=https://ceur-ws.org/Vol-3878/129_calamita_long.pdf
|volume=Vol-3878
|authors=Giovanni Puccetti,Maria Cassese,Andrea Esuli
|dblpUrl=https://dblp.org/rec/conf/clic-it/0002CE24
}}
==INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge==
Giovanni Puccetti (1, ∗), Maria Cassese (1) and Andrea Esuli (1)
(1) Istituto di Scienza e Tecnologia dell’Informazione - CNR
Abstract
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate the generative abilities of Language Models (LMs) in this language. This work presents two new benchmarks: Invalsi MATE, to evaluate models’ performance on mathematical understanding in Italian, and Invalsi ITA, to evaluate language understanding in Italian.
These benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system. These tests are prepared by expert pedagogists and have the explicit goal of testing average students’ performance over time across Italy. Therefore, the questions are well written, appropriate for the age of the students, and developed with the goal of assessing skills that are essential in the learning process, ensuring that the benchmark proposed here measures key knowledge for undergraduate students.
Invalsi MATE is composed of 420 questions about mathematical understanding. These questions range from simple money-counting problems to Cartesian geometry questions, e.g. determining whether a point belongs to a given line. They are divided into 4 types: scelta multipla (multiple choice), vero/falso (true/false), numero (number), completa frase (fill the gap).
Invalsi ITA is composed of 1279 questions regarding language understanding. These questions involve both the ability to extract information and answer questions about a text passage and questions about grammatical knowledge. They are divided into 4 types: scelta multipla (multiple choice), binaria (binary), domanda aperta (open question), altro (other).
We evaluate 4 powerful language models, both English-first and tuned for Italian, and find that the best accuracy on Invalsi MATE is 55%, while the best accuracy on Invalsi ITA is 80%.
Keywords
Mathematical Understanding, Language Understanding, Invalsi, Large Language Models, Italian Language Models
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
Email: giovanni.puccetti@isti.cnr.it (G. Puccetti); maria.cassese@isti.cnr.it (M. Cassese); andrea.esuli@isti.cnr.it (A. Esuli)
Web: https://gpucce.github.io/ (G. Puccetti); https://www.isti.cnr.it/en/about/people-detail/1072/Maria_Cassese (M. Cassese); https://www.esuli.it/ (A. Esuli)
ORCID: 0000-0002-5725-4322 (A. Esuli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Challenge: Introduction and Motivation
Assessing the quality of Large Language Models is a challenging task because these models can perform virtually any task that can be presented through natural language. To address this difficulty, each model needs to be tested on several tasks at once.
To help provide new benchmarks to evaluate LLMs in Italian, we propose two benchmarks, Invalsi MATE and Invalsi ITA, the first meant to evaluate LLMs’ mathematical understanding and the second to evaluate their language understanding, both in Italian.
These benchmarks originate from the Invalsi tests, which have been used in the past for demographic studies [1, 2, 3], but, to the best of our knowledge, we are the first to use them to test LLMs’ performance in Italian [4], followed only later by others [5].
There are several benchmarks to evaluate the mathematical understanding of LLMs based on English tests [6, 7, 8], and there are also several multi-domain benchmarks involving Italian [9]; however, there aren’t any specifically focused on mathematical understanding in Italian. We focus on high-school questions; an English benchmark similar to Invalsi MATE is GSM8k [8], which contains 8,500 grade-school math questions.
Language models’ understanding of English is also well studied: there are several benchmarks meant to measure the ability of language models to understand language constructs in English, such as [10, 11], also arranged into extensive suites [12]. In contrast, there are fewer examples of these tests for the Italian language.
Therefore we propose Invalsi ITA, which contains questions of the kinds that are usually split among several different benchmarks, e.g. MNLI [13], SQuAD [14] and others from the GLUE suite [15]. The questions in the dataset cover several aspects of language understanding, ranging from the ability to extract specific information, such as the date when something happened, to more complex information, such as whether two events imply each other or not.
Figure 1: Examples of each question type from the Invalsi MATE dataset.
(a) scelta multipla question from Invalsi MATE.
Testo: Elisa è uscita da casa questa mattina alle ore 8:15. Elisa è rientrata nel pomeriggio alle ore 1:15
Domanda: Quanto tempo è stata fuori casa Elisa? A. 5 ore B. 7 ore C. 9 ore D. 11 ore
(English: Elisa left home this morning at 8:15. Elisa came back in the afternoon at 1:15. How long was Elisa away from home? A. 5 hours B. 7 hours C. 9 hours D. 11 hours)
(b) vero/falso question from Invalsi MATE.
Testo: Se moltiplichi per 2 un numero naturale e dal risultato sottrai 1, ottieni sempre un numero pari.
Domanda: Vero o Falso?
(English: If you multiply a natural number by 2 and subtract 1 from the result, you always get an even number. True or False?)
(c) numero question from Invalsi MATE.
Testo: Filippo dice: per trovare il numero della mia maglietta aggiungi una decina e sei unità al numero 4.
Domanda: Qual è il numero della maglietta di Filippo?
(English: Filippo says: to find the number on my shirt, add one ten and six units to the number 4. What is the number on Filippo's shirt?)
(d) completa frase question from Invalsi MATE.
Testo: Luca lancia due dadi a sei facce non truccati.
Domanda: Completa la frase inserendo una delle espressioni: La probabilità che la somma dei punti sia 12 è maggiore della, minore della, uguale alla probabilità che la somma sia 2.
(English: Luca rolls two fair six-sided dice. Complete the sentence by inserting one of the expressions: The probability that the sum of the points is 12 is greater than, less than, equal to the probability that the sum is 2.)
These two datasets allow us to measure two key abilities of language models in Italian. To make the comparison among different models fairer, we cast all questions as multiple choice and measure models’ performance by selecting the answer with the highest likelihood according
to the model.
We measure the performance of 4 strong large models, mixtral instruct [16], mistral instruct [17], llama 3 8b instruct [18] and anita 8b dpo [19], together with minerva 3b, currently the only Italian-first model. Of the four, the first three are English-first and the fourth is fine-tuned in Italian. We show that on Invalsi ITA the best model among those we tested is mixtral instruct, which reaches an accuracy of 0.80, while on Invalsi MATE the highest accuracy is 0.55, also achieved by mixtral instruct.
Both language understanding and mathematical understanding are key abilities for students as well as language models, particularly since these models are often used in learning environments. By adding these benchmarks to the CALAMITA suite we hope they will help the development of LLMs in Italian by providing a more comprehensive evaluation of their abilities, thus fostering the research and development of models in this language.
The CALAMITA special event [20], which aims to establish a shared benchmark for LLMs in Italian, is a first step towards a systematic evaluation of LLMs in this language. We hope that the Invalsi challenge will enrich the Linguistic and Mathematical understanding branches of this shared benchmark.
2. Challenge: Description
The challenge is composed of two tasks: Invalsi MATE and Invalsi ITA. For each task, we provide a detailed description of the data, the metrics used for evaluation, and the limitations of the data.
2.1. Task 1: Mathematical Understanding in Italian (Invalsi MATE)
The first task consists in answering mathematical questions in Italian. These questions are meant for students from 6 to 18 years of age, therefore the kind of question can vary significantly, from simpler, example-based ones that don’t require any knowledge besides counting, to more complex ones requiring basic geometry and calculus training and knowledge, never beyond what is demanded in basic high-school tests.
The questions are of 4 kinds, scelta multipla, completa frase, vero/falso and numero:
• scelta multipla (multiple choice): the question requires picking the right answer among four possible ones;
• vero/falso (true/false): the question requires picking the right answer between true and false;
• numero (number): the question requires giving a number that is the correct answer to the question;
• completa frase (fill the gap): the question requires filling in one or more missing words to make the text coherent.
Figure 2: The distribution of question types in the Invalsi datasets: (a) Invalsi MATE, (b) Invalsi ITA.
Of the four question types, scelta multipla and vero/falso are naturally multiple choice, with scelta multipla always having 4 possible answers (A, B, C, D) and vero/falso always 2 (true, false). Questions of the numero type are not naturally multiple choice, since the answer is a
number among all possible ones (some of the answers will be a year, e.g. 1948, while others can be a decimal number of liters of milk, e.g. 0.2). To address this, we add 3 extra answers that are realistic but wrong, to make the questions multiple choice. Finally, completa frase questions, of which there are only a few (20), are too difficult to turn into multiple choice without changing their meaning, and therefore we exclude them.
2.2. Task 2: Language Understanding in Italian (Invalsi ITA)
The second task consists in answering Italian language understanding questions. Similarly to task 1, these questions are also appropriate for students between 6 and 18 years old, and they are overall not too difficult to answer. Most of the questions concern a text passage that has to be included in the model context, making this evaluation more costly because the context becomes considerably larger. The text passage is where the difference in difficulty between ages is most evident, since it can be a simple and short story for primary school students, while texts for older students are generally longer and more involved.
The questions are of 4 different types, scelta multipla, binaria, domanda aperta and altro:
• scelta multipla (multiple choice): the question requires picking the right answer among four possible ones;
• binaria (binary): the question requires picking the right answer about a binary property of a statement, e.g. True - False, Before - After, etc.
• domanda aperta (open question): the question requires picking the passage in the text that answers the question.
• altro (other): a small share of questions are open-ended questions with varying scope that are hard to put under a single label.
Similar to Invalsi MATE, this task involves only multiple-choice questions, evaluated through a likelihood approach. Both scelta multipla and binaria are naturally of this kind, the first with 4 options (A, B, C, D) and the second with 2 options that change for each question. Both domanda aperta and altro questions are hard to turn into multiple-choice ones and therefore we discard them. Also for Invalsi ITA, this means discarding only about 180 questions out of 1297, therefore the task involves 1117 samples.
3. Data description
3.1. Origin of data
The dataset is built upon the questions from the Invalsi tests of the last 15 years. These tests are administered to students yearly. There are three different Invalsi tests: Language, Mathematics and English. For the scope of this dataset we don’t look into the English test, but limit ourselves to the Italian language and Mathematics ones. The original questions can be accessed at https://www.gestinv.it/Index.aspx.
Some of the questions from the original tests contain visual content as part of the question; we omit these questions since we focus on the language understanding abilities of the models.
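Sections 2.1 and 2.2 cast every question as multiple choice and evaluate it through a likelihood approach. A minimal sketch of that selection step follows, using question (c) from Figure 1. It is not the authors' evaluation code: `pick_answer` and `toy_log_likelihood` are hypothetical names, the three distractors and all scores are invented for illustration, and a real scorer would obtain the log-likelihood of each option from a language model.

```python
from typing import Callable, Sequence


def pick_answer(
    prompt: str,
    options: Sequence[str],
    log_likelihood: Callable[[str, str], float],
) -> int:
    """Return the index of the option the scorer considers most likely."""
    scores = [log_likelihood(prompt, option) for option in options]
    return max(range(len(options)), key=lambda i: scores[i])


def toy_log_likelihood(prompt: str, option: str) -> float:
    """Hypothetical scorer with made-up values; a real one would query an LM."""
    made_up_scores = {"10": -4.1, "14": -3.2, "20": -1.3, "64": -5.0}
    return made_up_scores[option]


# Question (c) from Figure 1. The correct answer is 20 (4 + 10 + 6);
# the three distractors are invented here, not taken from the dataset.
prompt = "Qual è il numero della maglietta di Filippo?"
options = ["10", "14", "20", "64"]

chosen = pick_answer(prompt, options, toy_log_likelihood)
print(options[chosen])
```

With a real model only the scorer would change; the arg-max over the available options stays the same for scelta multipla, vero/falso, binaria and recast numero questions.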
The data is first collected as is from the webpage and afterwards manually checked for errors and inconsistencies by two annotators. The annotators have MSc degrees in Mathematics and Computer Science, which gives them sufficient knowledge to identify issues in the Invalsi MATE questions. For Invalsi ITA, the annotators don’t have a specific background; however, the questions are simple enough that they can be easily understood and checked for errors by anybody who has completed mandatory education.
3.2. Data format
The Invalsi MATE dataset has 8 different columns:
• testo: this field contains the context needed to answer the questions; it is often empty for Invalsi MATE since most of the context is part of the domanda field itself;
• domanda: this field contains the question itself, including possible answer options, e.g. for scelta multipla questions;
• risposta: this field contains the correct answer;
• test_id: this field is an id that identifies each sample;
• tipo: this field indicates the question type, among scelta multipla, vero/falso, numero and completa frase;
• alt1, alt2 and alt3: these three fields contain the alternative values for the numero questions, since we chose these ourselves and they are not indicated in the domanda field.
The Invalsi ITA dataset has the same fields as the Invalsi MATE one, with the difference that the testo field is often present and generally the longest.
We evaluate in a zero-shot fashion, providing the model with just the question and using a likelihood-based method: we pick as the model’s answer the one with the highest likelihood among the options available. This is always possible since we have recast all the questions as multiple-choice ones. We also don’t use chain-of-thought prompts or similar methods. Since this is the first attempt to build a dataset on mathematical understanding in Italian, we currently evaluate with the simplest approach.
3.3. Detailed data statistics
The data does not have a train and a test split because we have a limited number of samples. The Invalsi MATE split is composed of 420 samples, of which 400 are used in the benchmark, since we exclude the 20 questions marked as completa frase because they can’t be made into multiple choice.
Figure 2a shows the percentage of questions of each kind in the dataset: scelta multipla has the largest share, 58%, the second most frequent is numero, 24.7%, and vero/falso and completa frase are fewer. Table 1 has a random row that shows the performance if one were to pick answers at random; moreover, the correct answers for each question type are approximately evenly distributed among labels. Specifically, scelta multipla questions have answers distributed as follows: 46 are labelled A, 87 B, 71 C and 40 D, showing a moderate balance. Similarly, for vero/falso questions there are 24 questions with a positive answer and 30 with a negative answer.
Similarly, Figure 2b shows the percentage of questions of each of the kinds present in Invalsi ITA: scelta multipla is by a large margin the most frequent, composing 76.4% of all the questions, binaria is second with 10.9%, while domanda aperta and altro are fewer.
Table 2 shows the performance one can achieve picking answers at random in each split; moreover, the correct answers are evenly distributed across labels also in this dataset. In particular, for Invalsi ITA 254 of the scelta multipla questions have answer A, 255 B, 263 C and 205 D, which is comparable to the distribution in the Invalsi MATE dataset, and the binaria questions behave similarly.
Table 1: Models’ 0-shot accuracy on Invalsi MATE, likelihood-based evaluation. The highest accuracy in each column is marked with * and the second highest with †.
Model | ALL | scelta multipla | vero/falso | numero
N. Questions | 400 | 244 | 54 | 102
mixtral instruct | 0.55* | 0.49* | 0.63* | 0.66*
mistral instruct | 0.44 | 0.34 | 0.59 | 0.63†
anita 8b dpo | 0.47 | 0.40 | 0.61† | 0.55
llama 3 8b instruct | 0.48† | 0.42† | 0.57 | 0.58
minerva 3b | 0.20 | 0.22 | 0.50 | 0.32
random | 0.28 | 0.25 | 0.5 | 0.25
Table 2: Models’ 0-shot accuracy on Invalsi ITA, likelihood-based evaluation. The highest accuracy in each column is marked with * and the second highest with †.
Model | ALL | scelta multipla | binaria
N. Questions | 1117 | 977 | 140
mixtral instruct | 0.80* | 0.82* | 0.69*
mistral instruct | 0.49 | 0.60 | 0.51
anita 8b dpo | 0.71† | 0.72† | 0.66†
llama 3 8b instruct | 0.69 | 0.70 | 0.61
minerva 3b | 0.30 | 0.25 | 0.54
random | 0.27 | 0.25 | 0.44
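The random row of Table 1 and the accuracy metric both follow from simple counting. The sketch below recomputes the expected accuracy of uniform random guessing on Invalsi MATE from the question counts in Table 1, and the accuracy of always answering the most frequent scelta multipla label from the label counts in Section 3.3. The function names are illustrative, and the number of options per question type (4, 2 and 4) is taken from Section 2.1.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def random_baseline(splits):
    """Expected accuracy of uniform guessing.

    splits maps a question type to (number of questions, number of options).
    """
    total = sum(n for n, _ in splits.values())
    expected_correct = sum(n / k for n, k in splits.values())
    return expected_correct / total


# Question counts from Table 1; options per question from Section 2.1.
mate_splits = {
    "scelta multipla": (244, 4),
    "vero/falso": (54, 2),
    "numero": (102, 4),  # correct answer plus 3 added distractors
}
mate_random = random_baseline(mate_splits)

# Label counts for scelta multipla in Invalsi MATE (Section 3.3).
label_counts = {"A": 46, "B": 87, "C": 71, "D": 40}
majority_share = max(label_counts.values()) / sum(label_counts.values())

print(f"random baseline, ALL column: {mate_random:.2f}")
print(f"always answering B: {majority_share:.2f}")
print(f"toy accuracy: {accuracy(['A', 'B', 'C', 'D'], ['A', 'B', 'D', 'D']):.2f}")
```

The first value, 0.28, matches the random row of Table 1. The second, 0.36, shows how far a constant-answer strategy could exploit the moderate label imbalance on scelta multipla.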
4. Metrics
Since all the datasets and splits we study have balanced labels, we choose to measure accuracy. In particular, since all the questions in the datasets we propose are multiple choice, it is straightforward to measure accuracy even if there are questions of different kinds, by simply computing |correct answers| / |all answers|.
To see how challenging our benchmark is, we evaluate four powerful language models based on mistral, mixtral and llama 3; in particular, we measure the performance of mistral instruct, mixtral instruct, anita 8b dpo and llama 3 8b instruct.
These models have between 7 and 54 billion parameters. Three of them, mistral instruct, anita 8b dpo and llama 3 8b instruct, are dense autoregressive transformers, while mixtral instruct is a Mixture-of-Experts (MoE) architecture that has 54 billion parameters but only uses 14 billion at inference. We test them all in the same way, using a likelihood-based approach.
Table 1 shows the performance of these models on Invalsi MATE on the whole dataset, in the ALL column, and on each split, scelta multipla, vero/falso and numero, in the respective columns. mixtral instruct is the clear winner among the models we tested: it beats the second best, llama 3 8b instruct, by 7 percentage points of accuracy on the entire dataset. On the separate splits, mixtral instruct is always the best, but the second-best model changes, with llama 3 8b instruct being second in scelta multipla with a 7-point gap, anita 8b dpo second on vero/falso with a smaller 2-point gap, and mistral instruct second best in numero with a 3-point gap.
The total accuracy is bounded by 55%, showing that the Invalsi MATE task is challenging for models of the sizes we tested, up to 54B parameters, and that the performance a model achieves provides valuable insights about how well it can perform mathematical reasoning in Italian.
Table 2 shows the performance of the same models on Invalsi ITA. mixtral instruct is again the strongest, with anita 8b dpo second best. Performance on Invalsi ITA is generally higher, with mixtral instruct achieving 80% accuracy. Unlike what happens for Invalsi MATE, in Invalsi ITA the second-best model is the same across the board: anita 8b dpo is second in scelta multipla as well as in binaria, with a gap of about 10 percentage points overall and in scelta multipla, and of 3 points in binaria.
The accuracy of the best model on all the questions at once is 80%, showing that the models we tested perform well at language understanding in Italian.
5. Conclusions
We propose two tasks, Invalsi MATE and Invalsi ITA, the first for the evaluation of mathematical understanding and the second for the evaluation of language understanding in Italian.
For Invalsi MATE we have collected 420 questions divided into 4 types, scelta multipla, vero/falso, numero and completa frase, and we evaluate 4 strong language models that are near SOTA in their weight range: mixtral instruct, mistral instruct, llama 3 8b instruct and anita 8b dpo. We find that these models are still far from perfect mathematical understanding in Italian, with the highest accuracy, achieved by mixtral instruct, being 55%.
For Invalsi ITA we have collected 1297 questions divided into 4 types, scelta multipla, binaria, domanda aperta and altro. We tested the same models also on this benchmark and found that models are stronger at language understanding, with the highest accuracy in this task at 80%, also in this case achieved by mixtral instruct.
Both mathematical and language understanding are key abilities for LLMs. We believe that our two benchmarks will foster the development of LLMs in Italian and pave the way for new, more challenging benchmarks on mathematical and language understanding in Italian.
6. Limitations
The main limitation of the benchmark we propose lies in Task 2, Invalsi ITA: we show that the models we test achieve very high accuracy, up to 80%, on this benchmark, making it possibly too simple for newer and larger models. Nevertheless, current Italian-first LLMs are not comparable to larger English-first ones and therefore we believe it can still be valuable in this transitory phase. In contrast, Invalsi MATE is very challenging and it seems that models won’t saturate it soon.
We believe that there is a limited risk of contamination from both existing English and Italian tests.
Concerning direct contamination, we were unable to find any web page that would expose the answers openly without needing any sort of authentication, making it difficult to crawl these data automatically. Therefore, while the questions might be present in the training set of some of the models, we deem it unlikely that the answers were there too.
Concerning contamination through translation from English, the Invalsi questions are carefully crafted to match the grade of the students that will undertake them; therefore we believe it is unlikely that they are taken from English questions available in other online sources, but rather that they are created specifically for each new annual test.
Acknowledgments
This work was also partially supported by: FAIR (PE00000013) project under the NextGenerationEU programme, partially by the PNRR project ITSERR (CUP
B53C22001770006) and partially by the Project PRIN 2022EPTPJ9 (WEMB – “Word EMBeddings: From Cognitive Linguistics to Language Engineering, and Back”), funded by the Italian Ministry of University and Research (MUR).
References
[1] G. Bolondi, C. Cascella, Somministrazione delle prove invalsi dal 2009 al 2015: un patrimonio d’informazioni tra evidenze psicometriche e didattiche, in: I dati INVALSI: uno strumento per la ricerca, Franco Angeli, Milano, 2017, p. 14.
[2] A. Costanzo, M. Desimoni, Beyond the mean estimate: a quantile regression analysis of inequalities in educational outcomes using invalsi survey data, Large-scale Assessments in Education (2017). URL: https://doi.org/10.1186/s40536-017-0048-4. doi:10.1186/s40536-017-0048-4.
[3] J. Pietschnig, S. Oberleiter, E. Toffalini, D. Giofrè, Reliability of the g factor over time in italian invalsi data (2010-2022): What can achievement-g tell us about the flynn effect?, Personality and Individual Differences 214 (2023) 112345. URL: https://www.sciencedirect.com/science/article/pii/S0191886923002684. doi:10.1016/j.paid.2023.112345.
[4] A. Esuli, G. Puccetti, The invalsi benchmarks: measuring linguistic and mathematical understanding of large language models in italian, 2024. URL: https://arxiv.org/abs/2403.18697. arXiv:2403.18697.
[5] F. Mercorio, M. Mezzanzanica, D. Potertì, A. Serino, A. Seveso, Disce aut deficere: Evaluating llms proficiency on the invalsi italian benchmark, 2024. URL: https://arxiv.org/abs/2406.17535. arXiv:2406.17535.
[6] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, J. Steinhardt, Measuring mathematical problem solving with the math dataset, NeurIPS (2021).
[7] H. Liu, Z. Zheng, Y. Qiao, H. Duan, Z. Fei, F. Zhou, W. Zhang, S. Zhang, D. Lin, K. Chen, MathBench: Evaluating the theory and application proficiency of LLMs with a hierarchical mathematics benchmark, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024, pp. 6884–6915. URL: https://aclanthology.org/2024.findings-acl.411.
[8] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, J. Schulman, Training verifiers to solve math word problems, 2021. arXiv:2110.14168.
[9] R. J. Das, S. E. Hristov, H. Li, D. I. Dimitrov, I. Koychev, P. Nakov, Exams-v: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models, 2024. arXiv:2403.10378.
[10] L. Bentivogli, B. Magnini, I. Dagan, H. T. Dang, D. Giampiccolo, The fifth PASCAL recognizing textual entailment challenge, in: Proceedings of the Second Text Analysis Conference, TAC 2009, Gaithersburg, Maryland, USA, November 16-17, 2009, NIST, 2009. URL: https://tac.nist.gov/publications/2009/additional.papers/RTE5_overview.proceedings.pdf.
[11] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, B. V. Durme, Record: Bridging the gap between human and machine commonsense reading comprehension, 2018. URL: https://arxiv.org/abs/1810.12885. arXiv:1810.12885.
[12] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, SuperGLUE: a stickier benchmark for general-purpose language understanding systems, Curran Associates Inc., Red Hook, NY, USA, 2019.
[13] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1112–1122. URL: https://aclanthology.org/N18-1101. doi:10.18653/v1/N18-1101.
[14] P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswerable questions for SQuAD, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 784–789. URL: https://aclanthology.org/P18-2124. doi:10.18653/v1/P18-2124.
[15] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: T. Linzen, G. Chrupała, A. Alishahi (Eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 353–355. URL: https://aclanthology.org/W18-5446. doi:10.18653/v1/W18-5446.
[16] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las
Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mixtral of experts, 2024. arXiv:2401.04088.
[17] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. arXiv:2310.06825.
[18] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Esiobu, D. Choudhary, D. Mahajan, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin, E. AlBadawy, E. Lobanova, E. Dinan, E. M. Smith, F. Radenovic, F. Zhang, G. Synnaeve, G. Lee, G. L. Anderson, G. Nail, G. Mialon, G. Pang, G. Cucurell, H. Nguyen, H. Korevaar, H. Xu, H. Touvron, I. Zarov, I. A. Ibarra, I. Kloumann, I. Misra, I. Evtimov, J. Copet, J. Lee, J. Geffert, J. Vranes, J. Park, J. Mahadeokar, J. Shah, J. van der Linde, J. Billock, J. Hong, J. Lee, J. Fu, J. Chi, J. Huang, J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park, J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala, K. Upasani, K. Plawiak, K. Li, K. Heafield, K. Stone, K. El-Arini, K. Iyer, K. Malik, K. Chiu, K. Bhalla, L. Rantala-Yeary, L. van der Maaten, L. Chen, L. Tan, L. Jenkins, L. Martin, L. Madaan, L. Malo, L. Blecher, L. Landzaat, L. de Oliveira, M. Muzzi, M. Pasupuleti, M. Singh, M. Paluri, M. Kardas, M. Oldham, M. Rita, M. Pavlova, M. Kambadur, M. Lewis, M. Si, M. K. Singh, M. Hassan, N. Goyal, N. Torabi, N. Bashlykov, N. Bogoychev, N. Chatterji, O. Duchenne, O. Çelebi, P. Alrassy, P. Zhang, P. Li, P. Vasic, P. Weng, P. Bhargava, P. Dubal, P. Krishnan, P. S. Koura, P. Xu, Q. He, Q. Dong, R. Srinivasan, R. Ganapathy, R. Calderer, R. S. Cabral, R. Stojnic, R. Raileanu, R. Girdhar, R. Patel, R. Sauvestre, R. Polidoro, R. Sumbaly, R. Taylor, R. Silva, R. Hou, R. Wang, S. Hosseini, S. Chennabasappa, S. Singh, S. Bell, S. S. Kim, S. Edunov, S. Nie, S. Narang, S. Raparthy, S. Shen, S. Wan, S. Bhosale, S. Zhang, S. Vandenhende, S. Batra, S. Whitman, S. Sootla, S. Collot, S. Gururangan, S. Borodinsky, T. Herman, T. Fowler, T. Sheasha, T. Georgiou, T. Scialom, T. Speckbacher, T. Mihaylov, T. Xiao, U. Karn, V. Goswami, V. Gupta, V. Ramanathan, V. Kerkez, V. Gonguet, V. Do, V. Vogeti, V. Petrovic, W. Chu, W. Xiong, W. Fu, W. Meers, X. Martinet, X. Wang, X. E. Tan, X. Xie, X. Jia, X. Wang, Y. Goldschlag, Y. Gaur, Y. Babaei, Y. Wen, Y. Song, Y. Zhang, Y. Li, Y. Mao, Z. D. Coudert, Z. Yan, Z. Chen, Z. Papakipos, A. Singh, A. Grattafiori, A. Jain, A. Kelsey, A. Shajnfeld, A. Gangidi, A. Victoria, A. Goldstand, A. Menon, A. Sharma, A. Boesenberg, A. Vaughan, A. Baevski, A. Feinstein, A. Kallet, A. Sangani, A. Yunus, A. Lupu, A. Alvarado, A. Caples, A. Gu, A. Ho, A. Poulton, A. Ryan, A. Ramchandani, A. Franco, A. Saraf, A. Chowdhury, A. Gabriel, A. Bharambe, A. Eisenman, A. Yazdan, B. James, B. Maurer, B. Leonhardi, B. Huang, B. Loyd, B. D. Paola, B. Paranjape, B. Liu, B. Wu, B. Ni, B. Hancock, B. Wasti, B. Spence, B. Stojkovic, B. Gamido, B. Montalvo, C. Parker, C. Burton, C. Mejia, C. Wang, C. Kim, C. Zhou, C. Hu, C.-H. Chu, C. Cai, C. Tindal, C. Feichtenhofer, D. Civin, D. Beaty, D. Kreymer, D. Li, D. Wyatt, D. Adkins, D. Xu, D. Testuggine, D. David, D. Parikh, D. Liskovich, D. Foss, D. Wang, D. Le, D. Holland, E. Dowling, E. Jamil, E. Montgomery, E. Presani, E. Hahn, E. Wood, E. Brinkman, E. Arcaute, E. Dunbar, E. Smothers, F. Sun, F. Kreuk, F. Tian, F. Ozgenel, F. Caggioni, F. Guzmán, F. Kanayet, F. Seide, G. M. Florez, G. Schwarz, G. Badeer, G. Swee, G. Halpern, G. Thattai, G. Herman, G. Sizov, Guangyi, Zhang, G. Lakshminarayanan, H. Shojanazeri, H. Zou, H. Wang, H. Zha, H. Habeeb, H. Rudolph, H. Suk, H. Aspegren, H. Goldman, I. Damlaj, I. Molybog, I. Tufanov, I.-E. Veliche, I. Gat, J. Weissman, J. Geboski, J. Kohli, J. Asher, J.-B. Gaya, J. Marcus, J. Tang, J. Chan, J. Zhen, J. Reizenstein, J. Teboul, J. Zhong, J. Jin, J. Yang, J. Cummings, J. Carvill, J. Shepard, J. McPhie, J. Torres, J. Ginsburg, J. Wang, K. Wu, K. H. U, K. Saxena, K. Prasad, K. Khandelwal, K. Zand, K. Matosich, K. Veeraraghavan, K. Michelena, K. Li, K. Huang, K. Chawla, K. Lakhotia, K. Huang, L. Chen, L. Garg, L. A, L. Silva, L. Bell, L. Zhang, L. Guo, L. Yu, L. Moshkovich, L. Wehrstedt, M. Khabsa, M. Avalani, M. Bhatt, M. Tsimpoukelli, M. Mankus, M. Hasson, M. Lennie, M. Reso, M. Groshev, M. Naumov, M. Lathi, M. Keneally, M. L. Seltzer, M. Valko, M. Restrepo, M. Patel, M. Vyatskov, M. Samvelyan, M. Clark, M. Macey, M. Wang, M. J. Hermoso, M. Metanat, M. Rastegari, M. Bansal, N. Santhanam, N. Parks, N. White, N. Bawa, N. Singhal, N. Egebo, N. Usunier, N. P. Laptev, N. Dong, N. Zhang, N. Cheng, O. Chernoguz, O. Hart, O. Salpekar, O. Kalinli, P. Kent, P. Parekh, P. Saab, P. Balaji, P. Rittner, P. Bontrager, P. Roux, P. Dollar, P. Zvyagina, P. Ratanchandani, P. Yuvraj, Q. Liang, R. Alao, R. Rodriguez, R. Ayub, R. Murthy, R. Nayani, R. Mitra, R. Li, R. Hogan, R. Battey, R. Wang, R. Maheswari, R. Howes, R. Rinott, S. J. Bondu, S. Datta, S. Chugh, S. Hunt, S. Dhillon, S. Sidorov, S. Pan, S. Verma, S. Yamamoto, S. Ramaswamy, S. Lindsay, S. Lindsay, S. Feng, S. Lin, S. C. Zha, S. Shankar, S. Zhang, S. Zhang, S. Wang, S. Agarwal, S. Sajuyigbe, S. Chintala, S. Max, S. Chen, S. Kehoe, S. Satterfield, S. Govindaprasad, S. Gupta, S. Cho, S. Virk, S. Subramanian, S. Choudhury, S. Goldman, T. Remez, T. Glaser, T. Best, T. Kohler, T. Robinson, T. Li, T. Zhang, T. Matthews, T. Chou, T. Shaked, V. Vontimitta, V. Ajayi, V. Montanez, V. Mohan, V. S. Kumar, V. Mangla, V. Albiero, V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov, W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable, X. Tang, X. Wang, X. Wu, X. Wang, X. Xia, X. Wu, X. Gao, Y. Chen, Y. Hu, Y. Jia, Y. Qi, Y. Li, Y. Zhang, Y. Zhang, Y. Adi, Y. Nam, Yu, Wang, Y. Hao, Y. Qian, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick, Z. Wen, Z. Yang, Z. Zhao, The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
[19] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the italian language: Llamantino-3-anita, 2024. URL: https://arxiv.org/abs/2405.07101. arXiv:2405.07101.
[20] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.