=Paper= {{Paper |id=Vol-3878/129_calamita_long |storemode=property |title=INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge |pdfUrl=https://ceur-ws.org/Vol-3878/129_calamita_long.pdf |volume=Vol-3878 |authors=Giovanni Puccetti,Maria Cassese,Andrea Esuli |dblpUrl=https://dblp.org/rec/conf/clic-it/0002CE24 }} ==INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge== https://ceur-ws.org/Vol-3878/129_calamita_long.pdf
                                INVALSI - Mathematical and Language Understanding in
                                Italian: A CALAMITA Challenge
                                Giovanni Puccetti1,∗ , Maria Cassese1 and Andrea Esuli1
                                1
                                    Istituto di Scienza e Tecnologia dell’Informazione - CNR


Abstract

While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate the generative abilities of Language Models (LMs) in this language. This work presents two new benchmarks: Invalsi MATE, to evaluate models' performance on mathematical understanding in Italian, and Invalsi ITA, to evaluate language understanding in Italian.

These benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system. These tests are prepared by expert pedagogists and have the explicit goal of testing average students' performance over time across Italy. Therefore, the questions are well written, appropriate for the age of the students, and developed with the goal of assessing skills that are essential in the learning process, ensuring that the benchmark proposed here measures key knowledge for undergraduate students.

Invalsi MATE is composed of 420 questions about mathematical understanding. These questions range from simple money-counting problems to Cartesian geometry questions, e.g. determining whether a point belongs to a given line. They are divided into 4 types: scelta multipla (multiple choice), vero/falso (true/false), numero (number), completa frase (fill the gap).

Invalsi ITA is composed of 1297 questions regarding language understanding. These questions involve both the ability to extract information and answer questions about a text passage and questions about grammatical knowledge. They are divided into 4 types: scelta multipla (multiple choice), binaria (binary), domanda aperta (open question), altro (other).

We evaluate 4 powerful language models, both English-first and tuned for Italian, and find that the best accuracy on Invalsi MATE is 55% while the best accuracy on Invalsi ITA is 80%.

Keywords
Mathematical Understanding, Language Understanding, Invalsi, Large Language Models, Italian Language Models



CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
Email: giovanni.puccetti@isti.cnr.it (G. Puccetti); maria.cassese@isti.cnr.it (M. Cassese); andrea.esuli@isti.cnr.it (A. Esuli)
URL: https://gpucce.github.io/ (G. Puccetti); https://www.isti.cnr.it/en/about/people-detail/1072/Maria_Cassese (M. Cassese); https://www.esuli.it/ (A. Esuli)
ORCID: 0000-0002-5725-4322 (A. Esuli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Challenge: Introduction and Motivation

Assessing the quality of Large Language Models is a challenging task because these models can perform virtually any task that can be presented through natural language. To address this difficulty, each model needs to be tested on several tasks at once.

To help provide new benchmarks to evaluate LLMs in Italian, we propose two benchmarks, Invalsi MATE and Invalsi ITA, the first meant to evaluate LLMs' mathematical understanding and the second to evaluate their language understanding, both in Italian.

These benchmarks originate from the Invalsi tests, which have been used in the past for demographic studies [1, 2, 3], but, to the best of our knowledge, we are the first to use them to test LLMs' performance in Italian [4], followed only later by others [5].

There are several benchmarks to evaluate the mathematical understanding of LLMs based on English tests [6, 7, 8], and there are also several multi-domain benchmarks involving Italian [9]; however, there aren't any specifically focused on mathematical understanding in Italian. We focus on high-school questions; an English benchmark similar to Invalsi MATE is GSM8k [8], which contains 8,500 grade-school math word problems.

Language Models' understanding of language in English is also well studied: there are several benchmarks meant to measure the ability of language models to understand language constructs in English, such as [10, 11], also arranged into extensive suites [12]. In contrast, there are fewer examples of these tests for the Italian language.

Therefore we propose Invalsi ITA, which contains questions that are usually split among several different benchmarks, e.g. MNLI [13], SQuAD [14] and others from the GLUE suite [15]. The questions in the dataset cover several aspects of language understanding, ranging from the ability to extract specific information, such as the date when something happened, to more complex information, such as whether two events imply each other or not.

These two datasets allow us to measure two key abilities of language models in Italian. To make the comparison among different models fairer, we cast all questions as multiple choice and measure models' performance by selecting the answer with the highest likelihood according




      (a) scelta multipla question from Invalsi MATE.
      Testo
      Elisa è uscita da casa questa mattina alle ore 8:15.
      Elisa è rientrata nel pomeriggio alle ore 1:15
      Domanda
      Quanto tempo è stata fuori casa Elisa?
      A. 5 ore B. 7 ore C. 9 ore D. 11 ore

      (b) vero/falso question from Invalsi MATE.
      Testo
      Se moltiplichi per 2 un numero naturale e dal risultato sottrai 1, ottieni sempre un numero pari.
      Domanda
      Vero o Falso?

      (c) numero question from Invalsi MATE.
      Testo
      Filippo dice: per trovare il numero della mia maglietta aggiungi una decina e sei unità al numero 4.
      Domanda
      Qual è il numero della maglietta di Filippo?

      (d) completa frase question from Invalsi MATE.
      Testo
      Luca lancia due dadi a sei facce non truccati.
      Domanda
      Completa la frase inserendo una delle espressioni:
      La probabilità che la somma dei punti sia 12 è maggiore della, minore della, uguale alla probabilità che la somma sia 2.

Figure 1: Examples of each question type from the Invalsi MATE dataset.



to the model.

We measure the performance of 4 strong large models, mixtral instruct [16], mistral instruct [17], llama 3 8b instruct [18] and anita 8b dpo [19], of which the first three are English-first and the fourth is fine-tuned in Italian, as well as minerva 3b, currently the only Italian-first model. We show that on Invalsi ITA the best model among those we tested is mixtral instruct, which reaches an accuracy of 0.80, while on Invalsi MATE the highest accuracy is 0.55, also achieved by mixtral instruct.

Both language understanding and mathematical understanding are key abilities for students as well as language models, particularly since these models are often used in learning environments. By adding these benchmarks to the CALAMITA suite, we hope they will help the development of LLMs in Italian by providing a more comprehensive evaluation of their abilities, thus fostering the research and development of models in this language.

The CALAMITA special event [20], which aims to establish a shared benchmark for LLMs in Italian, is a first step towards a systematic evaluation of LLMs in this language. We hope that the Invalsi challenge will enrich the linguistic and mathematical understanding branches of this shared benchmark.

2. Challenge: Description

The challenge is composed of two tasks: Invalsi MATE and Invalsi ITA. For each task, we provide a detailed description of the data, the metrics used for evaluation, and the limitations of the data.

2.1. Task 1: Mathematical Understanding in Italian (Invalsi MATE)

The first task consists in answering mathematical questions in Italian. These questions are meant for students from 6 to 18 years of age; therefore the kind of question can vary significantly, from simpler, example-based ones that don't require any knowledge besides counting, to more complex ones requiring basic geometry and calculus training and knowledge, never beyond what is demanded in basic high-school tests.

The questions are of 4 kinds, scelta multipla, completa frase, vero/falso and numero:

     • scelta multipla (multiple choice): the question requires picking the right answer among four possible ones;
     • vero/falso (true/false): the question requires picking the right answer between true and false;
     • numero (number): the question requires giving the number that is the correct answer to the question;
     • completa frase (fill the gap): the question requires filling in one or more missing words to make the text coherent.

Of the four question types, scelta multipla and vero/falso are naturally multiple choice, with scelta multipla always having 4 possible answers (A, B, C, D) and vero/falso always 2 (true, false). Questions of the numero type are not naturally multiple choice, since the answer is a
                          (a) Invalsi MATE                                                (b) Invalsi ITA

Figure 2: The distribution of Question types in the Invalsi datasets in (a) for Invalsi MATE and in (b) for Invalsi ITA.



number among all possible ones (some of the answers will be a year, e.g. 1948, while others can be a decimal number of liters of milk, e.g. 0.2). To address this, we add 3 extra answers that are realistic but wrong to make the questions multiple choice. Finally, completa frase questions, of which there are only a few (20), are too difficult to turn into multiple choice without changing their meaning, and therefore we exclude them.

2.2. Task 2: Language Understanding in Italian (Invalsi ITA)

The second task consists in answering Italian language understanding questions. Similarly to Task 1, these questions are appropriate for students between 6 and 18 years old, and they are overall not too difficult to answer. Most of the questions concern a text passage that has to be included in the model context, making this evaluation more costly because the context becomes considerably larger. The text passage is where the difficulty difference between ages is most evident, since it can be a simple and short story for primary school students, while it is generally a longer and more involved text for older students.

The questions are of 4 different types, scelta multipla, binaria, domanda aperta and altro:

     • scelta multipla (multiple choice): the question requires picking the right answer among four possible ones;
     • binaria (binary): the question requires picking the right answer about a binary property of a statement, e.g. True - False, Before - After, etc.;
     • domanda aperta (open question): the question requires picking the passage in the text that answers the question;
     • altro (other): a small share of questions are open-ended questions with varying scope that are hard to put under a single label.

Similar to Invalsi MATE, this task involves only multiple-choice questions, evaluated through a likelihood approach. Both scelta multipla and binaria are naturally of this kind, the first with 4 options (A, B, C, D) and the second with 2 options that change for each question. Both domanda aperta and altro questions are hard to turn into multiple-choice ones and therefore we discard them. For Invalsi ITA too, this means discarding only about 180 questions out of 1297, so the task involves 1117 samples.

3. Data description

3.1. Origin of data

The dataset is built upon the questions from the Invalsi tests of the last 15 years. These tests are administered to students yearly. There are three different Invalsi tests: Language, Mathematics and English. For the scope of this dataset we don't look into the English test, but limit ourselves to the Italian language and Mathematics ones. The original questions can be accessed at https://www.gestinv.it/Index.aspx.

Some of the questions from the original tests contain visual content as part of the question; we omit these questions since we focus on the language understanding abilities of the models.
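
The recasting of questions into multiple choice described in Sections 2.1 and 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the column names tipo, risposta, alt1, alt2 and alt3 are those of the dataset, while the function names, the shuffling step, the fixed seed and the sample records are hypothetical and introduced only for the example.

```python
import random

# Types kept because they can be scored as multiple choice (Sections 2.1 and 2.2).
KEPT_TYPES = {"scelta multipla", "vero/falso", "binaria", "numero"}


def is_kept(record):
    """True if the question type is evaluated; the other types are discarded."""
    return record["tipo"] in KEPT_TYPES


def numero_options(record, rng):
    """Four candidate answers for a numero question: the correct one plus alt1..alt3."""
    options = [record["risposta"], record["alt1"], record["alt2"], record["alt3"]]
    rng.shuffle(options)  # assumption: shuffle so the correct answer has no fixed position
    return options


# Hypothetical records, invented for illustration only.
records = [
    {"tipo": "numero", "risposta": "1948", "alt1": "1946", "alt2": "1918", "alt3": "1861"},
    {"tipo": "completa frase", "risposta": "uguale alla"},
    {"tipo": "vero/falso", "risposta": "Falso"},
]

kept = [r for r in records if is_kept(r)]
options = numero_options(kept[0], random.Random(0))
print(len(kept), sorted(options))  # 2 ['1861', '1918', '1946', '1948']
```

Keeping the correct answer and the three distractors in one list means that a numero question can then be scored exactly like a scelta multipla one.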
  Question Type         ALL    scelta multipla   vero/falso   numero
  N. Questions          400    244               54           102
  Model                 Accuracy
  mixtral instruct      0.55   0.49              0.63         0.66
  mistral instruct      0.44   0.34              0.59         0.63
  anita 8b dpo          0.47   0.40              0.61         0.55
  llama 3 8b instruct   0.48   0.42              0.57         0.58
  minerva 3b            0.20   0.22              0.50         0.32
  random                0.28   0.25              0.5          0.25
Table 1
Models 0-Shot accuracy on Invalsi MATE, likelihood based evaluation. In bold the highest accuracy in each column and underlined the second highest.

  Question Type         ALL    scelta multipla   binaria
  N. Questions          1117   977               140
  Model                 Accuracy
  mixtral instruct      0.80   0.82              0.69
  mistral instruct      0.49   0.60              0.51
  anita 8b dpo          0.71   0.72              0.66
  llama 3 8b instruct   0.69   0.70              0.61
  minerva 3b            0.30   0.25              0.54
  random                0.27   0.25              0.44
Table 2
Models 0-Shot accuracy on Invalsi ITA, likelihood based evaluation. In bold the highest accuracy in each column and underlined the second highest.
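
As a reading aid for Tables 1 and 2, the ALL column can be related to the per-type columns: since accuracy is a fraction of correct answers, the overall value should equal the mean of the per-type accuracies weighted by the number of questions of each type. The sketch below checks this for mixtral instruct, and for uniform guessing on Invalsi MATE, using only numbers printed in the tables. The function name is ours, and small discrepancies are to be expected because the table entries are rounded to two decimals.

```python
def weighted_accuracy(counts, accuracies):
    """Overall accuracy as the question-count-weighted mean of per-type accuracies."""
    total = sum(counts.values())
    return sum(counts[k] * accuracies[k] for k in counts) / total


# Question counts and mixtral instruct accuracies copied from Tables 1 and 2.
mate_counts = {"scelta multipla": 244, "vero/falso": 54, "numero": 102}
mate_mixtral = {"scelta multipla": 0.49, "vero/falso": 0.63, "numero": 0.66}
ita_counts = {"scelta multipla": 977, "binaria": 140}
ita_mixtral = {"scelta multipla": 0.82, "binaria": 0.69}

# Uniform guessing: 1/4 for four-option questions, 1/2 for two-option ones.
mate_uniform = {"scelta multipla": 0.25, "vero/falso": 0.5, "numero": 0.25}

mate_all = weighted_accuracy(mate_counts, mate_mixtral)
ita_all = weighted_accuracy(ita_counts, ita_mixtral)
mate_chance = weighted_accuracy(mate_counts, mate_uniform)
print(round(mate_all, 2), round(ita_all, 2), round(mate_chance, 2))  # 0.55 0.8 0.28
```

The three values agree with the ALL entries for mixtral instruct in both tables and with the random row of Table 1.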

The data is first collected as is from the webpage and afterwards manually checked for errors and inconsistencies by two annotators. The annotators hold MSc degrees in Mathematics and Computer Science, which gives them sufficient knowledge to identify issues in the Invalsi MATE questions. For Invalsi ITA, the annotators don't have an appropriate background; however, the questions are simple enough that they can be easily understood and checked for errors by anybody who has completed mandatory education.

3.2. Data format

The Invalsi MATE dataset has 8 different columns:

      • testo: this field contains the context needed to answer the question; it is often empty for Invalsi MATE since most of the context is part of the domanda field itself;
      • domanda: this field contains the question itself, including possible answer options, e.g. for scelta multipla questions;
      • risposta: this field contains the correct answer;
      • test_id: this field is just an id to identify each sample;
      • tipo: this field indicates the question type, among scelta multipla, vero/falso, numero and completa frase;
      • alt1, alt2 and alt3: these three fields contain the alternative values for the numero questions, since we chose these ourselves and they are not indicated in the domanda field.

The Invalsi ITA dataset has the same fields as the Invalsi MATE one, with the exception that the testo field is often present and generally the longest.

We evaluate in a zero-shot fashion, just providing the model with the question and using a likelihood-based method: we pick as the model's answer the one with the highest likelihood among the options available. This is always possible since we have recast all the questions as multiple-choice ones. We also don't use chain-of-thought prompts or similar methods. Since this is the first attempt to build a dataset on mathematical understanding in Italian, we currently evaluate with the simplest approach.

3.3. Detailed data statistics

The data does not have a train and a test split because we have a limited number of samples. The Invalsi MATE split is composed of 420 samples, of which 400 are used in the benchmark, since we exclude the 20 questions marked as completa frase because they can't be made into multiple choice.

Figure 2a shows the percentage of questions of each kind in the dataset: scelta multipla has the largest share, 58%, the second most present is numero, 24.7%, and then vero/falso and completa frase are fewer. Table 1 has a random row that shows the performance if one were to pick answers at random; moreover, the correct answers for each question type are approximately evenly distributed among labels. Specifically, scelta multipla questions have answers distributed as follows: 46 are labelled A, 87 B, 71 C and 40 D, showing a moderate balance. Similarly, for vero/falso questions there are 24 questions with a positive answer and 30 with a negative answer.

Similarly, Figure 2b shows the percentage of questions of each of the kinds present in Invalsi ITA: scelta multipla is by a large margin the most present, composing 76.4% of all the questions, binaria is second with 10.9%, while domanda aperta and altro are fewer.

Table 2 shows the performance one can achieve by picking answers at random in each split; moreover, the correct answers are evenly distributed across labels in this dataset too. In particular, for Invalsi ITA 254 of the scelta multipla questions have answer A, 255 B, 263 C and 205 D, which is comparable to the distribution in the Invalsi MATE dataset, and the same holds for binaria.
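
The zero-shot, likelihood-based selection described in Section 3.2, scored as the fraction of correct answers, can be sketched as follows. This is a minimal sketch and not the authors' evaluation code: the paper does not specify the prompt template or any length normalization, so the scoring function is left as a parameter, and `toy_log_likelihood` with its scores is a hypothetical stand-in for a real language model.

```python
from typing import Callable, Sequence


def pick_answer(prompt: str, options: Sequence[str],
                log_likelihood: Callable[[str, str], float]) -> str:
    """Return the option to which the model assigns the highest likelihood."""
    return max(options, key=lambda option: log_likelihood(prompt, option))


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Number of correct answers divided by the number of all answers."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical scores standing in for a model's log-likelihood of each option.
TOY_SCORES = {"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.1, "Vero": -1.2, "Falso": -0.7}


def toy_log_likelihood(prompt: str, option: str) -> float:
    return TOY_SCORES[option]


# Hypothetical questions: (prompt, options, correct answer).
questions = [
    ("prompt for a scelta multipla question", ["A", "B", "C", "D"], "B"),
    ("prompt for a vero/falso question", ["Vero", "Falso"], "Vero"),
]

predictions = [pick_answer(p, opts, toy_log_likelihood) for p, opts, _ in questions]
gold = [answer for _, _, answer in questions]
print(predictions, accuracy(predictions, gold))  # ['B', 'Falso'] 0.5
```

With a real model, `log_likelihood` would typically sum the log-probabilities of the option tokens conditioned on the prompt. Whether and how that sum is normalized by option length is a design choice that the text leaves open.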
4. Metrics

Since all the datasets and splits we study have balanced labels, we choose to measure accuracy. In particular, since all the questions in the datasets we propose are multiple choice, it is straightforward to measure accuracy even if there are questions of different kinds, by simply computing $|\text{correct answers}| / |\text{all answers}|$.

To see how challenging our benchmark is, we evaluate four powerful language models based on mistral, mixtral and llama 3; in particular, we measure the performance of mistral instruct, mixtral instruct, anita 8b dpo and llama 3 8b instruct.

These models have between 7 and 54 billion parameters. Three of them, mistral instruct, anita 8b dpo and llama 3 8b instruct, are purely autoregressive transformers, while mixtral instruct is a MoE architecture that has 54 billion parameters but only uses 14 billion at inference. We test them all in the same way, using a likelihood-based approach.

Table 1 shows the performance of these models on Invalsi MATE, on the whole dataset in the ALL column and on each split, scelta multipla, vero/falso and numero, in the respective columns. mixtral instruct is the clear winner among the models we tested: it beats the second best, llama 3 8b instruct, by 7 accuracy points on the entire dataset. On the separate splits, mixtral instruct is always best; however, the second-best model changes, with llama 3 8b instruct being second on scelta multipla with a 7-point gap, anita 8b dpo second on vero/falso with a smaller 2-point gap and mistral instruct second on numero with a 3-point gap.

The total accuracy is bounded by 55%, showing that the Invalsi MATE task is challenging for models of the sizes we tested, up to 54B parameters, and that the performance a model achieves provides valuable insights about how well it can perform mathematical reasoning in Italian.

Table 2 shows the performance of the same models on Invalsi ITA. mixtral instruct is again the strongest, with anita 8b dpo second best. Performance on Invalsi ITA is higher overall, with mixtral instruct achieving 80% accuracy. Unlike what happens for Invalsi MATE, in Invalsi ITA the second-best model is the same across the board: anita 8b dpo is second on scelta multipla as well as on binaria, with a gap of 10 points on scelta multipla and 3 points on binaria.

The accuracy of the best model on all the questions at once is 80%, showing that the models we tested perform well at language understanding in Italian.

5. Conclusions

We propose two tasks, Invalsi MATE and Invalsi ITA, the first for the evaluation of mathematical understanding and the second for the evaluation of language understanding in Italian.

For Invalsi MATE we have collected 420 questions divided into 4 types, scelta multipla, vero/falso, numero and completa frase, and we evaluate 4 strong language models that are near SOTA in their weight range: mixtral instruct, mistral instruct, llama 3 8b instruct and anita 8b dpo. We find that these models are still far from perfect mathematical understanding in Italian, with the highest accuracy, achieved by mixtral instruct, being 55%.

For Invalsi ITA we have collected 1297 questions divided into 4 types, scelta multipla, binaria, domanda aperta and altro. We tested the same models on this benchmark and found that models are stronger at language understanding, with the highest accuracy in this task at 80%, also achieved by mixtral instruct.

Both mathematical and language understanding are key abilities for LLMs. We believe that our two benchmarks will foster the development of LLMs in Italian and pave the way for new, more challenging benchmarks on mathematical and language understanding in Italian.

6. Limitations

The main limitation of the benchmark we propose lies in Task 2, Invalsi ITA: we show that the models we test achieve very high accuracy, up to 80%, on this benchmark, making it possibly too simple for newer and larger models. Nevertheless, current Italian-first LLMs are not comparable to larger English-first ones, and therefore we believe it can still be valuable in this transitory phase. In contrast, Invalsi MATE is very challenging and it seems that models won't saturate it soon.

We believe that there is a limited risk of contamination from both existing English and Italian tests.

Concerning direct contamination, we were unable to find any web page that would expose the answers openly without needing any sort of authentication, making it difficult to crawl these data automatically. Therefore, while the questions might be present in the training set of some of the models, we deem it unlikely that the answers were there too.

Concerning contamination through translation from English, the Invalsi questions are carefully crafted to match the grade of the students that will undertake them; therefore we believe it is unlikely that they are taken from English questions available in other online sources, but rather that they are created specifically for each new annual test.

Acknowledgments

This work was also partially supported by: FAIR (PE00000013) project under the NextGenerationEU programme, partially by the PNRR project ITSERR (CUP
B53C22001770006) and partially by the Project PRIN                 verifiers to solve math word problems, 2021.
2022EPTPJ9 (WEMB – “Word EMBeddings: From Cog-                     arXiv:2110.14168 .
nitive Linguistics to Language Engineering, and Back”),        [9] R. J. Das, S. E. Hristov, H. Li, D. I. Dimitrov, I. Koy-
funded by the Italian Ministry of University and Research          chev, P. Nakov, Exams-v: A multi-discipline multi-
(MUR).                                                             lingual multimodal exam benchmark for evaluating