<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruggero Marino Lazzaroni</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Angioi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Puliga</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Sanna</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Marras</string-name>
        </contrib>
        <aff>University of Graz</aff>
        <aff>OnePix Academy</aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (&lt;30B parameters), focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM Evaluation</kwd>
        <kwd>Benchmark</kwd>
        <kwd>Italian NLP</kwd>
        <kwd>Medical Education</kwd>
        <kwd>Question Answering</kwd>
        <kwd>Educational Technology</kwd>
        <kwd>CLiC-it</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks [1], transforming artificial intelligence applications. Their potential in specialized domains, particularly education [2, 3], offers promise for personalized learning, assessment, and high-stakes examination support. As LLMs advance, rigorous and contextually relevant evaluation methodologies become essential.</p>
      <p>However, a significant portion of existing LLM benchmarks are predominantly English-centric [4, 5, 6], and resources for non-English languages, especially in specific, demanding domains, remain comparatively scarce. This gap is particularly evident for the Italian language, where the lack of specialized benchmarks [7, 8, 9] can hinder the objective assessment of LLM performance, limit the development of tailored educational technologies, and necessitate reliance on translated materials which may be imperfect or fail to capture local educational nuances.</p>
      <p>In this paper, we introduce MedBench-IT, a novel and comprehensive benchmark specifically designed to evaluate the performance of LLMs on Italian medical university entrance examination questions. Sourced from Edizioni Simone, a leading Italian publisher of preparatory materials, MedBench-IT comprises 17,410 expert-written, multiple-choice questions. These questions span six core subjects (Biology, Chemistry, Logic, General Culture, Mathematics, and Physics) and are categorized into three distinct difficulty levels, mirroring the structure of the actual Italian medical admissions tests. Our evaluation encompasses a diverse range of models, including leading proprietary LLMs (e.g., GPT-4o, Claude series) and resource-efficient open-source alternatives (&lt;30B parameters), with a particular focus on models practical for deployment in various Italian organizational contexts.</p>
      <p>Our evaluation methodology begins with standard accuracy assessments and is then augmented with several in-depth analyses designed to probe model robustness and behavior. These include rigorous tests for reproducibility (examining response consistency across identical runs), ordering bias (assessing sensitivity to the permutation of answer choices), and the impact of explicit reasoning prompts on model performance. We also investigate the relationship between question text readability and model accuracy, providing further dimensions for understanding model capabilities.</p>
      <p>Our primary contributions include:</p>
      <p>• The creation and presentation of MedBench-IT, the first large-scale benchmark specifically for Italian medical entrance exam questions, curated from expert-validated sources, meant to be a valuable resource for fostering LLMs for the Italian language, particularly within its educational technology sector.
• An extensive empirical evaluation of a diverse set of state-of-the-art and practically deployable LLMs on MedBench-IT.
• In-depth analyses of model consistency (reproducibility), robustness (ordering bias), and the differential impact of direct versus reasoning-eliciting prompting strategies.
• Actionable insights into factors such as subject matter, question difficulty, and text readability that influence LLM performance within this specific Italian educational context.</p>
      <sec id="sec-1-1">
        <title>2.2. Medical Domain LLM Benchmarks</title>
        <p>In the medical domain, benchmarks such as MedQA
[11], PubMedQA [12], MedExQA [13], and challenges</p>
        <p>The remainder of this paper is structured as follows:
Section 2 discusses related work in LLM evaluation and
Italian NLP resources. Section 3 details the construction
and characteristics of the MedBench-IT dataset. Section 4
outlines our experimental setup, including the core
evaluation and subsequent analytical tests. Section 5 presents
and analyzes the results from these evaluations. Section 6
discusses the broader implications of our findings.
Section 7 acknowledges the limitations of our study, and
Section 8 concludes the paper with directions for future
work.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The evaluation of Large Language Models (LLMs) is a
rapidly evolving field, with numerous benchmarks
developed to assess their capabilities across various
dimensions.</p>
      <sec id="sec-2-1">
        <title>2.1. General LLM Evaluation Benchmarks</title>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. LLM Evaluation and Resources in</title>
      </sec>
      <sec id="sec-2-3">
        <title>Italian</title>
        <p>The Italian NLP community has developed evaluation frameworks like the CALAMITA challenge [9], which includes the Mult-IT dataset [15] with questions from Italian university entrance and public sector exams. Medical domain efforts include work on specialty tests [16] and shared tasks like CLinkaRT at EVALITA 2023, focused on the clinical domain [17]. MedBench-IT distinguishes itself through its specific medical entrance exam focus, a larger specialized corpus (17,410 medical questions), detailed subject/difficulty breakdowns, and comprehensive robustness analyses.</p>
        <p>Other evaluation suites for Italian, such as ItaEval
[7] and ITA-Bench [8], aim to provide broader
assessments of LLM capabilities, often by translating existing
English benchmarks or adapting various Italian datasets.</p>
        <p>In the educational context, benchmarks derived from INVALSI tests (standardized national assessments) like those discussed by Puccetti et al. [18] can assess linguistic and mathematical understanding. Unlike these general-purpose benchmarks, MedBench-IT focuses specifically on medical entrance exams using native Italian content.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Studies on LLM Robustness and Reasoning</title>
        <p>Beyond accuracy, LLM robustness and reasoning capabilities are critical areas of investigation. Prior research has highlighted LLM sensitivity to prompt variations [19], ordering biases in multiple-choice questions [4], and reproducibility challenges. Chain-of-Thought (CoT) prompting [20] efficacy varies by model and task complexity. Recent work [21] revealed significant limitations in LLM mathematical reasoning, showing apparent proficiency may depend more on pattern recognition than genuine understanding. Our work incorporates these considerations by establishing baseline performance and conducting specific experiments to assess reproducibility, ordering bias, and reasoning-eliciting prompt impact on MedBench-IT.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The MedBench-IT Benchmark</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset Construction</title>
        <p>MedBench-IT comprises multiple-choice questions provided by Edizioni Simone, a leading Italian publisher of medical entrance exam preparatory materials. Questions are expert-authored to accurately reflect official Italian medical admission exam style, content, and difficulty.</p>
        <p>From an initial corpus of 43,525 questions, we applied filtering steps: (1) removed image-reliant questions for text-based LLM compatibility; (2) excluded English subject questions; (3) stripped XML/HTML markup; (4) standardized format to question stem, five answer options, and single correct answer. After preprocessing, we selected a stratified sample of 17,410 questions maintaining original subject and difficulty proportions, inspired by MMLU's comparable size for balanced coverage and evaluation manageability.</p>
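        <p>The stratified sampling step can be illustrated with a short pandas sketch. This is not the authors' code: the file name, column names, and random seed below are assumptions for illustration only.</p>
        <preformat>
import pandas as pd

# Load the cleaned question corpus (hypothetical file and column names).
questions = pd.read_csv("medbench_it_cleaned.csv")
# expected columns: text, options, answer, subject, difficulty

TARGET_SIZE = 17_410

# Sample within each (subject, difficulty) stratum proportionally to its
# share of the corpus, preserving the original distribution.
fraction = TARGET_SIZE / len(questions)
sample = (
    questions
    .groupby(["subject", "difficulty"], group_keys=False)
    .apply(lambda g: g.sample(frac=fraction, random_state=42))
)

print(len(sample))  # approximately 17,410, up to rounding within strata
        </preformat>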
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Characteristics and</title>
      </sec>
      <sec id="sec-3-3">
        <title>Prompting</title>
        <p>The final dataset contains 17,410 questions with metadata indicating subject and difficulty level. Table 1 shows Biology (28.1%) and Chemistry (22.9%) as the largest portions, followed by Logic (17.3%), General Culture (13.2%), Mathematics (9.6%), and Physics (8.9%). Table 2 shows Level 1/Base (46.1%), Level 2/Intermediate (41.1%), and Level 3/Advanced (12.8%) distributions.</p>
        <p>An example Biology question:</p>
        <sec id="sec-3-3-1">
          <title>Domanda: La plasmolisi:</title>
          <p>Possibili risposte:
1. Avviene nelle cellule animali
2. E’ lo scollamento della membrana
plasmatica dalla parete nelle cellule vegetali
3. E’ causata da un eccessivo turgore della
cellula
4. Avviene in ambiente ipotonico
5. E’ la rottura della membrana cellulare nei
globuli rossi
(Risposta corretta: 2)
Question (English Translation):
Plasmolysis:
Possible answers:
1. Occurs in animal cells
2. Is the detachment of the plasma membrane
from the wall in plant cells
3. Is caused by excessive turgor of the cell
4. Occurs in a hypotonic environment
5. Is the rupture of the cell membrane in red
blood cells
(Correct Answer: 2)</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.3. Prompting Strategies</title>
        <p>To evaluate LLM performance on MedBench-IT, we employed two distinct zero-shot prompting strategies:
1. Standard Prompt (Direct Answering): This prompt presents the question and answer choices directly, asking the model to select the number corresponding to the correct answer. The format, presented to the models in Italian, is as follows (see Listing 1).
2. Reasoning-Eliciting Prompt: This prompt additionally asks the model to articulate its reasoning before giving the final answer number (see Listing 2).</p>
        <p>Listing 1: Standard Prompt Format used in MedBench-IT.</p>
        <p>Listing 2: Reasoning-Eliciting Prompt Format used in MedBench-IT.</p>
        <p>The models were instructed to output only the reasoning (if prompted) and the final answer number in the specified format. For experiments utilizing the reasoning prompt, the reasoning text was used for qualitative analysis, while only the numerical answer was used for accuracy scoring.</p>
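        <p>The listings themselves are not reproduced here; the Python sketch below shows plausible templates consistent with the description above. The exact Italian wording is an assumption, not the authors' verbatim prompts.</p>
        <preformat>
# Plausible reconstructions of the two zero-shot prompt formats.
# The Italian wording below is an assumption, not the paper's verbatim text.

STANDARD_PROMPT = """Domanda: {question}
Possibili risposte:
{options}
Rispondi solo con il numero della risposta corretta."""
# (Answer only with the number of the correct answer.)

REASONING_PROMPT = """Domanda: {question}
Possibili risposte:
{options}
Spiega brevemente il tuo ragionamento, poi indica il numero
della risposta corretta nel formato 'Risposta: N'."""
# (Briefly explain your reasoning, then give 'Risposta: N'.)

def build_prompt(question: str, options: list[str], reasoning: bool = False) -> str:
    """Fill either template with a question stem and its five options."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    template = REASONING_PROMPT if reasoning else STANDARD_PROMPT
    return template.format(question=question, options=numbered)
        </preformat>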
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>This section outlines the methodology employed for evaluating various Large Language Models (LLMs) on the MedBench-IT benchmark. We detail the models selected for evaluation, the primary metrics used, and the specific protocols for our specialized analyses.</p>
      <sec id="sec-4-1">
        <title>4.1. Models Evaluated</title>
        <p>A diverse range of LLMs was selected for evaluation on MedBench-IT, encompassing leading proprietary models and prominent open-source alternatives. The selection aimed to provide a comprehensive overview of current model capabilities, including models with a specific focus on Italian language tasks and those chosen for practical deployability.</p>
        <p>The open-source models evaluated represent the latest iterations of various families and sizes at the time of experimentation, including several fine-tuned for Italian:
• Qwen 2.5 series [22]: Including instruct versions from 0.5B to 14B parameters (e.g., Qwen 2.5 7B Instruct).
• Gemma 2 series [23]: Including instruct-tuned versions (Gemma 2 2B IT, Gemma 2 9B IT) and community fine-tunes focused on Italian.
• Llama 3 series and fine-tunes [24]: Models such as Llama 3.1 8B Instruct and various Italian fine-tunes contributed by the community.
• Phi series [25]: Including Phi-4.
• DeepSeek series [26]: Including models accessed via API: DeepSeek Chat (equivalent to DeepSeek-V3), DeepSeek Reasoner (equivalent to DeepSeek-R1), and locally deployed distilled models (e.g., DeepSeek R1 Distill Qwen 7B).
• OLMo 2 series [27]: OLMo 2 7B Instruct and OLMo 2 13B Instruct.
• Other notable models: Including Aya Expanse 8B [28], and models from the Minerva family by SapienzaNLP [29].</p>
        <p>All open-source models were run locally using standard libraries such as the vLLM framework [30]. For proprietary models, official APIs were used during the experimentation period (between December 2024 and January 2025). Unless specified otherwise (e.g., for reproducibility tests), a sampling temperature of 0 was used for all models to promote deterministic outputs for the main evaluation runs.</p>
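        <p>As an illustration of the local inference setup, the sketch below runs a batch of MedBench-IT-style prompts through vLLM with greedy decoding. The model name and token budget are examples, not the paper's exact configuration.</p>
        <preformat>
from vllm import LLM, SamplingParams

# Illustrative local-inference setup; model choice and max_tokens
# are assumptions for the sketch, not the paper's configuration.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Temperature 0 (greedy decoding) to promote deterministic outputs,
# as used for the main evaluation runs.
params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = [
    "Domanda: La plasmolisi:\nPossibili risposte:\n1. ...\n5. ...\n"
    "Rispondi solo con il numero della risposta corretta.",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text.strip())
        </preformat>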
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Metrics</title>
        <p>The primary metric used for evaluating model performance on MedBench-IT is accuracy, calculated as the percentage of questions for which the model provided the correct answer out of the total number of questions evaluated:</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}} \times 100\%</tex-math>
        </disp-formula>
        <p>Accuracy was computed overall, as well as broken down by:
• Subject area (Biology, Chemistry, Logic, etc.).
• Difficulty level (Level 1, Level 2, Level 3).</p>
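        <p>A minimal sketch of this computation, including the per-subject and per-difficulty breakdowns, assuming a pandas DataFrame of graded responses with hypothetical column names:</p>
        <preformat>
import pandas as pd

# Hypothetical per-question results: one row per question.
results = pd.DataFrame({
    "subject":    ["Biology", "Logic", "Logic", "Physics"],
    "difficulty": [1, 2, 3, 1],
    "correct":    [True, False, True, True],  # model answer == gold answer
})

# Overall accuracy (Equation 1), as a percentage.
overall = results["correct"].mean() * 100

# Breakdowns by subject area and by difficulty level.
by_subject = results.groupby("subject")["correct"].mean() * 100
by_difficulty = results.groupby("difficulty")["correct"].mean() * 100

print(f"Overall accuracy: {overall:.1f}%")
print(by_subject.round(1))
print(by_difficulty.round(1))
        </preformat>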
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Specialized Analyses Setup</title>
        <p>In addition to standard accuracy evaluation, we
conducted several specialized analyses to assess model
robustness and behavior:</p>
        <sec id="sec-4-3-1">
          <title>2https://huggingface.co/anakin87/gemma-2-9b-neogenesis-ita</title>
          <p>3https://huggingface.co/mii-llm/maestrale-chat-v0-4-beta</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>1. Reproducibility Test: To assess response con</title>
          <p>sistency, we evaluated GPT-4o twice on the entire 14B 76.8 67.9
MedBench-IT dataset using identical parameters 14B 72.6 76.9
(standard prompt, temperature 1). We compared 97BB 6612..71 6697..42
question-by-question responses, calculating per- 7B 61.1 67.6
centages of identical answers and consistent cor- 8B 50.3 57.4
rectness across runs (Section 5.2). 7B 50.8 53.0
2. Ordering Bias Test: To investigate whether an- 8B 46.7 0.1
swer option order influences predictions, we eval- 02.5BB 4213..12 3149..32
uated selected models (GPT-4o and Claude 3.5
Haiku) on both the original dataset and a version a Par. = Parameters (B = billion, – = proprietary)
with shufled answer options, comparing
accuracy scores to identify performance deviations
attributable to ordering (Section 5.3). Top proprietary models and large open-source models
3. Reasoning Impact Test: All models were eval- like DeepSeek Reasoner and o1-preview achieve accuracy
uated using both standard direct-answering and around or above 90%, followed by Claude 3.5 Sonnet and
reasoning-eliciting prompts. Accuracy scores and GPT-4/4o series in the mid-to-high 80s. Open-source
reasoning text length were analyzed for correla- models demonstrate strong capabilities, with Phi-4 and
tions with answer correctness (Section 5.4). Qwen 2.5 14B Instruct achieving 70%+ accuracy. Models
4. Readability Analysis: We calculated Flesch like Gemma 2 9B Instruct, Lexora Medium 7B, and Italian
Reading Ease scores (Formula di Flesch-Vacca) for adaptations of Gemma 2 9B (e.g.,
‘anakin87/gemma-2each question using the ‘textstat‘ library1. Logis- 9b-neogenesis-ita‘2) perform respectably around 60-62%.
tic regression analysis determined whether read- Smaller models like Llama 3.1 8B Instruct and the Italian
ability correlates with model performance under Maestrale family3 (based on Mistral 7B) score around 50%,
both prompt conditions (Section 5.5). while many other open-source models, including several
Italian fine-tunes of Llama 3 8B, fall into the 30-50% range.</p>
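        <p>For the ordering bias test, shuffling must also remap the gold answer index. A minimal sketch, where the question data layout is an assumption:</p>
        <preformat>
import random

def shuffle_options(question: dict, seed: int) -> dict:
    """Return a copy of a question with permuted answer options and a
    remapped correct-answer index (1-based, as in MedBench-IT)."""
    rng = random.Random(seed)
    order = list(range(len(question["options"])))
    rng.shuffle(order)
    return {
        "text": question["text"],
        "options": [question["options"][i] for i in order],
        # New 1-based position of the originally correct option.
        "answer": order.index(question["answer"] - 1) + 1,
    }

q = {"text": "La plasmolisi:", "options": ["A", "B", "C", "D", "E"], "answer": 2}
print(shuffle_options(q, seed=0))
        </preformat>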
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <p>This section presents the evaluation results of selected LLMs on MedBench-IT using standard zero-shot prompts, followed by specialized analyses.</p>
      <sec id="sec-5-1">
        <title>5.1. Overall Performance</title>
        <p>Top proprietary models and large open-source models like DeepSeek Reasoner and o1-preview achieve accuracy around or above 90%, followed by Claude 3.5 Sonnet and the GPT-4/4o series in the mid-to-high 80s. Open-source models demonstrate strong capabilities, with Phi-4 and Qwen 2.5 14B Instruct achieving 70%+ accuracy. Models like Gemma 2 9B Instruct, Lexora Medium 7B, and Italian adaptations of Gemma 2 9B (e.g., anakin87/gemma-2-9b-neogenesis-ita, https://huggingface.co/anakin87/gemma-2-9b-neogenesis-ita) perform respectably around 60-62%. Smaller models like Llama 3.1 8B Instruct and the Italian Maestrale family (based on Mistral 7B; https://huggingface.co/mii-llm/maestrale-chat-v0-4-beta) score around 50%, while many other open-source models, including several Italian fine-tunes of Llama 3 8B, fall into the 30-50% range. This ranking shows rapid progress in open-source models while still showing a performance delta compared to the best proprietary systems.</p>
        <p>Subject analysis reveals consistent difficulty patterns (full per-subject results in Appendix B, Table 4). Logic and Mathematics consistently emerge as most challenging for nearly all models. Top models often score 15-25 percentage points lower in Logic compared to Biology or Chemistry (e.g., GPT-4o: 92.4% in Biology vs 64.9% in Logic). This suggests abstract reasoning and multi-step problem-solving remain significant hurdles. Conversely, Biology, Chemistry, and General Culture show higher accuracy, likely reflecting strong factual knowledge capabilities. Physics performance is typically intermediate.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.2. Reproducibility Insights</title>
        <p>The reproducibility test on GPT-4o yielded 88.86% response consistency across two identical runs on 17,410 questions, indicating 11.14% different answer choices despite identical inputs.</p>
        <p>Consistency varied notably across subjects (Figure 1). Higher consistency was observed in knowledge-based subjects like Biology (96.8%) and General Culture (93.0%), while lower consistency was found in subjects requiring complex reasoning: Mathematics (79.8%) and Logic (73.6%). Physics (89.9%) and Chemistry (91.7%) showed intermediate consistency. Across difficulty levels, consistency remained stable (Level 1: 89.8%, Level 2: 88.1%, Level 3: 88.0%).</p>
        <p>Regarding correctness, 80.6% of responses were correct in both runs, 13.2% were incorrect in both runs, and 6.2% showed inconsistent correctness between runs. McNemar's test confirmed differences were not statistically significant (p &gt; 0.05), indicating normal stochastic variation rather than systematic instability.</p>
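        <p>The consistency and McNemar computations can be sketched as follows, assuming two aligned boolean arrays of per-question correctness (the variable names and toy data are hypothetical):</p>
        <preformat>
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-question correctness for two identical runs (aligned).
run_a = np.array([True, True, False, True, False, True])
run_b = np.array([True, False, False, True, True, True])

# 2x2 contingency table of (run A correct?, run B correct?).
both = np.sum(np.logical_and(run_a, run_b))
only_a = np.sum(np.logical_and(run_a, ~run_b))
only_b = np.sum(np.logical_and(~run_a, run_b))
neither = np.sum(np.logical_and(~run_a, ~run_b))
table = [[both, only_a], [only_b, neither]]

# McNemar's test asks whether the two discordant cells (correct in one
# run only) are balanced; imbalance would suggest systematic instability.
result = mcnemar(table, exact=True)
print(f"identical correctness: {np.mean(run_a == run_b):.1%}")
print(f"McNemar p-value: {result.pvalue:.3f}")
        </preformat>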
      </sec>
      <sec id="sec-4-5">
        <title>5.3. Ordering Bias</title>
        <p>The ordering bias test, shuffling answer choices for GPT-4o and Claude 3.5 Haiku, showed minimal impact. GPT-4o's accuracy dropped slightly from 83.9% to 83.5% (-0.4%), and Claude 3.5 Haiku decreased from 80.4% to 79.5% (-0.9%) (Figure 2: performance comparison for GPT-4o and Claude 3.5 Haiku on the standard vs. shuffled MedBench-IT benchmark).</p>
        <p>McNemar's test revealed mixed results: GPT-4o showed no statistically significant ordering bias (p &gt; 0.05), while Claude 3.5 Haiku exhibited significant positional sensitivity (p &lt; 0.001). These results demonstrate MedBench-IT's ability to detect ordering bias when present, revealing model-specific robustness differences.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Impact of Reasoning Prompts</title>
        <p>Comparing standard direct-answering versus reasoning-eliciting prompts revealed nuanced results (Figure 3). Unlike benchmarks where Chain-of-Thought significantly boosts performance [20, 4], many top-performing models on MedBench-IT showed no substantial gains, with some exhibiting slightly lower accuracy. Models like DeepSeek Reasoner, o1-preview, and GPT-4o performed slightly worse with reasoning prompts. Some mid-range or smaller models, such as Llama 3.1 8B Instruct, showed slight increases.</p>
        <p>This suggests capable models efficiently arrive at answers without requiring explicit, complex reasoning chains. The forced reasoning step might introduce unnecessary processing for some architectures. Analysis showed models tend to produce shorter explanations when correct compared to incorrect answers, indicating more concise justifications for correct answers derived directly.</p>
      </sec>
      <sec id="sec-4-6">
        <title>5.5. Readability Correlation</title>
        <p>Analysis investigating the relationship between question
text readability (Flesch Reading Ease score for Italian,
Formula di Flesch-Vacca) and model accuracy revealed
a statistically significant, albeit small, inverse
correlation. Logistic regression showed lower readability scores
(more complex text) were associated with slightly lower
odds of correct answers (standard: OR ≈ 0.997 per point
increase, p &lt; 0.001; reasoning: OR ≈ 0.999 per point
increase, p &lt; 0.001).</p>
        <p>While statistically significant, the small effect size suggests text readability is a minor factor compared to subject knowledge, reasoning complexity, or inherent model capabilities in determining MedBench-IT performance.</p>
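        <p>This analysis can be sketched with textstat and statsmodels; textstat's Italian mode applies the Flesch-Vacca formula. The data frame below is synthetic and the variable names are assumptions; in the study each row would be one benchmark question.</p>
        <preformat>
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import textstat

# Score Italian text with the Flesch-Vacca variant of Flesch Reading Ease.
textstat.set_lang("it")
print(textstat.flesch_reading_ease("La plasmolisi avviene nelle cellule vegetali."))

# Logistic regression of per-question correctness on readability,
# fitted here on synthetic placeholder data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "readability": rng.normal(50.0, 15.0, size=500),
    "correct": rng.binomial(1, 0.8, size=500),
})

fit = smf.logit("correct ~ readability", data=df).fit(disp=False)
print(np.exp(fit.params["readability"]))  # odds ratio per one-point increase
        </preformat>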
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <p>The evaluation results on MedBench-IT provide several
key insights into current LLM capabilities for Italian
medical entrance examinations.</p>
      <p>The benchmark successfully differentiates performance across models, with top-tier proprietary models (DeepSeek Reasoner, o1-preview, Claude 3.5 Sonnet, GPT-4o) substantially outperforming most open-source alternatives. However, promising mid-sized open-source models (Qwen 2.5 14B, Phi-4) and Italian fine-tunes show competitive results suitable for resource-constrained environments.</p>
      <p>Subject-specific analysis reveals Logic and
Mathematics as major bottlenecks across all models, suggesting
abstract and multi-step reasoning remains challenging
compared to knowledge retrieval tasks in Biology or
Chemistry. This aligns with observations from other
challenging benchmarks.</p>
      <p>The reproducibility analysis shows non-negligible variability (11% response difference, 6% correctness inconsistency for GPT-4o), particularly in Logic and Mathematics, cautioning against over-interpreting small performance differences on single runs with non-deterministic sampling.</p>
      <p>Interestingly, explicit reasoning prompts showed a nuanced impact, unlike other benchmarks where Chain-of-Thought is essential. Top models often performed slightly worse with reasoning prompts, suggesting they employ efficient internal pathways for these question types. Smaller models showed slight benefits, and shorter reasoning correlated with correctness, indicating potential verbosity when uncertain.</p>
      <p>The low correlation with text readability confirms that domain knowledge and reasoning, rather than linguistic complexity, drive difficulty in MedBench-IT.</p>
      <p>Overall, MedBench-IT provides a valuable, challenging testbed for the Italian NLP community, highlighting current strengths and weaknesses while supporting evaluation of practical, deployable models for Italian educational applications.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>While MedBench-IT provides a valuable contribution, several limitations should be acknowledged.</p>
      <p>To begin with, the benchmark relies exclusively on a multiple-choice question (MCQ) format, which may not fully capture the depth of understanding compared to open-ended questions. Furthermore, no few-shot evaluation was conducted. This is an interesting extension, particularly for the reasoning approach, where providing complete CoT traces can improve model performance, especially for smaller models. The dataset, while expert-curated, covers preparatory materials and may not fully represent the complexity of advanced medical training. It also does not include context documents, limiting its use for evaluating Retrieval-Augmented Generation (RAG) architectures, which can significantly improve performance.</p>
      <p>The potential for data contamination in the pre-training corpora of the evaluated LLMs cannot be entirely ruled out, even if unlikely given our data source.</p>
      <p>Our robustness analyses were conducted on a limited subset of models, so findings may not generalize. Finally, MedBench-IT is text-only and does not evaluate multimodal reasoning (e.g., interpreting diagrams).</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Future Work</title>
      <p>In this paper, we introduced MedBench-IT, the first large-scale benchmark focused on evaluating LLMs on Italian medical university entrance examination questions. By curating 17,410 expert-written questions from a leading publisher, Edizioni Simone, MedBench-IT provides a challenging and contextually relevant testbed spanning six key subjects pertinent to Italian medical admissions.</p>
      <p>Our evaluation reveals a clear performance hierarchy. Top proprietary models (DeepSeek Reasoner, o1-preview) achieve near-90% accuracy, while leading open-source models like Phi-4 and Qwen 2.5 14B exceed 70%. Italian fine-tunes perform competitively at around 60%, demonstrating progress in sub-30B parameter models suitable for practical deployment. Logic and Mathematics consistently emerged as the most challenging subjects, indicating complex reasoning remains difficult, while knowledge-intensive subjects like Biology and Chemistry showed higher performance.</p>
      <p>Our robustness analyses confirmed ordering bias resistance and good overall reproducibility, though significant variability in Logic and Mathematics emphasizes caution when interpreting complex reasoning results. Explicit reasoning prompts showed a nuanced impact, often providing little gain or slight decreases for top models, suggesting MedBench-IT tests applied knowledge and implicit reasoning pathways effectively.</p>
      <p>MedBench-IT provides a valuable standardized tool for the Italian NLP community to measure progress, diagnose weaknesses, and evaluate models for Italian EdTech applications.</p>
      <p>Future work includes expanding the question set to more advanced medical examinations, conducting deeper qualitative error analysis, and exploring evaluation formats beyond multiple-choice. Furthermore, the complete leaderboard will be hosted and continuously updated on a website maintained by OnePix Academy, allowing for the submission and evaluation of new models.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This research was financed by OnePix Academy as part of their effort in Italian EdTech research. We thank Edizioni Simone for providing the dataset used in this benchmark as part of a commercial partnership with OnePix Academy.</p>
    </sec>
    <sec id="sec-data">
      <title>Data Availability and Leaderboard</title>
      <p>Due to the proprietary nature of the source material from Edizioni Simone, the question dataset itself cannot be publicly redistributed. Researchers interested in replicating the benchmark or accessing the data for research purposes should contact the corresponding author to inquire about potential data sharing agreements facilitated through the commercial partnership. As previously mentioned, the complete leaderboard results, including performance metrics for all evaluated models (including those not detailed in the main paper tables and figures) and potentially future model submissions, will be made available and maintained on a dedicated website hosted by OnePix Academy. Interested parties can contact the authors or OnePix Academy for information on submitting new models for evaluation on MedBench-IT.</p>
    </sec>
    <sec id="sec-appendix-a">
      <title>A. Additional Question Examples</title>
      <p>This appendix provides additional examples of questions from the MedBench-IT dataset, illustrating different subjects. The example for Biology is included in the main text (Section 3.2). Each question below is presented in Italian, followed by its English translation and the correct answer index.</p>
      <sec id="sec-a-1">
        <title>A.1. General Culture Example</title>
        <p>Domanda: Quale delle seguenti è la negazione dell'enunciato "Tutti i bambini amano il gelato"?
Possibili risposte:
1. [Opzione 1]
2. [Opzione 2]
3. [Opzione 3]
4. [Opzione 4]
5. [Opzione 5]
(Risposta corretta: [Index for 'Almeno un bambino non ama il gelato' or similar])
Question (English Translation): Which of the following is the negation of the statement "All children love ice cream"?
Possible answers:
1. [Option 1]
2. [Option 2]
3. [Option 3]
4. [Option 4]
5. [Option 5]
(Correct Answer: [Index for 'At least one child does not love ice cream' or similar])</p>
      </sec>
      <sec id="sec-5-1">
        <title>A.2. Logic Example</title>
        <p>5. [Opzione 5]
(Risposta corretta: [Index for ’Almeno un
bambino non ama il gelato’ or similar])
Question (English Translation): Which of
the following is the negation of the statement
"All children love ice cream"?
Possible answers:
1. [Option 1]
2. [Option 2]
3. [Option 3]
4. [Option 4]
5. [Option 5]
(Correct Answer: [Index for ’At least one child
does not love ice cream’ or similar])</p>
        <sec id="sec-5-1-1">
          <title>Domanda: Se e solo se Giulia a luglio non va</title>
          <p>in vacanza in montagna, va poi in vacanza al
mare ad agosto. Giulia è andata sulle Dolomiti
a luglio, dunque non andrà ad agosto al mare.
Quale delle seguenti affermazioni segue la
stessa struttura logica del suddetto
ragionamento?
Possibili risposte:
1. Carolina, se acquista molte borse, spende
molti soldi. Carolina ha acquistato molte
borse, dunque ha speso molti soldi
2. Clotilde non va in motorino la sera tardi,
se piove. Stasera non ha piovuto, dunque è
andata in motorino
3. Elisa mangia le fragole a cena se e solo se a
pranzo non mangia albicocche. Ha già
mangiato albicocche a pranzo, dunque a cena non
mangia le fragole
4. Solo se Clara studia molto, supera gli esami.
Clara ha superato gli esami, dunque ha
studiato molto
5. Se Riccardo non gioca a calcio, non è in
forma per giocare a tennis. Riccardo non gioca
a tennis, dunque non ha giocato a calcio
(Risposta corretta: 3)
Question (English Translation): If and only
if Giulia does not go on holiday to the
mountains in July, she then goes on holiday to the
sea in August. Giulia went to the Dolomites
in July, therefore she will not go to the sea in
August. Which of the following statements
follows the same logical structure as the
reasoning above?
Possible answers:
1. Carolina, if she buys many bags, spends
a lot of money. Carolina bought many bags,
therefore she spent a lot of money
2. Clotilde does not ride her scooter late at
night if it rains. Tonight it did not rain,
therefore she went on her scooter
3. Elisa eats strawberries for dinner if and
only if she does not eat apricots for lunch.
She already ate apricots for lunch, therefore
she does not eat strawberries for dinner
4. Only if Clara studies hard, does she pass
the exams. Clara passed the exams, therefore
she studied hard
5. If Riccardo does not play football, he is not
fit to play tennis. Riccardo does not play
tennis, therefore he did not play football
(Correct Answer: 3)</p>
      </sec>
      <sec id="sec-5-2">
        <title>A.3. Physics Example</title>
        <sec id="sec-5-2-1">
          <title>Domanda: In quale sistema una tonnellata è</title>
          <p>un multiplo?
Possibili risposte:
1. Nel sistema delle dozzine
2. Nel sistema binario
3. Nel sistema esadecimale
4. Nel sistema decimale
5. Nessuna delle altre
(Risposta corretta: 4)
Question (English Translation): In which
system is a ton (tonne) a multiple?
Possible answers:
1. In the duodecimal system (base 12)
2. In the binary system
3. In the hexadecimal system
4. In the decimal system
5. None of the others
(Correct Answer: 4)</p>
      </sec>
      <sec id="sec-5-3">
        <title>A.4. Chemistry Example</title>
        <sec id="sec-5-3-1">
          <title>Domanda: A quante moli corrispondono 5</title>
          <p>mL (d=1,8 g· cm− 3) di un composto avente
una massa molare di 450 g· mol− 1?
Possibili risposte:
1. [Option 1 - e.g., 0.01 mol]
2. [Option 2 - e.g., 0.02 mol]
3. [Option 3 - e.g., 0.04 mol]
4. [Option 4 - e.g., 0.1 mol]
5. [Option 5 - e.g., 0.2 mol]
(Risposta corretta: [Index for 0.02 mol])
Question (English Translation): How many moles correspond to 5 mL (d = 1.8 g·cm⁻³) of a compound having a molar mass of 450 g·mol⁻¹?
Possible answers:
1. [Option 1]
2. [Option 2]
3. [Option 3]
4. [Option 4]
5. [Option 5]
(Correct Answer: [Index for 0.02 mol])</p>
      </sec>
      <sec id="sec-a-5">
        <title>A.5. Mathematics Example</title>
        <p>Domanda: Dati tre segmenti AA', BB' e CC' tali che: AA' = 2 cm, BB' = 1,5 * AA', CC' = 2,0 * BB'. Quale triangolo è possibile costruire con questi lati?
Possibili risposte:
1. Non è possibile costruire nessun triangolo
2. Un triangolo rettangolo
3. Un triangolo ottusangolo
4. Un triangolo scaleno
5. Un triangolo acutangolo
(Risposta corretta: 1)
Question (English Translation): Given three segments AA', BB', and CC' such that: AA' = 2 cm, BB' = 1.5 * AA', CC' = 2.0 * BB'. Which triangle is it possible to construct with these sides?
Possible answers:
1. It is not possible to construct any triangle
2. A right-angled triangle
3. An obtuse-angled triangle
4. A scalene triangle
5. An acute-angled triangle
(Correct Answer: 1)</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>B. Per-Subject Model Performance</title>
      <p>Full per-subject accuracy results for all evaluated models are reported in Table 4.</p>
    </sec>
    <sec id="sec-decl-ai">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini (Google) in order to paraphrase and reword text and improve writing style. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Proceedings</surname>
          </string-name>
          , Venice, Italy,
          <year>2023</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>119</lpage>
          . URL: ing Capability in LLMs via Reinforcement Learning,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          https://aclanthology.org/
          <year>2023</year>
          .clicit-
          <volume>1</volume>
          .15/.
          <year>2025</year>
          . URL: http://arxiv.org/abs/2501.12948. [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Altuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karunakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          , doi:10.48550/arXiv.2501.12948,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Speranza</surname>
          </string-name>
          , R. Zanoli, CLinKaRT at EVALITA arXiv:
          <volume>2501</volume>
          .12948 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>2023: Overview of the task on linking a lab re-</article-title>
          [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Groeneveld</surname>
          </string-name>
          , et al.,
          <source>OLMo: Accelerating the sci-</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          , M. San- V. Srikumar (Eds.),
          <source>Proceedings of the 62nd An-</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>ings of the 8th Evaluation Campaign of Natural Linguistics (Volume 1: Long Papers)</article-title>
          , Association
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>ian (EVALITA</source>
          <year>2023</year>
          ), Accademia University Press,
          <year>2024</year>
          , pp.
          <fpage>15789</fpage>
          -
          <lpage>15809</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Parma</surname>
          </string-name>
          , Italy,
          <year>2023</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>492</lpage>
          . URL: https:// org/2024.
          <article-title>acl-long</article-title>
          .
          <volume>841</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>ceur-ws.org/</source>
          Vol-
          <volume>3473</volume>
          /paper43.pdf.
          <article-title>acl-long</article-title>
          .
          <volume>841</volume>
          . [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Puccetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cassese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , The Invalsi Bench- [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Üstün</surname>
          </string-name>
          , et al.,
          <source>Aya Model: An</source>
          Instruc-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>cal understanding of Large Language Models in guage Model</source>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Italian</surname>
          </string-name>
          , in: O.
          <string-name>
            <surname>Rambow</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wanner</surname>
          </string-name>
          , M. Apidi-
          <volume>2402</volume>
          .07827. doi:
          <volume>10</volume>
          .48550/arXiv.2402.07827,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>anaki</surname>
          </string-name>
          , H.
          <string-name>
            <surname>Al-Khalifa</surname>
            ,
            <given-names>B. D.</given-names>
          </string-name>
          <string-name>
            <surname>Eugenio</surname>
          </string-name>
          , S. Schockaert arXiv:
          <volume>2402</volume>
          .07827 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 31st International</source>
          Confer- [29]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
          </string-name>
          , S. Co-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          , Abu Dhabi,
          <string-name>
            <surname>UAE</surname>
          </string-name>
          ,
          <year>2025</year>
          , igli, Minerva LLMs:
          <article-title>The first family of large</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          pp.
          <fpage>6782</fpage>
          -
          <lpage>6797</lpage>
          . URL: https://aclanthology.org/
          <year>2025</year>
          .
          <article-title>language models trained from scratch on Italian</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          coling-main.
          <volume>453</volume>
          /. data, in: F.
          <string-name>
            <surname>Dell'Orletta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            , S. Montemagni, [19]
            <given-names>T. Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>formance of Language Models</source>
          ,
          <year>2021</year>
          . URL: http: it
          <year>2024</year>
          ), CEUR Workshop Proceedings, Pisa, Italy,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          //arxiv.org/abs/2102.09690. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          2102.09690, arXiv:
          <fpage>2102</fpage>
          .09690 [cs].
          <year>2024</year>
          .clicit-
          <volume>1</volume>
          .77/. [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Chain-</surname>
            of-Thought Prompting [30]
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Kwon</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Sheng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          2023. URL: http://arxiv.org/abs/2201.11903.
          <article-title>ory management for large language model serving</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2201.11903,
          <string-name>
            <surname>with</surname>
            <given-names>pagedattention</given-names>
          </string-name>
          ,
          <source>in: Proceedings of the ACM</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>arXiv:2201</source>
          .11903 [cs].
          <source>SIGOPS 29th Symposium on Operating Systems</source>
          [21]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mirzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shahrokhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tuzel</surname>
          </string-name>
          , Principles,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>ing in large language models</source>
          ,
          <year>2024</year>
          . URL: https: A. Additional Question Examples
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          //arxiv.org/abs/2410.05229. arXiv:
          <volume>2410</volume>
          .
          <fpage>05229</fpage>
          . [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <source>Qwen2</source>
          .
          <article-title>5 technical report, This appendix provides additional examples of questions</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          2025. URL: https://arxiv.org/abs/2412.15115.
          <article-title>from the MedBench-IT dataset, illustrating diferent sub-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>arXiv:2412</source>
          .15115. jects.
          <article-title>The example for Biology is included in the main</article-title>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Gemma: Open Models Based on Gemini text (Subsection ??). Each question below is presented in</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Research</surname>
          </string-name>
          and Technology,
          <year>2024</year>
          . URL: http://arxiv. Italian, followed
          <article-title>by its English translation and the correct</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>org/abs/2403</source>
          .08295. doi:
          <volume>10</volume>
          .48550/arXiv.2403. answer index.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          08295, arXiv:
          <fpage>2403</fpage>
          .08295 [cs]. [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          , et al.,
          <source>The Llama 3 Herd of Models,</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>2024. URL: http://arxiv.org/abs/2407.21783. A.1. General Culture Example</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2407.21783,
          <string-name>
            <surname>Domanda</surname>
            <given-names>:</given-names>
          </string-name>
          <article-title>Quale delle seguenti è la</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>arXiv:2407</source>
          .21783 [cs].
          <source>negazione dell'enunciato "Tutti i bambini</source>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          , et al.,
          <source>Phi-3 Technical Report: A amano il gelato”?</source>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Your</given-names>
            <surname>Phone</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/ 1. [Opzione 1]
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          2404.14219. doi:
          <volume>10</volume>
          .48550/arXiv.2404.
          <issue>14219</issue>
          ,
          <issue>2</issue>
          . [Opzione 2]
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>arXiv:2404.14219 [cs]. 3. [Opzione</source>
          <volume>3</volume>
          ] [26]
          <string-name>
            <surname>DeepSeek-AI</surname>
          </string-name>
          ,
          <article-title>DeepSeek-R1: Incentivizing Reason- 4. [Opzione 4] Domanda: Dati tre segmenti AA', BB' e CC' tali che: AA' = 2 cm, BB' = 1,5 * AA', CC' = 2,0 * BB'</article-title>
          .
          <article-title>Quale triangolo è possibile costruire con questi lati? Possibili risposte: 1. Non è possibile costruire nessun triangolo 2. Un triangolo rettangolo 3. Un triangolo ottusangolo 4. Un triangolo scaleno 5. Un triangolo acutangolo (Risposta corretta: 1) Question (English Translation): Given three segments AA', BB', and CC' such that: AA' = 2 cm, BB' = 1.5 * AA', CC' = 2.0 * BB'</article-title>
          .
          <article-title>Which triangle is possible to construct with these sides? Possible answers: 1. It is not possible to construct any triangle 2. A right-angled triangle 3. An obtuse-angled triangle 4. A scalene triangle 5. An acute-angled triangle (Correct Answer: 1)</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>