<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruggero Marino Lazzaroni</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Angioi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Puliga</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Sanna</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Marras</string-name>
        </contrib>
        <aff>University of Graz</aff>
        <aff>OnePix Academy</aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (&lt;30B parameters), focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM Evaluation</kwd>
        <kwd>Benchmark</kwd>
        <kwd>Italian NLP</kwd>
        <kwd>Medical Education</kwd>
        <kwd>Question Answering</kwd>
        <kwd>Educational Technology</kwd>
        <kwd>CLiC-it</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks [1], transforming artificial intelligence applications. Their potential in specialized domains, particularly education [2, 3], offers promise for personalized learning, assessment, and high-stakes examination support. As LLMs advance, rigorous and contextually relevant evaluation methodologies become essential.</p>
      <p>However, a significant portion of existing LLM benchmarks are predominantly English-centric [4, 5, 6], and resources for non-English languages, especially in specific, demanding domains, remain comparatively scarce. This gap is particularly evident for the Italian language, where the lack of specialized benchmarks [7, 8, 9] can hinder the objective assessment of LLM performance, limit the development of tailored educational technologies, and necessitate reliance on translated materials which may be imperfect or fail to capture local educational nuances.</p>
      <p>In this paper, we introduce MedBench-IT, a novel and comprehensive benchmark specifically designed to evaluate the performance of LLMs on Italian medical university entrance examination questions. Sourced from Edizioni Simone, a leading Italian publisher of preparatory materials, MedBench-IT comprises 17,410 expert-written, multiple-choice questions. These questions span six core subjects (Biology, Chemistry, Logic, General Culture, Mathematics, and Physics) and are categorized into three distinct difficulty levels, mirroring the structure of the actual Italian medical admissions tests. Our evaluation encompasses a diverse range of models, including leading proprietary LLMs (e.g., GPT-4o, Claude series) and resource-efficient open-source alternatives (&lt;30B parameters), with a particular focus on models practical for deployment in various Italian organizational contexts.</p>
      <p>Our evaluation methodology begins with standard accuracy assessments and is then augmented with several in-depth analyses designed to probe model robustness and behavior. These include rigorous tests for reproducibility (examining response consistency across identical runs), ordering bias (assessing sensitivity to the permutation of answer choices), and the impact of explicit reasoning prompts on model performance. We also investigate the relationship between question text readability and model accuracy, providing further dimensions for understanding model capabilities.</p>
      <p>Our primary contributions include:</p>
      <p>• The creation and presentation of MedBench-IT, the first large-scale benchmark specifically for Italian medical entrance exam questions, curated from expert-validated sources, meant to be a valuable resource for fostering LLMs for the Italian language, particularly within its educational technology sector.
• An extensive empirical evaluation of a diverse set of state-of-the-art and practically deployable LLMs on MedBench-IT.
• In-depth analyses of model consistency (reproducibility), robustness (ordering bias), and the differential impact of direct versus reasoning-eliciting prompting strategies.
• Actionable insights into factors such as subject matter, question difficulty, and text readability that influence LLM performance within this specific Italian educational context.</p>
      <sec id="sec-1-1">
        <title>2.2. Medical Domain LLM Benchmarks</title>
        <p>In the medical domain, benchmarks such as MedQA
[11], PubMedQA [12], MedExQA [13], and challenges</p>
        <p>The remainder of this paper is structured as follows:
Section 2 discusses related work in LLM evaluation and
Italian NLP resources. Section 3 details the construction
and characteristics of the MedBench-IT dataset. Section 4
outlines our experimental setup, including the core
evaluation and subsequent analytical tests. Section 5 presents
and analyzes the results from these evaluations. Section 6
discusses the broader implications of our findings.
Section 7 acknowledges the limitations of our study, and
Section 8 concludes the paper with directions for future
work.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The evaluation of Large Language Models (LLMs) is a
rapidly evolving field, with numerous benchmarks
developed to assess their capabilities across various
dimensions.</p>
      <sec id="sec-2-1">
        <title>2.1. General LLM Evaluation Benchmarks</title>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. LLM Evaluation and Resources in</title>
      </sec>
      <sec id="sec-2-3">
        <title>Italian</title>
        <p>The Italian NLP community has developed evaluation frameworks like the CALAMITA challenge [9], which includes the Mult-IT dataset [15] with questions from Italian university entrance and public sector exams. Medical domain efforts include work on specialty tests [16] and shared tasks like CLinkaRT at EVALITA 2023, focused on the clinical domain [17]. MedBench-IT distinguishes itself through its specific medical entrance exam focus, a larger specialized corpus (17,410 medical questions), detailed subject/difficulty breakdowns, and comprehensive robustness analyses.</p>
        <p>Other evaluation suites for Italian, such as ItaEval
[7] and ITA-Bench [8], aim to provide broader
assessments of LLM capabilities, often by translating existing
English benchmarks or adapting various Italian datasets.</p>
        <p>In the educational context, benchmarks derived from INVALSI tests (standardized national assessments) like those discussed by Puccetti et al. [18] can assess linguistic and mathematical understanding. Unlike these general-purpose benchmarks, MedBench-IT focuses specifically on medical entrance exams using native Italian content.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Studies on LLM Robustness and Reasoning</title>
        <p>Beyond accuracy, LLM robustness and reasoning capabilities are critical areas of investigation. Prior research has highlighted LLM sensitivity to prompt variations [19], ordering biases in multiple-choice questions [4], and reproducibility challenges. Chain-of-Thought (CoT) prompting [20] efficacy varies by model and task complexity. Recent work [21] revealed significant limitations in LLM mathematical reasoning, showing apparent proficiency may depend more on pattern recognition than genuine understanding. Our work incorporates these considerations by establishing baseline performance and conducting specific experiments to assess reproducibility, ordering bias, and reasoning-eliciting prompt impact on MedBench-IT.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The MedBench-IT Benchmark</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset Construction</title>
        <p>MedBench-IT comprises multiple-choice questions provided by Edizioni Simone, a leading Italian publisher of medical entrance exam preparatory materials. Questions are expert-authored to accurately reflect official Italian medical admission exam style, content, and difficulty.</p>
        <p>From an initial corpus of 43,525 questions, we applied filtering steps: (1) removed image-reliant questions for text-based LLM compatibility; (2) excluded English subject questions; (3) stripped XML/HTML markup; (4) standardized format to question stem, five answer options, and single correct answer. After preprocessing, we selected a stratified sample of 17,410 questions maintaining original subject and difficulty proportions, inspired by MMLU's comparable size for balanced coverage and evaluation manageability.</p>
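        <p>The stratified sampling step can be illustrated with a short pandas sketch. This is not the authors' code: the file name, column names, and random seed below are assumptions for illustration only.</p>
        <preformat>
import pandas as pd

# Load the cleaned question corpus (hypothetical file and column names).
questions = pd.read_csv("medbench_it_cleaned.csv")
# expected columns: text, options, answer, subject, difficulty

TARGET_SIZE = 17_410

# Sample within each (subject, difficulty) stratum proportionally to its
# share of the corpus, preserving the original distribution.
fraction = TARGET_SIZE / len(questions)
sample = (
    questions
    .groupby(["subject", "difficulty"], group_keys=False)
    .apply(lambda g: g.sample(frac=fraction, random_state=42))
)

print(len(sample))  # approximately 17,410, up to rounding within strata
        </preformat>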
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Characteristics and</title>
      </sec>
      <sec id="sec-3-3">
        <title>Prompting</title>
        <p>The final dataset contains 17,410 questions with metadata indicating subject and difficulty level. Table 1 shows Biology (28.1%) and Chemistry (22.9%) as the largest portions, followed by Logic (17.3%), General Culture (13.2%), Mathematics (9.6%), and Physics (8.9%). Table 2 shows Level 1/Base (46.1%), Level 2/Intermediate (41.1%), and Level 3/Advanced (12.8%) distributions.</p>
        <p>An example Biology question:</p>
        <sec id="sec-3-3-1">
          <title>Domanda: La plasmolisi:</title>
          <p>Possibili risposte:
1. Avviene nelle cellule animali
2. E’ lo scollamento della membrana
plasmatica dalla parete nelle cellule vegetali
3. E’ causata da un eccessivo turgore della
cellula
4. Avviene in ambiente ipotonico
5. E’ la rottura della membrana cellulare nei
globuli rossi
(Risposta corretta: 2)
Question (English Translation):
Plasmolysis:
Possible answers:
1. Occurs in animal cells
2. Is the detachment of the plasma membrane
from the wall in plant cells
3. Is caused by excessive turgor of the cell
4. Occurs in a hypotonic environment
5. Is the rupture of the cell membrane in red
blood cells
(Correct Answer: 2)</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.3. Prompting Strategies</title>
        <p>To evaluate LLM performance on MedBench-IT, we employed two distinct zero-shot prompting strategies:
1. Standard Prompt (Direct Answering): This prompt presents the question and answer choices directly, asking the model to select the number corresponding to the correct answer. The format, presented to the models in Italian, is as follows (see Listing 1).
2. Reasoning-Eliciting Prompt: This prompt additionally asks the model to articulate its reasoning before giving the final answer number (see Listing 2).</p>
        <p>Listing 1: Standard Prompt Format used in MedBench-IT.</p>
        <p>Listing 2: Reasoning-Eliciting Prompt Format used in MedBench-IT.</p>
        <p>The models were instructed to output only the reasoning (if prompted) and the final answer number in the specified format. For experiments utilizing the reasoning prompt, the reasoning text was used for qualitative analysis, while only the numerical answer was used for accuracy scoring.</p>
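        <p>The listings themselves are not reproduced here; the Python sketch below shows plausible templates consistent with the description above. The exact Italian wording is an assumption, not the authors' verbatim prompts.</p>
        <preformat>
# Plausible reconstructions of the two zero-shot prompt formats.
# The Italian wording below is an assumption, not the paper's verbatim text.

STANDARD_PROMPT = """Domanda: {question}
Possibili risposte:
{options}
Rispondi solo con il numero della risposta corretta."""
# (Answer only with the number of the correct answer.)

REASONING_PROMPT = """Domanda: {question}
Possibili risposte:
{options}
Spiega brevemente il tuo ragionamento, poi indica il numero
della risposta corretta nel formato 'Risposta: N'."""
# (Briefly explain your reasoning, then give 'Risposta: N'.)

def build_prompt(question: str, options: list[str], reasoning: bool = False) -> str:
    """Fill either template with a question stem and its five options."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    template = REASONING_PROMPT if reasoning else STANDARD_PROMPT
    return template.format(question=question, options=numbered)
        </preformat>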
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>This section outlines the methodology employed for evaluating various Large Language Models (LLMs) on the MedBench-IT benchmark. We detail the models selected for evaluation, the primary metrics used, and the specific protocols for our specialized analyses.</p>
      <sec id="sec-4-1">
        <title>4.1. Models Evaluated</title>
        <p>A diverse range of LLMs was selected for evaluation on MedBench-IT, encompassing leading proprietary models and prominent open-source alternatives. The selection aimed to provide a comprehensive overview of current model capabilities, including models with a specific focus on Italian language tasks and those chosen for practical deployability.</p>
        <p>The open-source models evaluated represent the latest iterations of various families and sizes at the time of experimentation, including several fine-tuned for Italian:
• Qwen 2.5 series [22]: Including instruct versions from 0.5B to 14B parameters (e.g., Qwen 2.5 7B Instruct).
• Gemma 2 series [23]: Including instruct-tuned versions (Gemma 2 2B IT, Gemma 2 9B IT) and community fine-tunes focused on Italian.
• Llama 3 series and fine-tunes [24]: Models such as Llama 3.1 8B Instruct and various Italian fine-tunes contributed by the community.
• Phi series [25]: Including Phi-4.
• DeepSeek series [26]: Including models accessed via API: DeepSeek Chat (equivalent to DeepSeek-V3), DeepSeek Reasoner (equivalent to DeepSeek-R1), and locally deployed distilled models (e.g., DeepSeek R1 Distill Qwen 7B).
• OLMo 2 series [27]: OLMo 2 7B Instruct and OLMo 2 13B Instruct.
• Other notable models: Including Aya Expanse 8B [28], and models from the Minerva family by SapienzaNLP [29].</p>
        <p>All open-source models were run locally using standard libraries such as the vLLM framework [30]. For proprietary models, official APIs were used during the experimentation period (between December 2024 and January 2025). Unless specified otherwise (e.g., for reproducibility tests), a sampling temperature of 0 was used for all models to promote deterministic outputs for the main evaluation runs.</p>
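        <p>As an illustration of the local inference setup, the sketch below runs a batch of MedBench-IT-style prompts through vLLM with greedy decoding. The model name and token budget are examples, not the paper's exact configuration.</p>
        <preformat>
from vllm import LLM, SamplingParams

# Illustrative local-inference setup; model choice and max_tokens
# are assumptions for the sketch, not the paper's configuration.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Temperature 0 (greedy decoding) to promote deterministic outputs,
# as used for the main evaluation runs.
params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = [
    "Domanda: La plasmolisi:\nPossibili risposte:\n1. ...\n5. ...\n"
    "Rispondi solo con il numero della risposta corretta.",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text.strip())
        </preformat>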
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Metrics</title>
        <p>The primary metric used for evaluating model performance on MedBench-IT is accuracy, calculated as the percentage of questions for which the model provided the correct answer out of the total number of questions evaluated:</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\mathrm{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}} \times 100\%</tex-math>
        </disp-formula>
        <p>Accuracy was computed overall, as well as broken down by:
• Subject area (Biology, Chemistry, Logic, etc.).
• Difficulty level (Level 1, Level 2, Level 3).</p>
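        <p>A minimal sketch of this computation, including the per-subject and per-difficulty breakdowns, assuming a pandas DataFrame of graded responses with hypothetical column names:</p>
        <preformat>
import pandas as pd

# Hypothetical per-question results: one row per question.
results = pd.DataFrame({
    "subject":    ["Biology", "Logic", "Logic", "Physics"],
    "difficulty": [1, 2, 3, 1],
    "correct":    [True, False, True, True],  # model answer == gold answer
})

# Overall accuracy (Equation 1), as a percentage.
overall = results["correct"].mean() * 100

# Breakdowns by subject area and by difficulty level.
by_subject = results.groupby("subject")["correct"].mean() * 100
by_difficulty = results.groupby("difficulty")["correct"].mean() * 100

print(f"Overall accuracy: {overall:.1f}%")
print(by_subject.round(1))
print(by_difficulty.round(1))
        </preformat>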
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Specialized Analyses Setup</title>
        <p>In addition to standard accuracy evaluation, we
conducted several specialized analyses to assess model
robustness and behavior:</p>
        <sec id="sec-4-3-1">
          <title>2https://huggingface.co/anakin87/gemma-2-9b-neogenesis-ita</title>
          <p>3https://huggingface.co/mii-llm/maestrale-chat-v0-4-beta</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>1. Reproducibility Test: To assess response con</title>
          <p>sistency, we evaluated GPT-4o twice on the entire 14B 76.8 67.9
MedBench-IT dataset using identical parameters 14B 72.6 76.9
(standard prompt, temperature 1). We compared 97BB 6612..71 6697..42
question-by-question responses, calculating per- 7B 61.1 67.6
centages of identical answers and consistent cor- 8B 50.3 57.4
rectness across runs (Section 5.2). 7B 50.8 53.0
2. Ordering Bias Test: To investigate whether an- 8B 46.7 0.1
swer option order influences predictions, we eval- 02.5BB 4213..12 3149..32
uated selected models (GPT-4o and Claude 3.5
Haiku) on both the original dataset and a version a Par. = Parameters (B = billion, – = proprietary)
with shufled answer options, comparing
accuracy scores to identify performance deviations
attributable to ordering (Section 5.3). Top proprietary models and large open-source models
3. Reasoning Impact Test: All models were eval- like DeepSeek Reasoner and o1-preview achieve accuracy
uated using both standard direct-answering and around or above 90%, followed by Claude 3.5 Sonnet and
reasoning-eliciting prompts. Accuracy scores and GPT-4/4o series in the mid-to-high 80s. Open-source
reasoning text length were analyzed for correla- models demonstrate strong capabilities, with Phi-4 and
tions with answer correctness (Section 5.4). Qwen 2.5 14B Instruct achieving 70%+ accuracy. Models
4. Readability Analysis: We calculated Flesch like Gemma 2 9B Instruct, Lexora Medium 7B, and Italian
Reading Ease scores (Formula di Flesch-Vacca) for adaptations of Gemma 2 9B (e.g.,
‘anakin87/gemma-2each question using the ‘textstat‘ library1. Logis- 9b-neogenesis-ita‘2) perform respectably around 60-62%.
tic regression analysis determined whether read- Smaller models like Llama 3.1 8B Instruct and the Italian
ability correlates with model performance under Maestrale family3 (based on Mistral 7B) score around 50%,
both prompt conditions (Section 5.5). while many other open-source models, including several
Italian fine-tunes of Llama 3 8B, fall into the 30-50% range.</p>
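        <p>For the ordering bias test, shuffling must also remap the gold answer index. A minimal sketch, where the question data layout is an assumption:</p>
        <preformat>
import random

def shuffle_options(question: dict, seed: int) -> dict:
    """Return a copy of a question with permuted answer options and a
    remapped correct-answer index (1-based, as in MedBench-IT)."""
    rng = random.Random(seed)
    order = list(range(len(question["options"])))
    rng.shuffle(order)
    return {
        "text": question["text"],
        "options": [question["options"][i] for i in order],
        # New 1-based position of the originally correct option.
        "answer": order.index(question["answer"] - 1) + 1,
    }

q = {"text": "La plasmolisi:", "options": ["A", "B", "C", "D", "E"], "answer": 2}
print(shuffle_options(q, seed=0))
        </preformat>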
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <p>This section presents the evaluation results of selected LLMs on MedBench-IT using standard zero-shot prompts, followed by specialized analyses.</p>
      <sec id="sec-5-1">
        <title>5.1. Overall Performance</title>
        <p>Top proprietary models and large open-source models like DeepSeek Reasoner and o1-preview achieve accuracy around or above 90%, followed by Claude 3.5 Sonnet and the GPT-4/4o series in the mid-to-high 80s. Open-source models demonstrate strong capabilities, with Phi-4 and Qwen 2.5 14B Instruct achieving 70%+ accuracy. Models like Gemma 2 9B Instruct, Lexora Medium 7B, and Italian adaptations of Gemma 2 9B (e.g., anakin87/gemma-2-9b-neogenesis-ita, https://huggingface.co/anakin87/gemma-2-9b-neogenesis-ita) perform respectably around 60-62%. Smaller models like Llama 3.1 8B Instruct and the Italian Maestrale family (based on Mistral 7B; https://huggingface.co/mii-llm/maestrale-chat-v0-4-beta) score around 50%, while many other open-source models, including several Italian fine-tunes of Llama 3 8B, fall into the 30-50% range. This ranking shows rapid progress in open-source models while still showing a performance delta compared to the best proprietary systems.</p>
        <p>Subject analysis reveals consistent difficulty patterns (full per-subject results in Appendix B, Table 4). Logic and Mathematics consistently emerge as most challenging for nearly all models. Top models often score 15-25 percentage points lower in Logic compared to Biology or Chemistry (e.g., GPT-4o: 92.4% in Biology vs 64.9% in Logic). This suggests abstract reasoning and multi-step problem-solving remain significant hurdles. Conversely, Biology, Chemistry, and General Culture show higher accuracy, likely reflecting strong factual knowledge capabilities. Physics performance is typically intermediate.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.2. Reproducibility Insights</title>
        <p>The reproducibility test on GPT-4o yielded 88.86% response consistency across two identical runs on 17,410 questions, indicating 11.14% different answer choices despite identical inputs.</p>
        <p>Consistency varied notably across subjects (Figure 1). Higher consistency was observed in knowledge-based subjects like Biology (96.8%) and General Culture (93.0%), while lower consistency was found in subjects requiring complex reasoning: Mathematics (79.8%) and Logic (73.6%). Physics (89.9%) and Chemistry (91.7%) showed intermediate consistency. Across difficulty levels, consistency remained stable (Level 1: 89.8%, Level 2: 88.1%, Level 3: 88.0%).</p>
        <p>Regarding correctness, 80.6% of responses were correct in both runs, 13.2% were incorrect in both runs, and 6.2% showed inconsistent correctness between runs. McNemar's test confirmed differences were not statistically significant (p &gt; 0.05), indicating normal stochastic variation rather than systematic instability.</p>
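        <p>The consistency and McNemar computations can be sketched as follows, assuming two aligned boolean arrays of per-question correctness (the variable names and toy data are hypothetical):</p>
        <preformat>
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-question correctness for two identical runs (aligned).
run_a = np.array([True, True, False, True, False, True])
run_b = np.array([True, False, False, True, True, True])

# 2x2 contingency table of (run A correct?, run B correct?).
both = np.sum(np.logical_and(run_a, run_b))
only_a = np.sum(np.logical_and(run_a, ~run_b))
only_b = np.sum(np.logical_and(~run_a, run_b))
neither = np.sum(np.logical_and(~run_a, ~run_b))
table = [[both, only_a], [only_b, neither]]

# McNemar's test asks whether the two discordant cells (correct in one
# run only) are balanced; imbalance would suggest systematic instability.
result = mcnemar(table, exact=True)
print(f"identical correctness: {np.mean(run_a == run_b):.1%}")
print(f"McNemar p-value: {result.pvalue:.3f}")
        </preformat>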
      </sec>
      <sec id="sec-4-5">
        <title>5.3. Ordering Bias</title>
        <p>The ordering bias test, shuffling answer choices for GPT-4o and Claude 3.5 Haiku, showed minimal impact. GPT-4o's accuracy dropped slightly from 83.9% to 83.5% (-0.4%), and Claude 3.5 Haiku decreased from 80.4% to 79.5% (-0.9%) (Figure 2: performance comparison for GPT-4o and Claude 3.5 Haiku on the standard vs. shuffled MedBench-IT benchmark).</p>
        <p>McNemar's test revealed mixed results: GPT-4o showed no statistically significant ordering bias (p &gt; 0.05), while Claude 3.5 Haiku exhibited significant positional sensitivity (p &lt; 0.001). These results demonstrate MedBench-IT's ability to detect ordering bias when present, revealing model-specific robustness differences.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Impact of Reasoning Prompts</title>
        <p>Comparing standard direct-answering versus reasoning-eliciting prompts revealed nuanced results (Figure 3). Unlike benchmarks where Chain-of-Thought significantly boosts performance [20, 4], many top-performing models on MedBench-IT showed no substantial gains, with some exhibiting slightly lower accuracy. Models like DeepSeek Reasoner, o1-preview, and GPT-4o performed slightly worse with reasoning prompts. Some mid-range or smaller models, such as Llama 3.1 8B Instruct, showed slight increases.</p>
        <p>This suggests capable models efficiently arrive at answers without requiring explicit, complex reasoning chains. The forced reasoning step might introduce unnecessary processing for some architectures. Analysis showed models tend to produce shorter explanations when correct compared to incorrect answers, indicating more concise justifications for correct answers derived directly.</p>
      </sec>
      <sec id="sec-4-6">
        <title>5.5. Readability Correlation</title>
        <p>Analysis investigating the relationship between question
text readability (Flesch Reading Ease score for Italian,
Formula di Flesch-Vacca) and model accuracy revealed
a statistically significant, albeit small, inverse
correlation. Logistic regression showed lower readability scores
(more complex text) were associated with slightly lower
odds of correct answers (standard: OR ≈ 0.997 per point
increase, p &lt; 0.001; reasoning: OR ≈ 0.999 per point
increase, p &lt; 0.001).</p>
        <p>While statistically significant, the small effect size suggests text readability is a minor factor compared to subject knowledge, reasoning complexity, or inherent model capabilities in determining MedBench-IT performance.</p>
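        <p>This analysis can be sketched with textstat and statsmodels; textstat's Italian mode applies the Flesch-Vacca formula. The data frame below is synthetic and the variable names are assumptions; in the study each row would be one benchmark question.</p>
        <preformat>
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import textstat

# Score Italian text with the Flesch-Vacca variant of Flesch Reading Ease.
textstat.set_lang("it")
print(textstat.flesch_reading_ease("La plasmolisi avviene nelle cellule vegetali."))

# Logistic regression of per-question correctness on readability,
# fitted here on synthetic placeholder data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "readability": rng.normal(50.0, 15.0, size=500),
    "correct": rng.binomial(1, 0.8, size=500),
})

fit = smf.logit("correct ~ readability", data=df).fit(disp=False)
print(np.exp(fit.params["readability"]))  # odds ratio per one-point increase
        </preformat>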
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <p>The evaluation results on MedBench-IT provide several
key insights into current LLM capabilities for Italian
medical entrance examinations.</p>
      <p>The benchmark successfully differentiates performance across models, with top-tier proprietary models (DeepSeek Reasoner, o1-preview, Claude 3.5 Sonnet, GPT-4o) substantially outperforming most open-source alternatives. However, promising mid-sized open-source models (Qwen 2.5 14B, Phi-4) and Italian fine-tunes show competitive results suitable for resource-constrained environments.</p>
      <p>Subject-specific analysis reveals Logic and
Mathematics as major bottlenecks across all models, suggesting
abstract and multi-step reasoning remains challenging
compared to knowledge retrieval tasks in Biology or
Chemistry. This aligns with observations from other
challenging benchmarks.</p>
      <p>The reproducibility analysis shows non-negligible variability (11% response difference, 6% correctness inconsistency for GPT-4o), particularly in Logic and Mathematics, cautioning against over-interpreting small performance differences on single runs with non-deterministic sampling.</p>
      <p>Interestingly, explicit reasoning prompts showed a nuanced impact, unlike other benchmarks where Chain-of-Thought is essential. Top models often performed slightly worse with reasoning prompts, suggesting they employ efficient internal pathways for these question types. Smaller models showed slight benefits, and shorter reasoning correlated with correctness, indicating potential verbosity when uncertain.</p>
      <p>The low correlation with text readability confirms that domain knowledge and reasoning, rather than linguistic complexity, drive difficulty in MedBench-IT.</p>
      <p>Overall, MedBench-IT provides a valuable, challenging testbed for the Italian NLP community, highlighting current strengths and weaknesses while supporting evaluation of practical, deployable models for Italian educational applications.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>While MedBench-IT provides a valuable contribution, several limitations should be acknowledged.</p>
      <p>To begin with, the benchmark relies exclusively on a multiple-choice question (MCQ) format, which may not fully capture the depth of understanding compared to open-ended questions. Furthermore, no few-shot evaluation was conducted. This is an interesting extension, particularly for the reasoning approach, where providing complete CoT traces can improve model performance, especially for smaller models. The dataset, while expert-curated, covers preparatory materials and may not fully represent the complexity of advanced medical training. It also does not include context documents, limiting its use for evaluating Retrieval-Augmented Generation (RAG) architectures, which can significantly improve performance.</p>
      <p>The potential for data contamination in the pre-training corpora of the evaluated LLMs cannot be entirely ruled out, even if unlikely given our data source.</p>
      <p>Our robustness analyses were conducted on a limited subset of models, so findings may not generalize. Finally, MedBench-IT is text-only and does not evaluate multimodal reasoning (e.g., interpreting diagrams).</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Future Work</title>
      <p>In this paper, we introduced MedBench-IT, the first large-scale benchmark focused on evaluating LLMs on Italian medical university entrance examination questions. By curating 17,410 expert-written questions from a leading publisher, Edizioni Simone, MedBench-IT provides a challenging and contextually relevant testbed spanning six key subjects pertinent to Italian medical admissions.</p>
      <p>Our evaluation reveals a clear performance hierarchy. Top proprietary models (DeepSeek Reasoner, o1-preview) achieve near-90% accuracy, while leading open-source models like Phi-4 and Qwen 2.5 14B exceed 70%. Italian fine-tunes perform competitively at around 60%, demonstrating progress in sub-30B parameter models suitable for practical deployment. Logic and Mathematics consistently emerged as the most challenging subjects, indicating complex reasoning remains difficult, while knowledge-intensive subjects like Biology and Chemistry showed higher performance.</p>
      <p>Our robustness analyses confirmed ordering bias resistance and good overall reproducibility, though significant variability in Logic and Mathematics emphasizes caution when interpreting complex reasoning results. Explicit reasoning prompts showed a nuanced impact, often providing little gain or slight decreases for top models, suggesting MedBench-IT tests applied knowledge and implicit reasoning pathways effectively.</p>
      <p>MedBench-IT provides a valuable standardized tool for the Italian NLP community to measure progress, diagnose weaknesses, and evaluate models for Italian EdTech applications.</p>
      <p>Future work includes expanding the question set to more advanced medical examinations, conducting deeper qualitative error analysis, and exploring evaluation formats beyond multiple-choice. Furthermore, the complete leaderboard will be hosted and continuously updated on a website maintained by OnePix Academy, allowing for the submission and evaluation of new models.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This research was financed by OnePix Academy as part of their effort in Italian EdTech research. We thank Edizioni Simone for providing the dataset used in this benchmark as part of a commercial partnership with OnePix Academy.</p>
    </sec>
    <sec id="sec-data">
      <title>Data Availability and Leaderboard</title>
      <p>Due to the proprietary nature of the source material from Edizioni Simone, the question dataset itself cannot be publicly redistributed. Researchers interested in replicating the benchmark or accessing the data for research purposes should contact the corresponding author to inquire about potential data sharing agreements facilitated through the commercial partnership. As previously mentioned, the complete leaderboard results, including performance metrics for all evaluated models (including those not detailed in the main paper tables and figures) and potentially future model submissions, will be made available and maintained on a dedicated website hosted by OnePix Academy. Interested parties can contact the authors or OnePix Academy for information on submitting new models for evaluation on MedBench-IT.</p>
    </sec>
    <sec id="sec-appendix-a">
      <title>A. Additional Question Examples</title>
      <p>This appendix provides additional examples of questions from the MedBench-IT dataset, illustrating different subjects. The example for Biology is included in the main text (Section 3.2). Each question below is presented in Italian, followed by its English translation and the correct answer index.</p>
      <sec id="sec-a-1">
        <title>A.1. General Culture Example</title>
        <p>Domanda: Quale delle seguenti è la negazione dell'enunciato "Tutti i bambini amano il gelato"?
Possibili risposte:
1. [Opzione 1]
2. [Opzione 2]
3. [Opzione 3]
4. [Opzione 4]
5. [Opzione 5]
(Risposta corretta: [Index for 'Almeno un bambino non ama il gelato' or similar])
Question (English Translation): Which of the following is the negation of the statement "All children love ice cream"?
Possible answers:
1. [Option 1]
2. [Option 2]
3. [Option 3]
4. [Option 4]
5. [Option 5]
(Correct Answer: [Index for 'At least one child does not love ice cream' or similar])</p>
      </sec>
      <sec id="sec-5-1">
        <title>A.2. Logic Example</title>
        <p>5. [Opzione 5]
(Risposta corretta: [Index for ’Almeno un
bambino non ama il gelato’ or similar])
Question (English Translation): Which of
the following is the negation of the statement
"All children love ice cream"?
Possible answers:
1. [Option 1]
2. [Option 2]
3. [Option 3]
4. [Option 4]
5. [Option 5]
(Correct Answer: [Index for ’At least one child
does not love ice cream’ or similar])</p>
        <sec id="sec-5-1-1">
          <title>Domanda: Se e solo se Giulia a luglio non va</title>
          <p>in vacanza in montagna, va poi in vacanza al
mare ad agosto. Giulia è andata sulle Dolomiti
a luglio, dunque non andrà ad agosto al mare.
Quale delle seguenti affermazioni segue la
stessa struttura logica del suddetto
ragionamento?
Possibili risposte:
1. Carolina, se acquista molte borse, spende
molti soldi. Carolina ha acquistato molte
borse, dunque ha speso molti soldi
2. Clotilde non va in motorino la sera tardi,
se piove. Stasera non ha piovuto, dunque è
andata in motorino
3. Elisa mangia le fragole a cena se e solo se a
pranzo non mangia albicocche. Ha già
mangiato albicocche a pranzo, dunque a cena non
mangia le fragole
4. Solo se Clara studia molto, supera gli esami.
Clara ha superato gli esami, dunque ha
studiato molto
5. Se Riccardo non gioca a calcio, non è in
forma per giocare a tennis. Riccardo non gioca
a tennis, dunque non ha giocato a calcio
(Risposta corretta: 3)
Question (English Translation): If and only
if Giulia does not go on holiday to the
mountains in July, she then goes on holiday to the
sea in August. Giulia went to the Dolomites
in July, therefore she will not go to the sea in
August. Which of the following statements
follows the same logical structure as the
reasoning above?
Possible answers:
1. Carolina, if she buys many bags, spends
a lot of money. Carolina bought many bags,
therefore she spent a lot of money
2. Clotilde does not ride her scooter late at
night if it rains. Tonight it did not rain,
therefore she went on her scooter
3. Elisa eats strawberries for dinner if and
only if she does not eat apricots for lunch.
She already ate apricots for lunch, therefore
she does not eat strawberries for dinner
4. Only if Clara studies hard, does she pass
the exams. Clara passed the exams, therefore
she studied hard
5. If Riccardo does not play football, he is not
fit to play tennis. Riccardo does not play
tennis, therefore he did not play football
(Correct Answer: 3)</p>
      </sec>
      <sec id="sec-5-2">
        <title>A.3. Physics Example</title>
        <sec id="sec-5-2-1">
          <title>Domanda: In quale sistema una tonnellata è</title>
          <p>un multiplo?
Possibili risposte:
1. Nel sistema delle dozzine
2. Nel sistema binario
3. Nel sistema esadecimale
4. Nel sistema decimale
5. Nessuna delle altre
(Risposta corretta: 4)
Question (English Translation): In which
system is a ton (tonne) a multiple?
Possible answers:
1. In the duodecimal system (base 12)
2. In the binary system
3. In the hexadecimal system
4. In the decimal system
5. None of the others
(Correct Answer: 4)</p>
      </sec>
      <sec id="sec-5-3">
        <title>A.4. Chemistry Example</title>
        <sec id="sec-5-3-1">
          <title>Domanda: A quante moli corrispondono 5</title>
          <p>mL (d=1,8 g· cm− 3) di un composto avente
una massa molare di 450 g· mol− 1?
Possibili risposte:
1. [Option 1 - e.g., 0.01 mol]
2. [Option 2 - e.g., 0.02 mol]
3. [Option 3 - e.g., 0.04 mol]
4. [Option 4 - e.g., 0.1 mol]
5. [Option 5 - e.g., 0.2 mol]
(Risposta corretta: [Index for 0.02 mol])
Question (English Translation): How many moles correspond to 5 mL (d = 1.8 g·cm⁻³) of a compound having a molar mass of 450 g·mol⁻¹?
Possible answers:
1. [Option 1]
2. [Option 2]
3. [Option 3]
4. [Option 4]
5. [Option 5]
(Correct Answer: [Index for 0.02 mol])</p>
      </sec>
      <sec id="sec-a-5">
        <title>A.5. Mathematics Example</title>
        <p>Domanda: Dati tre segmenti AA', BB' e CC' tali che: AA' = 2 cm, BB' = 1,5 * AA', CC' = 2,0 * BB'. Quale triangolo è possibile costruire con questi lati?
Possibili risposte:
1. Non è possibile costruire nessun triangolo
2. Un triangolo rettangolo
3. Un triangolo ottusangolo
4. Un triangolo scaleno
5. Un triangolo acutangolo
(Risposta corretta: 1)
Question (English Translation): Given three segments AA', BB', and CC' such that: AA' = 2 cm, BB' = 1.5 * AA', CC' = 2.0 * BB'. Which triangle is it possible to construct with these sides?
Possible answers:
1. It is not possible to construct any triangle
2. A right-angled triangle
3. An obtuse-angled triangle
4. A scalene triangle
5. An acute-angled triangle
(Correct Answer: 1)</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>B. Per-Subject Model Performance</title>
      <p>Full per-subject accuracy results for all evaluated models are reported in Table 4.</p>
    </sec>
    <sec id="sec-decl-ai">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini (Google) in order to paraphrase and reword text and improve writing style. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Proceedings</surname>
          </string-name>
          , Venice, Italy,
          <year>2023</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>119</lpage>
          . URL: ing Capability in LLMs via Reinforcement Learning,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          https://aclanthology.org/
          <year>2023</year>
          .clicit-
          <volume>1</volume>
          .15/.
          <year>2025</year>
          . URL: http://arxiv.org/abs/2501.12948. [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Altuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karunakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          , doi:10.48550/arXiv.2501.12948,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Speranza</surname>
          </string-name>
          , R. Zanoli, CLinKaRT at EVALITA arXiv:
          <volume>2501</volume>
          .12948 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>2023: Overview of the task on linking a lab re-</article-title>
          [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Groeneveld</surname>
          </string-name>
          , et al.,
          <source>OLMo: Accelerating the sci-</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          , M. San- V. Srikumar (Eds.),
          <source>Proceedings of the 62nd An-</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>ings of the 8th Evaluation Campaign of Natural Linguistics (Volume 1: Long Papers)</article-title>
          , Association
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>ian (EVALITA</source>
          <year>2023</year>
          ), Accademia University Press,
          <year>2024</year>
          , pp.
          <fpage>15789</fpage>
          -
          <lpage>15809</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Parma</surname>
          </string-name>
          , Italy,
          <year>2023</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>492</lpage>
          . URL: https:// org/2024.
          <article-title>acl-long</article-title>
          .
          <volume>841</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>ceur-ws.org/</source>
          Vol-
          <volume>3473</volume>
          /paper43.pdf.
          <article-title>acl-long</article-title>
          .
          <volume>841</volume>
          . [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Puccetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cassese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , The Invalsi Bench- [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Üstün</surname>
          </string-name>
          , et al.,
          <source>Aya Model: An</source>
          Instruc-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>cal understanding of Large Language Models in guage Model</source>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Italian</surname>
          </string-name>
          , in: O.
          <string-name>
            <surname>Rambow</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Wanner</surname>
          </string-name>
          , M. Apidi-
          <volume>2402</volume>
          .07827. doi:
          <volume>10</volume>
          .48550/arXiv.2402.07827,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>anaki</surname>
          </string-name>
          , H.
          <string-name>
            <surname>Al-Khalifa</surname>
            ,
            <given-names>B. D.</given-names>
          </string-name>
          <string-name>
            <surname>Eugenio</surname>
          </string-name>
          , S. Schockaert arXiv:
          <volume>2402</volume>
          .07827 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 31st International</source>
          Confer- [29]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
          </string-name>
          , S. Co-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          , Abu Dhabi,
          <string-name>
            <surname>UAE</surname>
          </string-name>
          ,
          <year>2025</year>
          , igli, Minerva LLMs:
          <article-title>The first family of large</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          pp.
          <fpage>6782</fpage>
          -
          <lpage>6797</lpage>
          . URL: https://aclanthology.org/
          <year>2025</year>
          .
          <article-title>language models trained from scratch on Italian</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          coling-main.
          <volume>453</volume>
          /. data, in: F.
          <string-name>
            <surname>Dell'Orletta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            , S. Montemagni, [19]
            <given-names>T. Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>formance of Language Models</source>
          ,
          <year>2021</year>
          . URL: http: it
          <year>2024</year>
          ), CEUR Workshop Proceedings, Pisa, Italy,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          //arxiv.org/abs/2102.09690. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          2102.09690, arXiv:
          <fpage>2102</fpage>
          .09690 [cs].
          <year>2024</year>
          .clicit-
          <volume>1</volume>
          .77/. [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Chain-</surname>
            of-Thought Prompting [30]
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Kwon</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Sheng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>C. H.</given-names>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          2023. URL: http://arxiv.org/abs/2201.11903.
          <article-title>ory management for large language model serving</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2201.11903,
          <string-name>
            <surname>with</surname>
            <given-names>pagedattention</given-names>
          </string-name>
          ,
          <source>in: Proceedings of the ACM</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>arXiv:2201</source>
          .11903 [cs].
          <source>SIGOPS 29th Symposium on Operating Systems</source>
          [21]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mirzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shahrokhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tuzel</surname>
          </string-name>
          , Principles,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>ing in large language models</source>
          ,
          <year>2024</year>
          . URL: https: A. Additional Question Examples
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          //arxiv.org/abs/2410.05229. arXiv:
          <volume>2410</volume>
          .
          <fpage>05229</fpage>
          . [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <source>Qwen2</source>
          .
          <article-title>5 technical report, This appendix provides additional examples of questions</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          2025. URL: https://arxiv.org/abs/2412.15115.
          <article-title>from the MedBench-IT dataset, illustrating diferent sub-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>arXiv:2412</source>
          .15115. jects.
          <article-title>The example for Biology is included in the main</article-title>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Gemma: Open Models Based on Gemini text (Subsection ??). Each question below is presented in</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Research</surname>
          </string-name>
          and Technology,
          <year>2024</year>
          . URL: http://arxiv. Italian, followed
          <article-title>by its English translation and the correct</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>org/abs/2403</source>
          .08295. doi:
          <volume>10</volume>
          .48550/arXiv.2403. answer index.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          08295, arXiv:
          <fpage>2403</fpage>
          .08295 [cs]. [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          , et al.,
          <source>The Llama 3 Herd of Models,</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>2024. URL: http://arxiv.org/abs/2407.21783. A.1. General Culture Example</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>doi:10</source>
          .48550/arXiv.2407.21783,
          <string-name>
            <surname>Domanda</surname>
            <given-names>:</given-names>
          </string-name>
          <article-title>Quale delle seguenti è la</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>arXiv:2407</source>
          .21783 [cs].
          <source>negazione dell'enunciato "Tutti i bambini</source>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          , et al.,
          <source>Phi-3 Technical Report: A amano il gelato”?</source>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Your</given-names>
            <surname>Phone</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: http://arxiv.org/abs/ 1. [Opzione 1]
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          2404.14219. doi:
          <volume>10</volume>
          .48550/arXiv.2404.
          <issue>14219</issue>
          ,
          <issue>2</issue>
          . [Opzione 2]
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>arXiv:2404.14219 [cs]. 3. [Opzione</source>
          <volume>3</volume>
          ] [26]
          <string-name>
            <surname>DeepSeek-AI</surname>
          </string-name>
          ,
          <article-title>DeepSeek-R1: Incentivizing Reason- 4. [Opzione 4] Domanda: Dati tre segmenti AA', BB' e CC' tali che: AA' = 2 cm, BB' = 1,5 * AA', CC' = 2,0 * BB'</article-title>
          .
          <article-title>Quale triangolo è possibile costruire con questi lati? Possibili risposte: 1. Non è possibile costruire nessun triangolo 2. Un triangolo rettangolo 3. Un triangolo ottusangolo 4. Un triangolo scaleno 5. Un triangolo acutangolo (Risposta corretta: 1) Question (English Translation): Given three segments AA', BB', and CC' such that: AA' = 2 cm, BB' = 1.5 * AA', CC' = 2.0 * BB'</article-title>
          .
          <article-title>Which triangle is possible to construct with these sides? Possible answers: 1. It is not possible to construct any triangle 2. A right-angled triangle 3. An obtuse-angled triangle 4. A scalene triangle 5. An acute-angled triangle (Correct Answer: 1)</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>