<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Doctor, Is That You? Evaluating Large Language Models on Italy's Medical School Entrance Exams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruben Piperno</string-name>
          <email>ruben.piperno@unicampus.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agnese Bonfigli</string-name>
          <email>agnese.bonfigli@unicampus.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <email>felice.dellorletta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leandro Pecchia</string-name>
          <email>leandro.pecchia@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Merone</string-name>
          <email>m.merone@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Bacco</string-name>
          <email>l.bacco@unicampus.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Policlinico Universitario Campus Bio-Medico</institution>
          ,
          <addr-line>Via Alvaro del Portillo 200, 00128 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ItaliaNLP Lab, Institute of Computational Linguistics “Antonio Zampolli”, National Research Council</institution>
          ,
          <addr-line>Via Giuseppe Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Research Unit of Intelligent Health-Technologies, Department of Engineering, Università Campus Bio-Medico di Roma</institution>
          ,
          <addr-line>Via Alvaro del Portillo 21, 00128 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of linguistic and cognitive tasks. This study investigates whether such models can succeed in one of Europe's most selective academic assessments: the Italian medical school entrance exam. We evaluate a wide selection of open-weights LLMs, ranging from natively Italian-pretrained models to multilingual and Italian-specialised variants, on a benchmark dataset comprising over 3,300 real-world exam questions across five knowledge domains. Our experiments systematically explore the impact of language-specific pretraining, model size, prompt formulation and instruction tuning on exam performance. Results show that large multilingual models, particularly the Gemma-2-9B family, consistently outperform all other systems, surpassing the official admission threshold under all prompting settings. In contrast, models trained exclusively on Italian data fail to reach this threshold, even with larger architectures or instruction tuning. Additional analyses reveal that high-performing models display lower positional bias and greater inter-model consistency. These findings suggest that cross-domain reasoning and multilingual pretraining are key to handling multi-disciplinary educational tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Italian Medical Admission Test</kwd>
        <kwd>Instruction Tuning</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>NLP in healthcare</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Italian medical school entrance exam is widely
regarded as one of the most competitive and demanding
standardized tests in Europe. Each year, approximately
60,000-65,000 aspiring students face this rigorous
assessment¹, which consists of 60 multiple-choice questions
spanning biology, chemistry, physics, mathematics, and
logical reasoning. Preparation typically begins as early
as the penultimate year of high school, with students
dedicating countless hours to theoretical study, targeted
quizzes, and full-length simulated exams. Despite this
intense effort, only a portion of students manage to be
included in the national ranking: for example, in 2019
only 42.7% achieved the minimum score, while in 2020
this rose to 68.3%². These figures highlight the exam's
reputation as a formidable educational hurdle and a
critical turning point in the academic lives of thousands of
ambitious young individuals.</p>
      <p>Against this backdrop, it is natural to ask what kind of
cognitive skill set is truly necessary to succeed in such
a highly selective process. Within this context, in an
era increasingly shaped by Artificial Intelligence (AI), a
provocative question arises:</p>
      <p>To date, could a powerful Large Language Model (LLM),
trained on vast data of human knowledge and capable of
performing complex reasoning tasks, actually achieve what
so many well-prepared students cannot? Could it earn a
high enough score to gain admission to an Italian medical
school?</p>
      <p>LLMs represent a significant paradigm shift within
Natural Language Processing (NLP), consistently
demonstrating exceptional performance across diverse
linguistic and cognitive tasks. Recent advancements have
illustrated that these models frequently match or exceed
traditional supervised methodologies and, in certain
instances, surpass established human benchmarks [1, 2].</p>
      <p>Complementary works in Italian have shown that
GPT-style models can reach near-human scores on the
national medical-specialty exam [3], introduced CLinkaRT
for clinical information extraction [4], and released
native large-scale benchmarks such as INVALSI-MATE/ITA
[5], Mult-IT [6] and the broader CALAMITA suite [7],
laying the groundwork for systematic Italian-language
evaluation.</p>
      <p>With proven capabilities in natural language
comprehension and logical reasoning, LLMs have exhibited
substantial potential in educational contexts, offering instant
personalized feedback, effectively summarizing intricate
information, and even simulating complex human-like
problem-solving processes.</p>
      <p>However, despite their strong capabilities, previous
studies have pointed out some limitations of LLMs. In
particular, these models can be very sensitive to small
changes in the prompt [8, 9]. One major issue is how
the arrangement of elements within the prompt affects
their performance, especially in tasks that require
understanding and reasoning. For example, prior research
has shown that LLMs are sensitive to both the specific
few-shot examples provided and the order in which
answer choices are presented [10, 11].</p>
      <p>In this work, our key contribution is an in-depth
analysis of how current LLMs, both Italian-specific and
multilingual, perform on the multiple-choice,
multi-disciplinary Italian medical school entrance exam,
investigating the following factors that may affect the
performance:</p>
      <p>Language-specific pre-training. We compare general
multilingual models, both with multilingual pre-training
and Italian specialization, and models specifically
pre-trained in Italian, to assess the role of language-specific
knowledge in a complex downstream task.</p>
      <p>Model size. We evaluate models of different sizes to
understand how parameter count influences performance.</p>
      <p>Prompt design. We explore the impact of prompt
formulation, including zero-shot vs. few-shot prompting, as
well as the effects of prompt length and specificity.</p>
      <p>Instruction tuning. We analyze how models that
underwent instruction tuning (training on datasets designed
to follow human-like task instructions) perform in
comparison to base LLMs when faced with exam-style
tasks.</p>
      <p>² Analysis of Medicine admission test scores.</p>
      <sec id="sec-1-1">
        <p>Content and scale. The benchmark consists of 3,301
high-quality items covering five domains (Table 1). Each
item includes a question text (or stem) along with five
multiple-choice answers, only one of which is correct.
This structure supports two task formulations: a
classification task, when the question is presented with the
answer options, and a generation task, when only the
question is provided and the model is expected to
produce the correct answer. In our experiments, we adopt
the classification setting, supplying both the question
and the five candidate answers to the model.</p>
        <p>Scoring Scheme. Each item is graded individually and
then aggregated through a three-stage pipeline.</p>
        <p>Per-Item Mark. A correct answer yields +1.5 points,
an omission 0, and an incorrect answer -0.4. Negative
marking discourages guessing and keeps the expected
value of random choice below zero.</p>
        <p>Per-Domain Average. Let $s_{i,d}$ be the mark obtained
on the $i$-th question of domain $d \in \{\text{bio}, \text{chem}, \dots\}$
and $N_d$ the number of items in that domain (Table 1).
The mean score for the domain is
$\bar{s}_d = \frac{1}{N_d} \sum_{i=1}^{N_d} s_{i,d} \in [-0.4, 1.5]$. (1)</p>
        <p>Weighted Aggregation. Since domains contribute
unequally to the final mark, mirroring both the weighting
and question distribution of the actual exam, we adopt
the official weights $w_d$ shown in Table 1 to compute the
overall average per item:
$\bar{s} = \sum_{d} w_d \, \bar{s}_d \in [-0.4, 1.5]$. (2)</p>
        <p>Finally, the average is rescaled to the admission-test
scale of $[-24, 90]$ by
$S = 60 \, \bar{s}$. (3)</p>
      </sec>
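      <p>The three-stage scoring pipeline can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the domain names and weights below are hypothetical placeholders, not the official Table 1 values.</p>
      <preformat>
```python
# Sketch of the three-stage scoring pipeline of Eqs. (1)-(3).
# NOTE: the domain weights are illustrative placeholders,
# NOT the official Table 1 values.

def per_item_mark(predicted, gold):
    """+1.5 for a correct answer, 0 for an omission, -0.4 otherwise."""
    if predicted is None:
        return 0.0
    return 1.5 if predicted == gold else -0.4

def domain_mean(marks):
    """Eq. (1): average per-item mark within one domain."""
    return sum(marks) / len(marks)

def final_score(marks_by_domain, weights):
    """Eqs. (2)-(3): weighted per-item average rescaled to the [-24, 90] scale."""
    s_bar = sum(weights[d] * domain_mean(marks_by_domain[d]) for d in weights)
    return 60 * s_bar

weights = {"bio": 0.4, "chem": 0.3, "logic": 0.3}  # hypothetical, summing to 1
all_correct = {d: [1.5, 1.5] for d in weights}
all_wrong = {d: [-0.4, -0.4] for d in weights}
print(round(final_score(all_correct, weights), 6))  # 90.0
print(round(final_score(all_wrong, weights), 6))    # -24.0
```
      </preformat>
      <p>Omissions score 0, so the expected mark of blind guessing stays negative by design, as the scoring scheme intends.</p>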
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The employed corpus³ consists of the official Italian
medical school entrance exams administered in past years,
collected from the public archive of the Ministry of
Education, University and Research (MIUR)⁴. As such, it
faithfully reproduces the exact wording, structure, and
difficulty level encountered by real candidates.</p>
      <sec id="sec-2-1">
        <p>³ https://huggingface.co/datasets/room-b007/test-medicina</p>
        <p>⁴ https://www.miur.gov.it</p>
        <p>Table 1: number of questions and distribution per
domain (Biology; Chemistry; Mathematics &amp; Physics;
Logic &amp; Reasoning; General Knowledge; Total). The
per-domain counts were not recovered from the source.</p>
        <p>Hence a model (or a student) that answers everything
correctly attains $S_{\max} = 90$, whereas one that is wrong
on every question falls to $S_{\min} = -24$. Conversely, a
purely random guesser (i.e., one that selects an answer
uniformly at random and is therefore correct with
probability 1/5) has an expected per-item score of
$\bar{s} = \tfrac{1}{5} \cdot 1.5 + \tfrac{4}{5} \cdot (-0.4) = -0.02$,
leading to an overall expected mark of
$S_{\text{rand}} = 60 \times \bar{s} \approx -1.2$.</p>
        <p>According to the official admission rules, only
candidates who score at least 20 out of 90 are included in the
national ranking. This threshold is fixed each year and
represents the minimum requirement for consideration,
although substantially higher scores are typically needed
to secure a study place.</p>
      </sec>
      <sec id="sec-2-2">
        <p>Selected Models. Table 2 lists every model considered
in our experiments, organized by pre-training origin
(Italian vs. multilingual) and instruction-tuning status. Each
entry reports parameter count, original paper (if any)
and the Hugging Face identifier. This curated pool
encompasses a wide range of model scales, pre-training
strategies, instruction-tuning variants and backbone
architectures, enabling us to rigorously evaluate how these
factors affect each model's ability to tackle the Italian
medical-school entrance test.</p>
      </sec>
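      <p>The expected value of random guessing quoted above is easy to verify numerically; the snippet below is a standalone check, not code from the paper.</p>
      <preformat>
```python
# Expected per-item mark of a uniform random guesser over five options:
# correct with probability 1/5 (+1.5), wrong with probability 4/5 (-0.4).
expected_mark = (1 / 5) * 1.5 + (4 / 5) * (-0.4)
expected_final = 60 * expected_mark  # rescaled to the [-24, 90] admission scale

print(round(expected_mark, 2))   # -0.02
print(round(expected_final, 2))  # -1.2
```
      </preformat>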
    </sec>
    <sec id="sec-3">
      <title>3. Large Language Models</title>
      <sec id="sec-3-1">
        <p>⁵ https://huggingface.co/spaces/evalitahf/evalita_llm_leaderboard</p>
      </sec>
      <sec id="sec-3-2">
        <title>Data Leakage</title>
        <p>To the best of our knowledge, none of the questions
included in the dataset were seen during the pre-training
or fine-tuning of the evaluated models. The official model
cards and papers explicitly exclude proprietary
multiple-choice exam content, including the MIUR admission
tests. While we cannot entirely rule out the possibility of
indirect exposure (e.g., paraphrased content shared in
online forums), we consider the risk of such leakage to
be minimal.</p>
      </sec>
      <sec id="sec-3-3">
        <p>Recent progress in open-weights LLMs has produced
Italian-centric and Italian-specialised systems that still
outperform much larger multilingual baselines on the
EvalITA benchmark⁵ [12]. In this study, we select from
the EvalITA leaderboard the top-performing models with
fewer than or equal to 9B parameters, balancing
state-of-the-art performance and computational feasibility, and
we supplement them with four Italian-specialist models
(DanteLLM [13], Cerbero [14], Loquace, Zefiro [15, 16])
that satisfy the same parameter budget but were not
submitted to the leaderboard. This guarantees architectural
diversity (LLaMA and Mistral families) while maintaining
computational feasibility.</p>
        <sec id="sec-3-3-1">
          <title>4. Experiments</title>
          <sec id="sec-3-3-1-1">
            <title>4.1. Experimental Setup</title>
            <p>All experiments are performed on the dataset described
in Section 2 and the models detailed in Section 3. No
parameter is updated at any point: every model is used
solely in inference mode. Unless otherwise specified in
the original checkpoint, all models are queried with their
default generation parameters (temperature = 1.0, top_p
= 1.0, top_k = 50, repetition_penalty = 1.0); no
hyperparameter tuning is performed.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Selection Criteria</title>
        <p>Models were selected to facilitate the analysis of the
factors outlined in Section 1, while maintaining a constant
computational budget. The selection criteria are
summarised below.</p>
        <p>Language of Pre-Training. We included (i) purely-Italian
LLMs trained from scratch on Italian corpora, (ii)
multilingual models that were later specialised to Italian,
and (iii) non-specialised multilingual models.</p>
        <p>Model Size (Scaling). Families of LLMs offering several
sizes in the 0.35B-9B range, allowing us to gauge the
effect of scale while holding architecture and linguistic
coverage constant.</p>
        <p>Instruction Tuning. Whenever a base and an
instruction-tuned (or DPO-tuned) variant coexist, we included
both.</p>
        <p>Architectural Diversity. We cover the three dominant
open-weights backbones available with an Italian
specialisation under 9B parameters: LLaMA / Gemma /
Mistral [17, 18, 19].</p>
        <p>Table 2: base architecture, parameter count,
instruction-tuning status, checkpoint and reference for each
evaluated model. The table contents were not recovered from
the source.</p>
        <p>Few-Shot Selection. For each topic of the dataset we
randomly sample exactly two in-context demonstrations.
These demonstrations are fixed once and reused across
all models, prompts, and runs. In the zero-shot setting
the demonstrations are omitted, while in the few-shot
setting they are inserted directly into the prompt as fixed
in-context examples.</p>
        <p>Prompting Strategies. Instruction-tuned (IT)
checkpoints are queried under two conditions: plain, where the
prompt text in Table 3 is provided as a single user
message, identical to the one used for base models; and
chat-template, where the same text is embedded in the
model's native chat schema via
tokenizer.apply_chat_template.</p>
        <p>Hardware and Precision. All runs are executed on a
single NVIDIA A100 80GB GPU, with torch.float16
weights.</p>
        <p>Evaluation Metrics. Model performance is assessed
with four complementary metrics:
(i) the overall score $S$ is computed by first averaging the
per-item marks using the official domain weights $w_d$
(Table 1) to obtain a weighted score $\bar{s} \in [-0.4, 1.5]$, and
then applying the linear rescaling $S = 60 \cdot \bar{s}$, which
maps the result to the standard entrance-exam range
$[-24, 90]$ expressed in sixtieths, as explained in Section 2.
Since our setup assumes that the model always selects
an answer among the given options, we do not consider
the possibility of no response; consequently, each item
is scored either +1.5 for a correct answer or -0.4 for an
incorrect one.
(ii) The per-topic score $S_d$ reports the same quantity
computed separately for each domain (Biology, Chemistry,
Mathematics &amp; Physics, Logic, General Knowledge).
(iii) The overall macro-averaged F1 aggregates precision
and recall uniformly across the five answer classes,
making it robust to the pronounced class imbalance of the
dataset, as shown in Table 1.
(iv) The per-topic macro-averaged F1 applies the same
statistic within each domain, highlighting areas where a
model may be disproportionately strong or weak despite
similar global performance.</p>
        <sec id="sec-3-4-1">
          <title>4.2. Prompt Design</title>
          <p>The study adopts three system prompts that differ
systematically in both length and semantic richness,
allowing us to examine how sensitive each model is to the
amount of contextual information it receives before
attempting the task. The three system prompts are
presented in Appendix A.</p>
          <p>P1 is an ultra-minimal template that provides nothing
more than the formal task instruction: the model is told
that it will face a five-option multiple-choice question
and must output only the index of the correct answer. It
contains no role play, no mention of the entrance exam,
and no hint about the underlying knowledge domains.
This prompt therefore functions as a lower bound on
instruction length.</p>
          <p>P2 retains the same output constraint but introduces
a concise role play: the model is asked to "simulate a
candidate who has studied intensively for the Italian
medical admission test". This framing injects moderate
priming about the exam context and about the desired
mindset (efficiency and accuracy) while remaining
compact.</p>
          <p>P3 is the most verbose instruction. It explicitly lists six
knowledge areas (Logic, Biology, Chemistry, Mathematics,
Physics, and General Culture), thereby grounding the
task in the domains required by the real-world exam. The
prompt also reiterates the number-only policy in boldface
to maximise compliance.</p>
          <p>Importantly, all three prompts prescribe the identical
answer format: a single digit in {1, ..., 5} with no
accompanying text or explanation. Consequently, any
variation in performance, positional bias, or inter-model
agreement can be attributed to the incremental context
rather than to differences in expected output style.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <sec id="sec-3-5-1">
          <title>4.3. Qualitative Analysis</title>
        </sec>
      </sec>
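      <p>The classification setting and the number-only output policy can be illustrated with a small helper that renders an item into a minimal P1-style prompt and parses the single-digit reply. The wording and helper names are illustrative, not the exact templates of Appendix A; for instruction-tuned checkpoints the same string would additionally be wrapped with the tokenizer's native chat schema via tokenizer.apply_chat_template.</p>
      <preformat>
```python
import re

def build_prompt(question, options):
    """Render one exam item in the classification setting: the stem plus
    five numbered options, asking only for the index of the correct answer.
    Illustrative wording, not the exact system prompts of Appendix A."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    return (
        "Answer the following multiple-choice question. "
        "Reply with only the number (1-5) of the correct option.\n\n"
        f"{question}\n{numbered}\nAnswer:"
    )

def parse_answer(generation):
    """Extract the first digit in 1-5 from the model output; None if absent."""
    match = re.search(r"[1-5]", generation)
    return int(match.group()) if match else None

prompt = build_prompt(
    "Which organelle produces most of the cell's ATP?",
    ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus", "Lysosome"],
)
print(parse_answer("The correct option is 2."))  # 2
```
      </preformat>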
      <sec id="sec-3-6">
        <p>We complement the quantitative evaluation with a
qualitative analysis aimed at assessing the robustness and
behavioural patterns of the tested models.</p>
        <p>First, we analyse positional bias, i.e., the tendency of
a model to overproduce certain answer indices (e.g., "1"
or "3") regardless of the question. For each model and
prompt, we compute how frequently each option (1-5) is
selected. A uniform distribution would indicate an
unbiased decision process, whereas strong deviations suggest
systematic preferences unrelated to content [21].</p>
        <p>Figure panels: (a) Pretrained natively in Italian - F1;
(b) Multilingual specialised in Italian - F1; (c)
Non-specialised multilingual - F1; (d) Pretrained natively in
Italian - Final score; (e) Multilingual specialised in Italian -
Final score; (f) Non-specialised multilingual - Final score.</p>
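        <p>The positional-bias measurement reduces to a frequency count over predicted option indices. A minimal sketch, illustrative rather than the authors' implementation:</p>
        <preformat>
```python
from collections import Counter

def option_frequencies(predictions, n_options=5):
    """Fraction of predictions falling on each option slot 1..n_options.
    With five options, values near 0.2 indicate an unbiased decision
    process; large deviations signal a positional shortcut."""
    counts = Counter(predictions)
    total = len(predictions)
    return {opt: counts.get(opt, 0) / total for opt in range(1, n_options + 1)}

# A model that always answers "2" shows an extreme positional shortcut:
biased = option_frequencies([2] * 10)
print(biased[2])  # 1.0

# A balanced model stays near the 0.2 uniform baseline:
balanced = option_frequencies([1, 2, 3, 4, 5] * 4)
print(balanced[3])  # 0.2
```
        </preformat>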
        <p>Second, we investigate inter-model agreement to assess
how similarly different models behave when prompted in
the same way. For each prompt and setup, we compare
the predicted answers across all model pairs and measure
the percentage of matching responses. This reveals which
models tend to converge on the same decisions and thus
behave similarly, and which ones diverge more often.</p>
        <p>Together, these two analyses provide insight into the
internal consistency of each model and the structural
similarity between them.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results and Discussion</title>
      <sec id="sec-4-1">
        <p>In this section, we present and analyse the performance
of all evaluated models based on two key metrics:
macro-averaged F1 score and final admission score (Figure 1).
The reported values are computed by averaging results
across three distinct prompt formulations, as we observed
a high degree of consistency across prompts for both
metrics.</p>
      </sec>
      <sec id="sec-4-2">
        <p>The analysis is structured around four main factors
hypothesized to influence model performance:
language-specific pre-training, model size, prompt design, and
instruction tuning.</p>
        <p>Language-Specific Pre-Training. The results highlight
a clear stratification based on language specialization.
Non-specialised multilingual models, particularly
gemma-2-9b-it and gemma-2-9b, consistently outperform
other classes, achieving the highest F1 scores (≈ 74-76%)
and final scores (≈ 58-60) across all settings. Notably,
both models exceed the admission threshold of 20 in
every configuration.</p>
        <p>Model Size. Across model groups defined by
pre-training language origin, increasing model size generally
correlates with improved performance, with only a few
exceptions. In the Gemma series, for instance, the 9B
models (Gemma-2-9b and Gemma-2-9b-it) significantly
outperform their 2B counterparts, particularly in terms
of F1 score. The difference is striking: Gemma-2-9b-it
achieves 74% F1 in zero-shot settings, while Gemma-2-2b-it
remains below 50%. This scaling effect, however, proves
less predictable among models trained natively on Italian
corpora or tailored to Italian. Within the Minerva family,
performance increases modestly from 350M to 7B, though
overall results remain limited. Moreover, Minerva-7B-instruct
shows no substantial advantage over Minerva-3B-base,
and Loquace-7B-Mistral underperforms relative
to Cerbero-7B, despite similar model architecture and
parameter count. Overall, larger models tend to perform
better, but these results suggest that size must be
combined with effective training objectives and data coverage
to yield consistent gains.</p>
        <p>Prompt-Template Comparison. Figure 1 reports the
mean F1-score averaged across the three prompt
templates; for nearly all models, the whiskers are tightly
clustered, reflecting how little the specific wording shifts
the central tendency. Only a few isolated exceptions
emerge (e.g., Zefiro underperforms with P2, Cerbero
shows higher variance in the FS IT setting, and gemma-2-2b
displays slight sensitivity to prompt verbosity). When
runs are examined separately, however, a small yet
consistent ranking emerges: the minimalist P1 systematically
attains the highest scores, the verbose P3 lands in the
middle, and P2 is invariably the weakest. Although the
gap is only about 1-2 F1 points, its persistence across
the entire model suite indicates that concise phrasing
reduces ambiguity, whereas the intermediate framing of
P2 introduces just enough noise to dampen performance.</p>
        <p>Prompt Design. Prompt formulation plays a significant
role in modulating model output. We evaluated
instruction-tuned models under four prompting conditions:
zero-shot (ZS), zero-shot with instruction-tuned
formatting (ZS IT), few-shot (FS), and few-shot with
instruction-tuned formatting (FS IT). All other
non-instruction models were tested only in the ZS and FS
settings.</p>
        <p>Overall, few-shot prompting leads to improved F1
scores compared to zero-shot, particularly for mid-tier
models such as DanteLLM and Cerbero, which show
gains of approximately 5-10 points in F1. In contrast,
high-performing models like gemma-2-9b-it achieve
strong results even in zero-shot settings, indicating
robustness to minimal context and reduced reliance on
explicit examples.</p>
        <p>Interestingly, zero-shot with instruction-tuned
formatting often performs comparably to few-shot, especially
for models with strong instruction-following capabilities.
However, adding instructions to few-shot prompts does
not consistently improve performance; for instance,
Zefiro and Loquace exhibit a decline in F1 score compared
to the few-shot setting without instructions, likely due
to prompt verbosity introducing cognitive overload or
disrupting the model's internal heuristics [22, 23]. These
findings reinforce prior work on large language model
sensitivity to prompt phrasing and structure [8, 9], and
underscore the need for carefully tuned prompt
engineering, particularly in lower-resource or lower-capacity
models.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Instruction Tuning</title>
        <p>Instruction tuning provides consistent improvements
across different model families. For example, the
instruction-tuned gemma-2-2b-it outperforms its base
counterpart, gemma-2-2b, by more than 20 F1-score
percentage points across all prompting conditions. Similar
gains are observed for loquace-7b-mistral over the
untuned loquace-7b, and for minerva-7b-instruct compared
to minerva-7b-base. The impact of instruction tuning is
particularly pronounced in smaller models. While the
performance gap between gemma-2-9b and gemma-2-9b-it
remains modest (typically around 2-3 F1-score
percentage points), tuning significantly enhances the usability
of smaller variants, suggesting that instruction tuning
complements model scaling and is especially valuable in
resource-constrained contexts [24]. Nevertheless,
instruction tuning alone is not sufficient to ensure competitive
performance. Models such as zefiro-7b-dpo-ita and
italia-9b-instruct, despite being instruction-tuned, still
underperform relative to top-tier generalist models. This
underscores the importance of tuning quality and alignment
with the target domain. Interestingly, instruction tuning
appears to be most effective in the zero-shot setting,
likely by helping the model better align with the intent of
the prompt. However, when combined with few-shot
exemplars, it can sometimes introduce redundancy or
ambiguity, potentially hindering performance.</p>
        <sec id="sec-4-3-1">
          <title>5.1. Per-Domain Performance</title>
          <p>To complement the aggregate metrics discussed above,
we conducted a topic-wise analysis of model performance,
reporting final admission scores separately for each
discipline in the entrance exam. This additional evaluation
aims to reveal domain-specific strengths and weaknesses
that may be masked by overall scores, and to better
understand how different model families handle the
heterogeneous cognitive demands of the test.</p>
          <p>For consistency, we selected the best-performing model
within each family, prioritizing the few-shot setting
whenever it led to superior results. The only exception is the
family of non-specialised multilingual models, where the
best performance was achieved in the zero-shot condition,
though this setting proved competitively robust, even
relative to few-shot prompting. The selected models are:
minerva-7b-instruct-v1.0 (natively Italian-pretrained
family), Cerbero-7b (Italian-tuned multilingual family) and
gemma-2-9b-it (non-specialised multilingual family).</p>
          <p>Given the consistency across prompts, we report results
obtained with Prompt 3, which corresponds to the most
verbose instruction. The results, summarized in Figure 2,
show that gemma-2-9b-it achieves the highest final
admission scores across all five disciplines, with particularly
strong margins in Biology and Knowledge &amp; Skills.
Cerbero-7b displays moderate performance overall but
remains consistently below Gemma, with its best result
also in Biology. Minerva-7b-instruct, despite instruction
tuning, obtains markedly lower scores across the board,
with final admission scores that remain below 40% in all
subjects. The relative ranking of the models remains
stable across domains, suggesting that global performance
differences persist even when decomposed by topic.</p>
          <p>Interestingly, all models achieve their highest marks
in Biology and General Knowledge, two domains that
largely reward factual recall, the ability to retrieve
canonical facts memorised during pre-training (e.g.,
"mitochondria produce ATP") [25]. In sharp contrast,
Mathematics &amp; Physics and Logic &amp; Reasoning are
consistently the hardest areas, even for the best-performing
Gemma checkpoint, because they demand multi-step
quantitative or set-theoretic reasoning that current LLMs
still struggle to perform reliably [26, 27]. Recent work
further shows that simply scaling up parameters does not
bridge this gap: effective reasoning requires mechanisms
that disentangle memory retrieval from inference, rather
than larger parametric memory alone [28].</p>
          <p>The discipline-level analysis confirms the trends
observed in the global scores, underscoring the persistent
gap between non-specialized multilingual models and
those trained exclusively on Italian data. These results
highlight that cross-domain generalization remains a
critical differentiator among models. They also reveal that
even high-performing systems can display significant
weaknesses in specific domains, an important
consideration for real-world applications. Overall, the findings
emphasize the crucial role of both model scale and
pre-training diversity in developing LLMs with strong
multidisciplinary capabilities.</p>
        </sec>
        <p>ens, yet does not eliminate, the tendency to latch onto
a preferred position.</p>
        <p>General Multilingual Models. General multilingual
model scores, shown in green, cluster close to the 20%
baseline expected from random choice, with no extreme
outliers. These models appear to read the answers rather
than the position, and they also lead our quantitative
table, hinting at a link between genuine understanding
and low positional bias.</p>
        <p>Crosses mark the best model in each family:
Minerva-7B-instr (blue), Cerbero-7B (red) and Gemma-2-9B-it
(purple). Gemma and Cerbero stay comfortably inside
their inter-quartile bands, whereas Minerva still predicts
about 40% of its answers as label 2, illustrating that even
the best native-Italian model has some residual bias.</p>
        <p>Taken together, the figure draws a clear line: positional
bias is most pronounced in smaller, language-specific
models, softens with targeted fine-tuning, and is almost
absent in large multilingual LLMs. The trend mirrors
overall performance, suggesting that as models learn to
solve the task they naturally stop relying on positional
shortcuts. Monitoring this bias might offer a quick,
model-agnostic check on whether apparent gains stem from
real comprehension or from gaming the answer format.
Concrete examples of typical model errors, including
failures in numerical reasoning and logical minimization,
are provided in Appendix B.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Inter-Model Agreement To gauge how closely the</title>
        <p>models behave, we compute for every pair the percentage
of identical predictions on Prompt 3 and visualise these
overlaps in Figure 4.7</p>
        <p>General Overlap. Figure 4 reveals two compact
blocks of high agreement. The first appears as a
compact central block along the diagonal and involves the
Minerva family: the four base checkpoints (1B, 350M,
3B, 7B) plus the instruction-tuned variant share ≥ 60%
5.2. Qualitative Analysis identical answers, well above the ≈ 35% background
Positional Bias. For every model we counted how its level observed between unrelated models, and, in line
answers are distributed across the five option slots : the with the positional-bias analysis, this consensus largely
resulting percentages make up the box-plots in Figure 36. reflects their tendency to pick the same (often incorrect)</p>
        <p>Native-Italian Models The native-Italian models, option. Interestingly, scaling Minerva from 350 M to 7 B
cyan boxes, peak around 70% on option 2, and two sys- parameters does little to break this uniformity: the 3 B
tems select it in every single question. Such consistency 7 B pair overlaps by ≈ 65%, only marginally higher than
betrays a positional shortcut: the model “trusts” the sec- the 350 M - 1 B pair (≈ 61%), suggesting that increased
ond slot more than the content it contains. capacity amplifies the same bias instead of diversifying</p>
        <p>Italian-Specialised Multilingual Models Italian- behaviours.
specialised multilingual models, presented with the or- The second block, smaller but denser, occupies the
ange distributions, still favour label 2, but the median upper-left portion of the diagonal and links Gemma-2
drops to roughly 45% and the whiskers now range from 9B with its instruction-tuned sibling (Gemma-2 9B-it).
≈ 25% to 90%. Extra Italian supervision therefore weak- Their overlap exceeds 75%; unlike Minerva, they agree
mostly on correct answers, underscoring their stronger</p>
      </sec>
      <sec id="sec-4-5">
        <title>6Shown for Prompt 3, the richest prompt; Prompts 1 and 2 lead to</title>
        <p>the same qualitative picture.</p>
      </sec>
      <sec id="sec-4-6">
        <title>7Prompts 1 and 2 show the same qualitative pattern.</title>
        <p>underlying capability. A size efect is evident here too: 75%, comfortably above the oficial ranking threshold of
Gemma-2 2B and its instruction-tuned counterpart align 20. A handful of Italian-tuned multilingual checkpoints
at ≈ 55%, noticeably lower than the 9 B pair, hinting (e.g. Cerbero-7B) also edged past the cut-of in favourable
that larger multilingual backbones converge toward more prompting conditions, whereas every natively Italian
stable (and more accurate) decision patterns. model remained well below it.</p>
        <p>Between these two extremes sit the multilingual models Detailed error analysis confirms that genuine
reasonspecialised in Italian, such as Cerbero-7B, DanteLLM- ing remains an open challenge. Even top models stumble
7B, and LLaMantino-2 7B. They form a looser band of on Logic and on Mathematics &amp; Physics and display
mid-level agreement (45-60%), often acting as a bridge: residual positional shortcuts, signalling reliance on
surthey overlap moderately with Gemma while retaining face cues rather than deep understanding. Bridging this
some afinity with native-Italian systems. The pattern gap will demand progress in numerical and deductive
mirrors their performance table these models outperform reasoning, stronger defences against prompt variability,
Minerva yet trail the Gemma large pair, indicating that and tighter integration with external tools and retrieval.
Italian-specific fine-tuning narrows the gap without fully In future work, we plan to extend the evaluation to a
matching the breadth of a high-capacity multilingual cloze-style, open-ended generation setting, where models
pre-training. must produce the correct answer without being shown</p>
        <p>Outside the highlighted blocks agreement drops the five multiple-choice options. This format may ofer a
sharply, especially between native-Italian and general more faithful assessment of their reasoning abilities and
multilingual systems, supporting the idea that language- reduce positional biases. The dataset is already formatted
specific pre-training steers models toward distinct deci- to support this task. However, given that only a subset
sion patterns. of LLMs currently achieves suficient performance in the</p>
        <p>Topic-Wise Agreement (see Appendix C). Topic- classification setting, such a shift could pose an even
specific heat-maps paint a similar picture with nuanced greater challenge. In addition, we plan to carry out a
shifts: systematic exploration of decoding strategies and
hyper</p>
        <p>Biology and Chemistry closely reflect the global pat- parameters to quantify how sensitive exam performance
tern: Minerva models cluster tightly, while Gemma leads and answer stability are to these settings. Such ablations
a smaller high-accuracy duo, confirming that factual dis- might provide deeper insights into model robustness and
ciplines accentuate family-specific biases. optimal inference configurations.</p>
        <p>In Logic &amp; Reasoning, the Minerva block tightens
even further, with overlaps reaching ≥ 70%, implying
that reasoning errors are strongly correlated across those Acknowledgments
checkpoints.</p>
        <p>Mathematics &amp; Physics show the widest dispersion: Authors were supported by two projects: 1) the European
cross-family overlaps fall below 40% for most pairs, sug- Union under the Horizon Europe Programme through
gesting numerical items provoke model-specific heuris- the Innovative Health Initiative Joint Undertaking (IHI
tics rather than common patterns. JU) – Project GRACE (Project number: 101194778, Project</p>
        <p>General Knowledge falls in between, exhibiting mod- name: bridGing gaps in caRdiAC health managEment).
erate agreement across the board. 2) the European Union - Next Generation EU - NRRP</p>
        <p>Altogether, these observations confirm the main find- M6C2 - Investment 2.1 Enhancement and strengthening
ing: models that share pre-training data and objectives of biomedical research in the NHS - Project
PNRR-MR1tend to converge on the same answers while larger, 2022-12376635 - ”Early Detection of Rare Inherited
Retibroadly-trained multilingual baselines remain both accu- nal Dystrophies and Cardiac Amyloidosis enhanced by
rate and mutually consistent. Model size amplifies these Artificial Intelligence: the impact on the patient’s
pathtrends, and Italian-specialised multilingual checkpoints way in Campania Region” (CUP: C83C22001540007)
occupy an intermediate space, benefiting from targeted
ifne-tuning yet still trailing the strongest generalist pair.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <sec id="sec-5-1">
        <title>Large multilingual LLMs have begun to clear the Italian</title>
        <p>medical-school admission bar, but they are still far from
matching the level reached by human examinees. On
the 3 301-question benchmark, the 9-billion-parameter
Gemma-2 family scored 58-60 / 90 with macro-1 around
A. Prompts</p>
        <p>Logic &amp; Reasoning
Italian
I partecipanti a una gara di corsa sono 150, di cui 98
maschi, 120 biondi e 90 destrorsi. Qual è il numero minimo
di maschi, biondi e destrorsi che partecipano alla gara?
English
There are 150 participants in a running race: 98 are
male, 120 are blond, and 90 are right-handed. What is
the minimum possible number of participants who are
simultaneously male, blond, and right-handed?
Options: (A) 8 (B) 10 (C) 20 (D) 12 (E) 18
Correct answer: (A) 8
gemma-2-9B-it answer: (B) 10</p>
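        <p>For the item above, the correct answer follows from inclusion-exclusion: the triple intersection of subsets of sizes 98, 120 and 90 within 150 participants contains at least 98 + 120 + 90 − 2·150 = 8 elements, and the bound is attainable. A minimal sketch of this check (the function name is ours, for illustration only):</p>

```python
def min_triple_overlap(a, b, c, n):
    """Minimum possible size of the intersection of three subsets of
    sizes a, b and c drawn from a population of n. By inclusion-exclusion,
    at most (n-a) + (n-b) + (n-c) elements can miss at least one set."""
    return max(0, a + b + c - 2 * n)

# The race item: 98 male, 120 blond, 90 right-handed among 150 participants.
print(min_triple_overlap(98, 120, 90, 150))  # prints 8, i.e. option (A)
```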
      </sec>
      <sec id="sec-5-2">
        <title>A. Prompts</title>
        <p>These snapshots show two common failure modes: a numerical miscalculation in the geometry item and an overestimation in the logical minimization problem, both of which may be influenced by the model’s observed positional bias towards option (B), as shown in Figure 3.</p>
        <p>P1. Domanda con 5 risposte (0-4). Scrivi solo il numero della risposta corretta. Nessun altro output.</p>
        <p>P2. Simula il comportamento di un candidato che ha studiato intensamente per il test di Medicina in Italia. Per ciascuna domanda a risposta multipla con opzioni da 0 a 4, indica solo il numero della risposta che ritieni corretta. Non fornire alcuna motivazione. Concentrati sull’efficienza e sulla precisione.</p>
        <p>P3. Sei uno studente italiano con solide competenze in Logica e Ragionamento, Biologia, Chimica, Matematica, Fisica e Cultura Generale. Stai sostenendo il test di ingresso alla facoltà di Medicina. Ti verrà fornita una domanda a risposta multipla seguita da cinque opzioni numerate da 0 a 4. Il tuo compito è indicare esclusivamente il numero (0-4) corrispondente all’alternativa corretta. Non fornire spiegazioni.</p>
        <p>English (translation)
P1. Question with 5 answers (0-4). Write only the number of the correct answer. No other output.</p>
        <p>P2. Simulate the behaviour of a candidate who has studied extensively for the Italian Medical School admission test. For each multiple-choice question with options 0-4, output only the number of the option you believe is correct. Provide no justification. Focus on efficiency and accuracy.</p>
        <p>P3. You are an Italian student with strong skills in Logic and Reasoning, Biology, Chemistry, Mathematics, Physics, and General Culture. You are taking the entrance exam for the Faculty of Medicine. You will be given a multiple-choice question followed by five options numbered 0 to 4. Your task is to output only the number (0-4) corresponding to the correct option. Do not provide any explanation.</p>
        <p>[Figure panels: (a) Biology, (b) Chemistry, (c) General Knowledge, (d) Mathematics &amp; Physics, (e) Logic &amp; Reasoning.]</p>
      </sec>
    </sec>
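<p>The positional-bias statistics discussed in the qualitative analysis reduce to counting, for each model, how often every option slot is predicted. A minimal sketch (names are ours, not the paper's code; labels follow the paper's 0-4 scheme):</p>

```python
from collections import Counter

def slot_distribution(predictions, n_options=5):
    """Percentage of a model's answers falling in each option slot (0-4)."""
    counts = Counter(predictions)
    return [100.0 * counts.get(slot, 0) / len(predictions)
            for slot in range(n_options)]

# A model that always outputs label 2 exhibits a pure positional shortcut.
print(slot_distribution([2, 2, 2, 2]))  # prints [0.0, 0.0, 100.0, 0.0, 0.0]
```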
    <sec id="sec-6">
      <title>C. Heat-maps of Model Agreement</title>
      <p>Across families, models rarely exceed 40% overlap. Still, domain-specific traits emerge: Logic &amp; Reasoning shows high Minerva coherence (≥ 70%), suggesting shared shortcuts, while Mathematics &amp; Physics shows the lowest cross-family overlap, likely due to numerical complexity. These results confirm that agreement varies by domain and should be interpreted accordingly.</p>
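      <p>The overlap percentages behind these heat-maps can be reproduced by comparing prediction vectors pairwise. A minimal sketch, assuming predictions are stored as equal-length lists per model (names and data are ours, for illustration only):</p>

```python
def agreement_matrix(preds_by_model):
    """Percentage of identical predictions for every ordered pair of models."""
    names = sorted(preds_by_model)
    matrix = {}
    for i, a in enumerate(names):
        for b in names[i:]:
            pa, pb = preds_by_model[a], preds_by_model[b]
            same = sum(x == y for x, y in zip(pa, pb))
            matrix[(a, b)] = matrix[(b, a)] = 100.0 * same / len(pa)
    return matrix

m = agreement_matrix({"gemma-2-9b": [0, 1, 2, 3], "minerva-7b": [0, 1, 2, 4]})
print(m[("gemma-2-9b", "minerva-7b")])  # prints 75.0
```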
      <p>Declaration on Generative AI. During the preparation of this work, the author(s) used ChatGPT (OpenAI) to translate text, paraphrase and reword, improve writing style, and check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023). Benchmarking large language models on Italian, 2025. arXiv:2502.02289. URL: https://arxiv.org/abs/2502.02289.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Casola, T. Labruna, A. Lavelli, B. Magnini, Testing ChatGPT for stability and reasoning: a case study using Italian medical specialty tests (2023). [13] A. Bacciu, C. Campagnano, G. Trappolini, F. Silvestri, DanteLLM: Let’s push Italian LLM research forward!, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 4343-4355. URL: https://aclanthology.org/2024.lrec-main.388/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. Altuna, G. Karunakaran, A. Lavelli, B. Magnini, M. Speranza, R. Zanoli, CLinkaRT at EVALITA 2023: Overview of the task on linking a lab result to its test event in the clinical domain, EVALITA (2023).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] G. Puccetti, M. Cassese, A. Esuli, INVALSI - mathematical and language understanding in Italian: A CALAMITA challenge, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 1168-1175. URL: https://aclanthology.org/2024.clicit-1.129/. [14] F. A. Galatolo, M. G. Cimino, Cerbero-7B: A leap forward in language-specific LLMs through enhanced chat corpus generation and evaluation, arXiv preprint arXiv:2311.15698 (2023).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Rinaldi, J. Gili, M. Francis, M. Goffetti, V. Patti, M. Nissim, Mult-IT: Multiple choice questions on multiple topics in Italian: A CALAMITA challenge, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 1184-1201. URL: https://aclanthology.org/2024.clicit-1.131/. [15] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993. [16] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, T. Wolf, Zephyr: Direct distillation of LM alignment, 2023. arXiv:2310.16944.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the abilities of LAnguage models in ITAlian, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 1054-1063. URL: https://aclanthology.org/2024.clicit-1.116/. [17] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971. [18] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al., Gemma: Open models based on Gemini research and technology, arXiv preprint arXiv:2403.08295 (2024).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot performance of language models, in: International Conference on Machine Learning, PMLR, 2021, pp. 12697-12706. [19] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang,</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Sayed</surname>
          </string-name>
          , Mistral 7b,
          <year>2023</year>
          . URL:
          <string-name>
            <surname>https: M. Chen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Xiao</surname>
          </string-name>
          , Adversarial demonstration at- //arxiv.org/abs/2310.06825. arXiv:
          <volume>2310</volume>
          .06825.
          <article-title>tacks on large language models</article-title>
          , arXiv preprint [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
          </string-name>
          , S. CoarXiv:
          <volume>2305</volume>
          .14950 (
          <year>2023</year>
          ). nia, E. Barba,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          , G. Fiameni, R. Nav-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] <string-name><given-names>P.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cao</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Sui</surname></string-name>, <article-title>Large language models are not fair evaluators</article-title>, <source>arXiv preprint arXiv:2305.17926</source> (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] <string-name><given-names>P.</given-names> <surname>Pezeshkpour</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Hruschka</surname></string-name>, <article-title>Large language models sensitivity to the order of options in multiple-choice questions</article-title>, <source>arXiv preprint arXiv:2308.11483</source> (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] <string-name><given-names>B.</given-names> <surname>Magnini</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zanoli</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Resta</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Cimmino</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Albano</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Madeddu</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Patti</surname></string-name>, <article-title>Evalita-LLM: Benchmarking large language models on Italian</article-title>, <source>arXiv preprint arXiv:2502.02289</source> (<year>2025</year>).
        </mixed-citation>
        <mixed-citation>
          [21] <string-name><given-names>C.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Meng</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Huang</surname></string-name>, <article-title>Large language models are not robust multiple choice selectors</article-title>, <source>arXiv preprint arXiv:2309.03882</source> (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [22] <string-name><given-names>B.</given-names> <surname>Upadhayay</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Behzadan</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Karbasi</surname></string-name>, <article-title>Cognitive overload attack: Prompt injection for long context</article-title>, <source>arXiv preprint arXiv:2410.11272</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [23] <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S. S. S.</given-names> <surname>Das</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zhang</surname></string-name>, <article-title>Verbosity ≠ veracity: Demystify verbosity compensation behavior of large language models</article-title>, <source>arXiv preprint arXiv:2411.07858</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [24] <string-name><given-names>H. W.</given-names> <surname>Chung</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Hou</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Longpre</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Zoph</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Tay</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Fedus</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Dehghani</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Brahma</surname></string-name>, et al., <article-title>Scaling instruction-finetuned language models</article-title>, <source>Journal of Machine Learning Research</source> <volume>25</volume> (<year>2024</year>) <fpage>1</fpage>-<lpage>53</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [25] <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Wen</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Sheng</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>D. D.</given-names> <surname>Zeng</surname></string-name>, <article-title>Unveiling factual recall behaviors of large language models through knowledge neurons</article-title>, <source>arXiv preprint arXiv:2408.03247</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [26] <string-name><given-names>M.</given-names> <surname>Parmar</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Patel</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Varshney</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Nakamura</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Mashetty</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Mitra</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Baral</surname></string-name>, <article-title>LogicBench: Towards systematic evaluation of logical reasoning ability of large language models</article-title>, <source>arXiv preprint arXiv:2404.15522</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [27] <string-name><given-names>J.</given-names> <surname>Ahn</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Verma</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Lou</surname></string-name>, D. Liu, <string-name><given-names>R.</given-names> <surname>Zhang</surname></string-name>, W. Yin, <article-title>Large language models for mathematical reasoning: Progresses and challenges</article-title>, <source>arXiv preprint arXiv:2402.00157</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [28] <string-name><given-names>M.</given-names> <surname>Jin</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Luo</surname></string-name>, S. Cheng, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Hua</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Tang</surname></string-name>, W. Y. Wang, Y. Zhang, <article-title>Disentangling memory and reasoning ability in large language models</article-title>, <source>arXiv preprint arXiv:2411.13504</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
    </ref-list>
    <app-group>
      <app id="appA">
        <p>The study adopts three system prompts that differ systematically in both length and semantic richness, allowing us to examine how sensitive each model is to the amount of contextual information it receives before attempting the task. The three system prompts are presented in Table 3.</p>
      </app>
      <app id="appB">
        <title>B. Concrete answer examples</title>
        <p>To illustrate the kinds of mistakes made by the top-performing model (gemma-2-9B-it, prompt 3), we report two representative items: one from the Mathematics &amp; Physics subset and one from the Logic &amp; Reasoning subset, together with the label and the model's prediction. Each question is shown first in Italian and then in English.</p>
        <p><bold>Mathematics &amp; Physics</bold></p>
        <p>Italian: Quanto vale il rapporto tra il volume e la superficie di un cilindro di raggio 6 cm e altezza 12 cm?</p>
        <p>English: What is the ratio between the volume and the surface area of a cylinder with 6 cm radius and 12 cm height?</p>
        <p>Options: (A) 2 cm (B) 1,5 cm (C) 1 cm (D) 0,5 cm (E) 4 cm</p>
        <p>Correct answer: (A) 2 cm</p>
        <p>gemma-2-9B-it answer: (B) 1,5 cm</p>
      </app>
    </app-group>
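The cylinder item quoted in the appendix can be checked by direct arithmetic: V/S = πr²h / (2πr(r+h)), which for r = 6 cm and h = 12 cm reduces to 432π/216π = 2 cm, matching option (A). A minimal sketch of that check (the function name is ours, purely illustrative, not from the paper):

```python
import math

# Illustrative check of the cylinder item: V = pi*r^2*h, S = 2*pi*r*(r + h).
def cylinder_volume_surface_ratio(radius_cm: float, height_cm: float) -> float:
    volume = math.pi * radius_cm**2 * height_cm
    surface = 2 * math.pi * radius_cm * (radius_cm + height_cm)
    return volume / surface

# r = 6, h = 12: (432*pi) / (216*pi) = 2, so the correct option is (A) 2 cm,
# not the model's answer (B) 1,5 cm.
print(cylinder_volume_surface_ratio(6, 12))
```
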
  </back>
</article>