<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Doctor, Is That You? Evaluating Large Language Models on Italy's Medical School Entrance Exams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruben Piperno</string-name>
          <email>ruben.piperno@unicampus.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agnese Bonfigli</string-name>
          <email>agnese.bonfigli@unicampus.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <email>felice.dellorletta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leandro Pecchia</string-name>
          <email>leandro.pecchia@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Merone</string-name>
          <email>m.merone@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Bacco</string-name>
          <email>l.bacco@unicampus.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Policlinico Universitario Campus Bio-Medico</institution>
          ,
          <addr-line>Via Alvaro del Portillo 200, 00128 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ItaliaNLP Lab, Institute of Computational Linguistics “Antonio Zampolli”, National Research Council</institution>
          ,
          <addr-line>Via Giuseppe Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Research Unit of Intelligent Health-Technologies, Department of Engineering, Università Campus Bio-Medico di Roma</institution>
          ,
          <addr-line>Via Alvaro del Portillo 21, 00128 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of linguistic and cognitive tasks. This study investigates whether such models can succeed in one of Europe's most selective academic assessments: the Italian medical school entrance exam. We evaluate a wide selection of open-weights LLMs, ranging from natively Italian-pretrained models to multilingual and Italian-specialised variants, on a benchmark dataset comprising over 3,300 real-world exam questions across five knowledge domains. Our experiments systematically explore the impact of language-specific pretraining, model size, prompt formulation and instruction tuning on exam performance. Results show that large multilingual models, particularly the Gemma-2-9B family, consistently outperform all other systems, surpassing the official admission threshold under all prompting settings. In contrast, models trained exclusively on Italian data fail to reach this threshold, even with larger architectures or instruction tuning. Additional analyses reveal that high-performing models display lower positional bias and greater inter-model consistency. These findings suggest that cross-domain reasoning and multilingual pretraining are key to handling multi-disciplinary educational tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Italian Medical Admission Test</kwd>
        <kwd>Instruction Tuning</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>NLP in healthcare</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Italian medical school entrance exam is widely
regarded as one of the most competitive and demanding
standardized tests in Europe. Each year, approximately
60,000-65,000 aspiring students face this rigorous
assessment¹, which consists of 60 multiple-choice questions
spanning biology, chemistry, physics, mathematics, and
logical reasoning. Preparation typically begins as early
as the penultimate year of high school, with students
dedicating countless hours to theoretical study, targeted
quizzes, and full-length simulated exams. Despite this
intense effort, only a portion of students manage to be
included in the national ranking: for example, in 2019
only 42.7% achieved the minimum score, while in 2020
this rose to 68.3%². These figures highlight the exam's
reputation as a formidable educational hurdle and a
critical turning point in the academic lives of thousands of
ambitious young individuals.</p>
      <p>Against this backdrop, it is natural to ask what kind of
cognitive skill set is truly necessary to succeed in such
a highly selective process. Within this context, in an
era increasingly shaped by Artificial Intelligence (AI), a
provocative question arises:</p>
      <p>To date, could a powerful Large Language Model (LLM),
trained on vast data of human knowledge and capable of
performing complex reasoning tasks, actually achieve what
so many well-prepared students cannot? Could it earn a
high enough score to gain admission to an Italian medical
school?</p>
      <p>LLMs represent a significant paradigm shift within
Natural Language Processing (NLP), consistently
demonstrating exceptional performance across diverse
linguistic and cognitive tasks. Recent advancements have
illustrated that these models frequently match or exceed
traditional supervised methodologies and, in certain
instances, surpass established human benchmarks [1, 2].</p>
      <p>Complementary works in Italian have shown that
GPT-style models can reach near-human scores on the
national medical-specialty exam [3], introduced CLinkaRT
for clinical information extraction [4], and released
native large-scale benchmarks such as INVALSI-MATE/ITA
[5], Mult-IT [6] and the broader CALAMITA suite [7],
laying the groundwork for systematic Italian-language
evaluation.</p>
      <p>With proven capabilities in natural language
comprehension and logical reasoning, LLMs have exhibited
substantial potential in educational contexts, offering instant
personalized feedback, effectively summarizing intricate
information, and even simulating complex human-like
problem-solving processes.</p>
      <p>However, despite their strong capabilities, previous
studies have pointed out some limitations of LLMs. In
particular, these models can be very sensitive to small
changes in the prompt [8, 9]. One major issue is how
the arrangement of elements within the prompt affects
their performance, especially in tasks that require
understanding and reasoning. For example, prior research
has shown that LLMs are sensitive to both the specific
few-shot examples provided and the order in which
answer choices are presented [10, 11].</p>
      <p>In this work, our key contribution is an in-depth
analysis of how current LLMs, both Italian-specific and
multilingual, perform on the multiple-choice,
multi-disciplinary Italian medical school entrance exam,
investigating the following factors that may affect the
performance:</p>
      <p>Language-specific pre-training. We compare general
multilingual models, both with multilingual pre-training
and Italian specialization, and models specifically
pre-trained in Italian, to assess the role of language-specific
knowledge in a complex downstream task.</p>
      <p>Model size. We evaluate models of different sizes to
understand how parameter count influences performance.</p>
      <p>Prompt design. We explore the impact of prompt
formulation, including zero-shot vs. few-shot prompting, as
well as the effects of prompt length and specificity.</p>
      <p>Instruction tuning. We analyze how models that
underwent instruction tuning (training on datasets designed
to follow human-like task instructions) perform in
comparison to base LLMs when faced with exam-style
tasks.</p>
      <p>² Analysis of Medicine admission test scores.</p>
      <sec id="sec-1-1">
        <p>Content and scale. The benchmark consists of 3,301
high-quality items covering five domains (Table 1). Each
item includes a question text (or stem) along with five
multiple-choice answers, only one of which is correct.
This structure supports two task formulations: a
classification task, when the question is presented with the
answer options, and a generation task, when only the
question is provided and the model is expected to
produce the correct answer. In our experiments, we adopt
the classification setting, supplying both the question
and the five candidate answers to the model.</p>
        <p>Scoring Scheme. Each item is graded individually and
then aggregated through a three-stage pipeline.</p>
        <p>Per-Item Mark. A correct answer yields +1.5 points,
an omission 0, and an incorrect answer -0.4. Negative
marking discourages guessing and keeps the expected
value of random choice below zero.</p>
        <p>Per-Domain Average. Let $s_{i,d}$ be the mark obtained
on the $i$-th question of domain $d \in \{\text{bio}, \text{chem}, \dots\}$
and $N_d$ the number of items in that domain (Table 1).
The mean score for the domain is
$\bar{s}_d = \frac{1}{N_d} \sum_{i=1}^{N_d} s_{i,d} \in [-0.4, 1.5]$. (1)</p>
        <p>Weighted Aggregation. Since domains contribute
unequally to the final mark, mirroring both the weighting
and question distribution of the actual exam, we adopt
the official weights $w_d$ shown in Table 1 to compute the
overall average per item:
$\bar{s} = \sum_{d} w_d \, \bar{s}_d \in [-0.4, 1.5]$. (2)</p>
        <p>Finally, the average is rescaled to the admission-test
scale of $[-24, 90]$ by
$S = 60 \, \bar{s}$. (3)</p>
      </sec>
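      <p>The three-stage scoring pipeline can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the domain names and weights below are hypothetical placeholders, not the official Table 1 values.</p>
      <preformat>
```python
# Sketch of the three-stage scoring pipeline of Eqs. (1)-(3).
# NOTE: the domain weights are illustrative placeholders,
# NOT the official Table 1 values.

def per_item_mark(predicted, gold):
    """+1.5 for a correct answer, 0 for an omission, -0.4 otherwise."""
    if predicted is None:
        return 0.0
    return 1.5 if predicted == gold else -0.4

def domain_mean(marks):
    """Eq. (1): average per-item mark within one domain."""
    return sum(marks) / len(marks)

def final_score(marks_by_domain, weights):
    """Eqs. (2)-(3): weighted per-item average rescaled to the [-24, 90] scale."""
    s_bar = sum(weights[d] * domain_mean(marks_by_domain[d]) for d in weights)
    return 60 * s_bar

weights = {"bio": 0.4, "chem": 0.3, "logic": 0.3}  # hypothetical, summing to 1
all_correct = {d: [1.5, 1.5] for d in weights}
all_wrong = {d: [-0.4, -0.4] for d in weights}
print(round(final_score(all_correct, weights), 6))  # 90.0
print(round(final_score(all_wrong, weights), 6))    # -24.0
```
      </preformat>
      <p>Omissions score 0, so the expected mark of blind guessing stays negative by design, as the scoring scheme intends.</p>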
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The employed corpus³ consists of the official Italian
medical school entrance exams administered in past years,
collected from the public archive of the Ministry of
Education, University and Research (MIUR)⁴. As such, it
faithfully reproduces the exact wording, structure, and
difficulty level encountered by real candidates.</p>
      <sec id="sec-2-1">
        <p>³ https://huggingface.co/datasets/room-b007/test-medicina</p>
        <p>⁴ https://www.miur.gov.it</p>
        <p>Table 1: number of questions and distribution per
domain (Biology; Chemistry; Mathematics &amp; Physics;
Logic &amp; Reasoning; General Knowledge; Total). The
per-domain counts were not recovered from the source.</p>
        <p>Hence a model (or a student) that answers everything
correctly attains $S_{\max} = 90$, whereas one that is wrong
on every question falls to $S_{\min} = -24$. Conversely, a
purely random guesser (i.e., one that selects an answer
uniformly at random and is therefore correct with
probability 1/5) has an expected per-item score of
$\bar{s} = \tfrac{1}{5} \cdot 1.5 + \tfrac{4}{5} \cdot (-0.4) = -0.02$,
leading to an overall expected mark of
$S_{\text{rand}} = 60 \times \bar{s} \approx -1.2$.</p>
        <p>According to the official admission rules, only
candidates who score at least 20 out of 90 are included in the
national ranking. This threshold is fixed each year and
represents the minimum requirement for consideration,
although substantially higher scores are typically needed
to secure a study place.</p>
      </sec>
      <sec id="sec-2-2">
        <p>Selected Models. Table 2 lists every model considered
in our experiments, organized by pre-training origin
(Italian vs. multilingual) and instruction-tuning status. Each
entry reports parameter count, original paper (if any)
and the Hugging Face identifier. This curated pool
encompasses a wide range of model scales, pre-training
strategies, instruction-tuning variants and backbone
architectures, enabling us to rigorously evaluate how these
factors affect each model's ability to tackle the Italian
medical-school entrance test.</p>
      </sec>
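      <p>The expected value of random guessing quoted above is easy to verify numerically; the snippet below is a standalone check, not code from the paper.</p>
      <preformat>
```python
# Expected per-item mark of a uniform random guesser over five options:
# correct with probability 1/5 (+1.5), wrong with probability 4/5 (-0.4).
expected_mark = (1 / 5) * 1.5 + (4 / 5) * (-0.4)
expected_final = 60 * expected_mark  # rescaled to the [-24, 90] admission scale

print(round(expected_mark, 2))   # -0.02
print(round(expected_final, 2))  # -1.2
```
      </preformat>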
    </sec>
    <sec id="sec-3">
      <title>3. Large Language Models</title>
      <sec id="sec-3-1">
        <p>⁵ https://huggingface.co/spaces/evalitahf/evalita_llm_leaderboard</p>
      </sec>
      <sec id="sec-3-2">
        <title>Data Leakage</title>
        <p>To the best of our knowledge, none of the questions
included in the dataset were seen during the pre-training
or fine-tuning of the evaluated models. The official model
cards and papers explicitly exclude proprietary
multiple-choice exam content, including the MIUR admission
tests. While we cannot entirely rule out the possibility of
indirect exposure (e.g., paraphrased content shared in
online forums), we consider the risk of such leakage to
be minimal.</p>
      </sec>
      <sec id="sec-3-3">
        <p>Recent progress in open-weights LLMs has produced
Italian-centric and Italian-specialised systems that still
outperform much larger multilingual baselines on the
EvalITA benchmark⁵ [12]. In this study, we select from
the EvalITA leaderboard the top-performing models with
fewer than or equal to 9B parameters, balancing
state-of-the-art performance and computational feasibility, and
we supplement them with four Italian-specialist models
(DanteLLM [13], Cerbero [14], Loquace, Zefiro [15, 16])
that satisfy the same parameter budget but were not
submitted to the leaderboard. This guarantees architectural
diversity (LLaMA and Mistral families) while maintaining
computational feasibility.</p>
        <sec id="sec-3-3-1">
          <title>4. Experiments</title>
          <sec id="sec-3-3-1-1">
            <title>4.1. Experimental Setup</title>
            <p>All experiments are performed on the dataset described
in Section 2 and the models detailed in Section 3. No
parameter is updated at any point: every model is used
solely in inference mode. Unless otherwise specified in
the original checkpoint, all models are queried with their
default generation parameters (temperature = 1.0, top_p
= 1.0, top_k = 50, repetition_penalty = 1.0); no
hyperparameter tuning is performed.</p>
          </sec>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Selection Criteria</title>
        <p>Models were selected to facilitate the analysis of the
factors outlined in Section 1, while maintaining a constant
computational budget. The selection criteria are
summarised below.</p>
        <p>Language of Pre-Training. We included (i) purely-Italian
LLMs trained from scratch on Italian corpora, (ii)
multilingual models that were later specialised to Italian,
and (iii) non-specialised multilingual models.</p>
        <p>Model Size (Scaling). Families of LLMs offering several
sizes in the 0.35B-9B range, allowing us to gauge the
effect of scale while holding architecture and linguistic
coverage constant.</p>
        <p>Instruction Tuning. Whenever a base and an
instruction-tuned (or DPO-tuned) variant coexist, we included
both.</p>
        <p>Architectural Diversity. We cover the three dominant
open-weights backbones available with an Italian
specialisation under 9B parameters: LLaMA / Gemma /
Mistral [17, 18, 19].</p>
        <p>Table 2: base architecture, parameter count,
instruction-tuning status, checkpoint and reference for each
evaluated model. The table contents were not recovered from
the source.</p>
        <p>Few-Shot Selection. For each topic of the dataset we
randomly sample exactly two in-context demonstrations.
These demonstrations are fixed once and reused across
all models, prompts, and runs. In the zero-shot setting
the demonstrations are omitted, while in the few-shot
setting they are inserted directly into the prompt as fixed
in-context examples.</p>
        <p>Prompting Strategies. Instruction-tuned (IT)
checkpoints are queried under two conditions: plain, where the
prompt text in Table 3 is provided as a single user
message, identical to the one used for base models; and
chat-template, where the same text is embedded in the
model's native chat schema via
tokenizer.apply_chat_template.</p>
        <p>Hardware and Precision. All runs are executed on a
single NVIDIA A100 80GB GPU, with torch.float16
weights.</p>
        <p>Evaluation Metrics. Model performance is assessed
with four complementary metrics:
(i) the overall score $S$ is computed by first averaging the
per-item marks using the official domain weights $w_d$
(Table 1) to obtain a weighted score $\bar{s} \in [-0.4, 1.5]$, and
then applying the linear rescaling $S = 60 \cdot \bar{s}$, which
maps the result to the standard entrance-exam range
$[-24, 90]$ expressed in sixtieths, as explained in Section 2.
Since our setup assumes that the model always selects
an answer among the given options, we do not consider
the possibility of no response; consequently, each item
is scored either +1.5 for a correct answer or -0.4 for an
incorrect one.
(ii) The per-topic score $S_d$ reports the same quantity
computed separately for each domain (Biology, Chemistry,
Mathematics &amp; Physics, Logic, General Knowledge).
(iii) The overall macro-averaged F1 aggregates precision
and recall uniformly across the five answer classes,
making it robust to the pronounced class imbalance of the
dataset, as shown in Table 1.
(iv) The per-topic macro-averaged F1 applies the same
statistic within each domain, highlighting areas where a
model may be disproportionately strong or weak despite
similar global performance.</p>
        <sec id="sec-3-4-1">
          <title>4.2. Prompt Design</title>
          <p>The study adopts three system prompts that differ
systematically in both length and semantic richness,
allowing us to examine how sensitive each model is to the
amount of contextual information it receives before
attempting the task. The three system prompts are
presented in Appendix A.</p>
          <p>P1 is an ultra-minimal template that provides nothing
more than the formal task instruction: the model is told
that it will face a five-option multiple-choice question
and must output only the index of the correct answer. It
contains no role play, no mention of the entrance exam,
and no hint about the underlying knowledge domains.
This prompt therefore functions as a lower bound on
instruction length.</p>
          <p>P2 retains the same output constraint but introduces
a concise role play: the model is asked to "simulate a
candidate who has studied intensively for the Italian
medical admission test". This framing injects moderate
priming about the exam context and about the desired
mindset (efficiency and accuracy) while remaining
compact.</p>
          <p>P3 is the most verbose instruction. It explicitly lists six
knowledge areas (Logic, Biology, Chemistry, Mathematics,
Physics, and General Culture), thereby grounding the
task in the domains required by the real-world exam. The
prompt also reiterates the number-only policy in boldface
to maximise compliance.</p>
          <p>Importantly, all three prompts prescribe the identical
answer format: a single digit in {1, ..., 5} with no
accompanying text or explanation. Consequently, any
variation in performance, positional bias, or inter-model
agreement can be attributed to the incremental context
rather than to differences in expected output style.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <sec id="sec-3-5-1">
          <title>4.3. Qualitative Analysis</title>
        </sec>
      </sec>
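      <p>The classification setting and the number-only output policy can be illustrated with a small helper that renders an item into a minimal P1-style prompt and parses the single-digit reply. The wording and helper names are illustrative, not the exact templates of Appendix A; for instruction-tuned checkpoints the same string would additionally be wrapped with the tokenizer's native chat schema via tokenizer.apply_chat_template.</p>
      <preformat>
```python
import re

def build_prompt(question, options):
    """Render one exam item in the classification setting: the stem plus
    five numbered options, asking only for the index of the correct answer.
    Illustrative wording, not the exact system prompts of Appendix A."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    return (
        "Answer the following multiple-choice question. "
        "Reply with only the number (1-5) of the correct option.\n\n"
        f"{question}\n{numbered}\nAnswer:"
    )

def parse_answer(generation):
    """Extract the first digit in 1-5 from the model output; None if absent."""
    match = re.search(r"[1-5]", generation)
    return int(match.group()) if match else None

prompt = build_prompt(
    "Which organelle produces most of the cell's ATP?",
    ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus", "Lysosome"],
)
print(parse_answer("The correct option is 2."))  # 2
```
      </preformat>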
      <sec id="sec-3-6">
        <p>We complement the quantitative evaluation with a
qualitative analysis aimed at assessing the robustness and
behavioural patterns of the tested models.</p>
        <p>First, we analyse positional bias, i.e., the tendency of
a model to overproduce certain answer indices (e.g., "1"
or "3") regardless of the question. For each model and
prompt, we compute how frequently each option (1-5) is
selected. A uniform distribution would indicate an
unbiased decision process, whereas strong deviations suggest
systematic preferences unrelated to content [21].</p>
        <p>Figure panels: (a) Pretrained natively in Italian - F1;
(b) Multilingual specialised in Italian - F1; (c)
Non-specialised multilingual - F1; (d) Pretrained natively in
Italian - Final score; (e) Multilingual specialised in Italian -
Final score; (f) Non-specialised multilingual - Final score.</p>
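        <p>The positional-bias measurement reduces to a frequency count over predicted option indices. A minimal sketch, illustrative rather than the authors' implementation:</p>
        <preformat>
```python
from collections import Counter

def option_frequencies(predictions, n_options=5):
    """Fraction of predictions falling on each option slot 1..n_options.
    With five options, values near 0.2 indicate an unbiased decision
    process; large deviations signal a positional shortcut."""
    counts = Counter(predictions)
    total = len(predictions)
    return {opt: counts.get(opt, 0) / total for opt in range(1, n_options + 1)}

# A model that always answers "2" shows an extreme positional shortcut:
biased = option_frequencies([2] * 10)
print(biased[2])  # 1.0

# A balanced model stays near the 0.2 uniform baseline:
balanced = option_frequencies([1, 2, 3, 4, 5] * 4)
print(balanced[3])  # 0.2
```
        </preformat>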
        <p>Second, we investigate inter-model agreement to assess
how similarly different models behave when prompted in
the same way. For each prompt and setup, we compare
the predicted answers across all model pairs and measure
the percentage of matching responses. This reveals which
models tend to converge on the same decisions and thus
behave similarly, and which ones diverge more often.</p>
        <p>Together, these two analyses provide insight into the
internal consistency of each model and the structural
similarity between them.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results and Discussion</title>
      <sec id="sec-4-1">
        <p>In this section, we present and analyse the performance
of all evaluated models based on two key metrics:
macro-averaged F1 score and final admission score (Figure 1).
The reported values are computed by averaging results
across three distinct prompt formulations, as we observed
a high degree of consistency across prompts for both
metrics.</p>
      </sec>
      <sec id="sec-4-2">
        <p>The analysis is structured around four main factors
hypothesized to influence model performance:
language-specific pre-training, model size, prompt design, and
instruction tuning.</p>
        <p>Language-Specific Pre-Training. The results highlight
a clear stratification based on language specialization.
Non-specialised multilingual models, particularly
gemma-2-9b-it and gemma-2-9b, consistently outperform
other classes, achieving the highest F1 scores (≈ 74-76%)
and final scores (≈ 58-60) across all settings. Notably,
both models exceed the admission threshold of 20 in
every configuration.</p>
        <p>Model Size. Across model groups defined by
pre-training language origin, increasing model size generally
correlates with improved performance, with only a few
exceptions. In the Gemma series, for instance, the 9B
models (Gemma-2-9b and Gemma-2-9b-it) significantly
outperform their 2B counterparts, particularly in terms
of F1 score. The difference is striking: Gemma-2-9b-it
achieves 74% F1 in zero-shot settings, while Gemma-2-2b-it
remains below 50%. This scaling effect, however, proves
less predictable among models trained natively on Italian
corpora or tailored to Italian. Within the Minerva family,
performance increases modestly from 350M to 7B, though
overall results remain limited. Moreover, Minerva-7B-instruct
shows no substantial advantage over Minerva-3B-base,
and Loquace-7B-Mistral underperforms relative
to Cerbero-7B, despite similar model architecture and
parameter count. Overall, larger models tend to perform
better, but these results suggest that size must be
combined with effective training objectives and data coverage
to yield consistent gains.</p>
        <p>Prompt-Template Comparison. Figure 1 reports the
mean F1-score averaged across the three prompt
templates; for nearly all models, the whiskers are tightly
clustered, reflecting how little the specific wording shifts
the central tendency. Only a few isolated exceptions
emerge (e.g., Zefiro underperforms with P2, Cerbero
shows higher variance in the FS IT setting, and gemma-2-2b
displays slight sensitivity to prompt verbosity). When
runs are examined separately, however, a small yet
consistent ranking emerges: the minimalist P1 systematically
attains the highest scores, the verbose P3 lands in the
middle, and P2 is invariably the weakest. Although the
gap is only about 1-2 F1 points, its persistence across
the entire model suite indicates that concise phrasing
reduces ambiguity, whereas the intermediate framing of
P2 introduces just enough noise to dampen performance.</p>
        <p>Prompt Design. Prompt formulation plays a significant
role in modulating model output. We evaluated
instruction-tuned models under four prompting conditions:
zero-shot (ZS), zero-shot with instruction-tuned
formatting (ZS IT), few-shot (FS), and few-shot with
instruction-tuned formatting (FS IT). All other
non-instruction models were tested only in the ZS and FS
settings.</p>
        <p>Overall, few-shot prompting leads to improved F1
scores compared to zero-shot, particularly for mid-tier
models such as DanteLLM and Cerbero, which show
gains of approximately 5-10 points in F1. In contrast,
high-performing models like gemma-2-9b-it achieve
strong results even in zero-shot settings, indicating
robustness to minimal context and reduced reliance on
explicit examples.</p>
        <p>Interestingly, zero-shot with instruction-tuned
formatting often performs comparably to few-shot, especially
for models with strong instruction-following capabilities.
However, adding instructions to few-shot prompts does
not consistently improve performance; for instance,
Zefiro and Loquace exhibit a decline in F1 score compared
to the few-shot setting without instructions, likely due
to prompt verbosity introducing cognitive overload or
disrupting the model's internal heuristics [22, 23]. These
findings reinforce prior work on large language model
sensitivity to prompt phrasing and structure [8, 9], and
underscore the need for carefully tuned prompt
engineering, particularly in lower-resource or lower-capacity
models.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Instruction Tuning</title>
        <p>Instruction tuning provides consistent improvements
across different model families. For example, the
instruction-tuned gemma-2-2b-it outperforms its base
counterpart, gemma-2-2b, by more than 20 F1-score
percentage points across all prompting conditions. Similar
gains are observed for loquace-7b-mistral over the
untuned loquace-7b, and for minerva-7b-instruct compared
to minerva-7b-base. The impact of instruction tuning is
particularly pronounced in smaller models. While the
performance gap between gemma-2-9b and gemma-2-9b-it
remains modest (typically around 2-3 F1-score
percentage points), tuning significantly enhances the usability
of smaller variants, suggesting that instruction tuning
complements model scaling and is especially valuable in
resource-constrained contexts [24]. Nevertheless,
instruction tuning alone is not sufficient to ensure competitive
performance. Models such as zefiro-7b-dpo-ita and
italia-9b-instruct, despite being instruction-tuned, still
underperform relative to top-tier generalist models. This
underscores the importance of tuning quality and alignment
with the target domain. Interestingly, instruction tuning
appears to be most effective in the zero-shot setting,
likely by helping the model better align with the intent of
the prompt. However, when combined with few-shot
exemplars, it can sometimes introduce redundancy or
ambiguity, potentially hindering performance.</p>
        <sec id="sec-4-3-1">
          <title>5.1. Per-Domain Performance</title>
          <p>To complement the aggregate metrics discussed above,
we conducted a topic-wise analysis of model performance,
reporting final admission scores separately for each
discipline in the entrance exam. This additional evaluation
aims to reveal domain-specific strengths and weaknesses
that may be masked by overall scores, and to better
understand how different model families handle the
heterogeneous cognitive demands of the test.</p>
          <p>For consistency, we selected the best-performing model
within each family, prioritizing the few-shot setting
whenever it led to superior results. The only exception is the
family of non-specialised multilingual models, where the
best performance was achieved in the zero-shot condition,
though this setting proved competitively robust, even
relative to few-shot prompting. The selected models are:
minerva-7b-instruct-v1.0 (natively Italian-pretrained
family), Cerbero-7b (Italian-tuned multilingual family) and
gemma-2-9b-it (non-specialised multilingual family).</p>
          <p>Given the consistency across prompts, we report results
obtained with Prompt 3, which corresponds to the most
verbose instruction. The results, summarized in Figure 2,
show that gemma-2-9b-it achieves the highest final
admission scores across all five disciplines, with particularly
strong margins in Biology and Knowledge &amp; Skills.
Cerbero-7b displays moderate performance overall but
remains consistently below Gemma, with its best result
also in Biology. Minerva-7b-instruct, despite instruction
tuning, obtains markedly lower scores across the board,
with final admission scores that remain below 40% in all
subjects. The relative ranking of the models remains
stable across domains, suggesting that global performance
differences persist even when decomposed by topic.</p>
          <p>Interestingly, all models achieve their highest marks
in Biology and General Knowledge, two domains that
largely reward factual recall, the ability to retrieve
canonical facts memorised during pre-training (e.g.,
"mitochondria produce ATP") [25]. In sharp contrast,
Mathematics &amp; Physics and Logic &amp; Reasoning are
consistently the hardest areas, even for the best-performing
Gemma checkpoint, because they demand multi-step
quantitative or set-theoretic reasoning that current LLMs
still struggle to perform reliably [26, 27]. Recent work
further shows that simply scaling up parameters does not
bridge this gap: effective reasoning requires mechanisms
that disentangle memory retrieval from inference, rather
than larger parametric memory alone [28].</p>
          <p>The discipline-level analysis confirms the trends
observed in the global scores, underscoring the persistent
gap between non-specialized multilingual models and
those trained exclusively on Italian data. These results
highlight that cross-domain generalization remains a
critical differentiator among models. They also reveal that
even high-performing systems can display significant
weaknesses in specific domains, an important
consideration for real-world applications. Overall, the findings
emphasize the crucial role of both model scale and
pre-training diversity in developing LLMs with strong
multidisciplinary capabilities.</p>
        </sec>
        <p>ens, yet does not eliminate, the tendency to latch onto
a preferred position.</p>
        <p>General Multilingual Models. General multilingual
model scores, shown in green, cluster close to the 20%
baseline expected from random choice, with no extreme
outliers. These models appear to read the answers rather
than the position, and they also lead our quantitative
table, hinting at a link between genuine understanding
and low positional bias.</p>
        <p>Crosses mark the best model in each family:
Minerva-7B-instr (blue), Cerbero-7B (red) and Gemma-2-9B-it
(purple). Gemma and Cerbero stay comfortably inside
their inter-quartile bands, whereas Minerva still predicts
about 40% of its answers as label 2, illustrating that even
the best native-Italian model has some residual bias.</p>
        <p>Taken together, the figure draws a clear line: positional
bias is most pronounced in smaller, language-specific
models, softens with targeted fine-tuning, and is almost
absent in large multilingual LLMs. The trend mirrors
overall performance, suggesting that as models learn to
solve the task they naturally stop relying on positional
shortcuts. Monitoring this bias might offer a quick,
model-agnostic check on whether apparent gains stem from
real comprehension or from gaming the answer format.
Concrete examples of typical model errors, including
failures in numerical reasoning and logical minimization,
are provided in Appendix B.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Inter-Model Agreement To gauge how closely the</title>
        <p>models behave, we compute for every pair the percentage
of identical predictions on Prompt 3 and visualise these
overlaps in Figure 4.7</p>
        <p>General Overlap. Figure 4 reveals two compact
blocks of high agreement. The first appears as a
compact central block along the diagonal and involves the
Minerva family: the four base checkpoints (1B, 350M,
3B, 7B) plus the instruction-tuned variant share ≥ 60%
5.2. Qualitative Analysis identical answers, well above the ≈ 35% background
Positional Bias. For every model we counted how its level observed between unrelated models, and, in line
answers are distributed across the five option slots : the with the positional-bias analysis, this consensus largely
resulting percentages make up the box-plots in Figure 36. reflects their tendency to pick the same (often incorrect)</p>
        <p>Native-Italian Models The native-Italian models, option. Interestingly, scaling Minerva from 350 M to 7 B
cyan boxes, peak around 70% on option 2, and two sys- parameters does little to break this uniformity: the 3 B
tems select it in every single question. Such consistency 7 B pair overlaps by ≈ 65%, only marginally higher than
betrays a positional shortcut: the model “trusts” the sec- the 350 M - 1 B pair (≈ 61%), suggesting that increased
ond slot more than the content it contains. capacity amplifies the same bias instead of diversifying</p>
        <p>Italian-Specialised Multilingual Models Italian- behaviours.
specialised multilingual models, presented with the or- The second block, smaller but denser, occupies the
ange distributions, still favour label 2, but the median upper-left portion of the diagonal and links Gemma-2
drops to roughly 45% and the whiskers now range from 9B with its instruction-tuned sibling (Gemma-2 9B-it).
≈ 25% to 90%. Extra Italian supervision therefore weak- Their overlap exceeds 75%; unlike Minerva, they agree
mostly on correct answers, underscoring their stronger</p>
      </sec>
      <sec id="sec-4-5">
        <title>6Shown for Prompt 3, the richest prompt; Prompts 1 and 2 lead to</title>
        <p>the same qualitative picture.</p>
      </sec>
      <sec id="sec-4-6">
        <title>7Prompts 1 and 2 show the same qualitative pattern.</title>
        <p>underlying capability. A size efect is evident here too: 75%, comfortably above the oficial ranking threshold of
Gemma-2 2B and its instruction-tuned counterpart align 20. A handful of Italian-tuned multilingual checkpoints
at ≈ 55%, noticeably lower than the 9 B pair, hinting (e.g. Cerbero-7B) also edged past the cut-of in favourable
that larger multilingual backbones converge toward more prompting conditions, whereas every natively Italian
stable (and more accurate) decision patterns. model remained well below it.</p>
        <p>Between these two extremes sit the multilingual models Detailed error analysis confirms that genuine
reasonspecialised in Italian, such as Cerbero-7B, DanteLLM- ing remains an open challenge. Even top models stumble
7B, and LLaMantino-2 7B. They form a looser band of on Logic and on Mathematics &amp; Physics and display
mid-level agreement (45-60%), often acting as a bridge: residual positional shortcuts, signalling reliance on
surthey overlap moderately with Gemma while retaining face cues rather than deep understanding. Bridging this
some afinity with native-Italian systems. The pattern gap will demand progress in numerical and deductive
mirrors their performance table these models outperform reasoning, stronger defences against prompt variability,
Minerva yet trail the Gemma large pair, indicating that and tighter integration with external tools and retrieval.
Italian-specific fine-tuning narrows the gap without fully In future work, we plan to extend the evaluation to a
matching the breadth of a high-capacity multilingual cloze-style, open-ended generation setting, where models
pre-training. must produce the correct answer without being shown</p>
        <p>Outside the highlighted blocks agreement drops the five multiple-choice options. This format may ofer a
sharply, especially between native-Italian and general more faithful assessment of their reasoning abilities and
multilingual systems, supporting the idea that language- reduce positional biases. The dataset is already formatted
specific pre-training steers models toward distinct deci- to support this task. However, given that only a subset
sion patterns. of LLMs currently achieves suficient performance in the</p>
        <p>Topic-Wise Agreement (see Appendix C). Topic- classification setting, such a shift could pose an even
specific heat-maps paint a similar picture with nuanced greater challenge. In addition, we plan to carry out a
shifts: systematic exploration of decoding strategies and
hyper</p>
        <p>Biology and Chemistry closely reflect the global pat- parameters to quantify how sensitive exam performance
tern: Minerva models cluster tightly, while Gemma leads and answer stability are to these settings. Such ablations
a smaller high-accuracy duo, confirming that factual dis- might provide deeper insights into model robustness and
ciplines accentuate family-specific biases. optimal inference configurations.</p>
        <p>In Logic &amp; Reasoning, the Minerva block tightens
even further, with overlaps reaching ≥ 70%, implying
that reasoning errors are strongly correlated across those Acknowledgments
checkpoints.</p>
        <p>Mathematics &amp; Physics show the widest dispersion: Authors were supported by two projects: 1) the European
cross-family overlaps fall below 40% for most pairs, sug- Union under the Horizon Europe Programme through
gesting numerical items provoke model-specific heuris- the Innovative Health Initiative Joint Undertaking (IHI
tics rather than common patterns. JU) – Project GRACE (Project number: 101194778, Project</p>
        <p>General Knowledge falls in between, exhibiting mod- name: bridGing gaps in caRdiAC health managEment).
erate agreement across the board. 2) the European Union - Next Generation EU - NRRP</p>
        <p>Altogether, these observations confirm the main find- M6C2 - Investment 2.1 Enhancement and strengthening
ing: models that share pre-training data and objectives of biomedical research in the NHS - Project
PNRR-MR1tend to converge on the same answers while larger, 2022-12376635 - ”Early Detection of Rare Inherited
Retibroadly-trained multilingual baselines remain both accu- nal Dystrophies and Cardiac Amyloidosis enhanced by
rate and mutually consistent. Model size amplifies these Artificial Intelligence: the impact on the patient’s
pathtrends, and Italian-specialised multilingual checkpoints way in Campania Region” (CUP: C83C22001540007)
occupy an intermediate space, benefiting from targeted
ifne-tuning yet still trailing the strongest generalist pair.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <sec id="sec-5-1">
        <title>Large multilingual LLMs have begun to clear the Italian</title>
        <p>medical-school admission bar, but they are still far from
matching the level reached by human examinees. On
the 3 301-question benchmark, the 9-billion-parameter
Gemma-2 family scored 58-60 / 90 with macro-1 around
A. Prompts</p>
        <p>Logic &amp; Reasoning
Italian
I partecipanti a una gara di corsa sono 150, di cui 98
maschi, 120 biondi e 90 destrorsi. Qual è il numero minimo
di maschi, biondi e destrorsi che partecipano alla gara?
English
There are 150 participants in a running race: 98 are
male, 120 are blond, and 90 are right-handed. What is
the minimum possible number of participants who are
simultaneously male, blond, and right-handed?
Options: (A) 8 (B) 10 (C) 20 (D) 12 (E) 18
Correct answer: (A) 8
gemma-2-9B-it answer: (B) 10</p>
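        <p>For the item above, the correct answer follows from inclusion-exclusion: the triple intersection of subsets of sizes 98, 120 and 90 within 150 participants contains at least 98 + 120 + 90 − 2·150 = 8 elements, and the bound is attainable. A minimal sketch of this check (the function name is ours, for illustration only):</p>

```python
def min_triple_overlap(a, b, c, n):
    """Minimum possible size of the intersection of three subsets of
    sizes a, b and c drawn from a population of n. By inclusion-exclusion,
    at most (n-a) + (n-b) + (n-c) elements can miss at least one set."""
    return max(0, a + b + c - 2 * n)

# The race item: 98 male, 120 blond, 90 right-handed among 150 participants.
print(min_triple_overlap(98, 120, 90, 150))  # prints 8, i.e. option (A)
```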
      </sec>
      <sec id="sec-5-2">
        <title>A. Prompts</title>
        <p>These snapshots show two common failure modes: a numerical miscalculation in the geometry item and an overestimation in the logical minimization problem, both of which may be influenced by the model’s observed positional bias towards option (B), as shown in Figure 3.</p>
        <p>P1. Domanda con 5 risposte (0-4). Scrivi solo il numero della risposta corretta. Nessun altro output.</p>
        <p>P2. Simula il comportamento di un candidato che ha studiato intensamente per il test di Medicina in Italia. Per ciascuna domanda a risposta multipla con opzioni da 0 a 4, indica solo il numero della risposta che ritieni corretta. Non fornire alcuna motivazione. Concentrati sull’efficienza e sulla precisione.</p>
        <p>P3. Sei uno studente italiano con solide competenze in Logica e Ragionamento, Biologia, Chimica, Matematica, Fisica e Cultura Generale. Stai sostenendo il test di ingresso alla facoltà di Medicina. Ti verrà fornita una domanda a risposta multipla seguita da cinque opzioni numerate da 0 a 4. Il tuo compito è indicare esclusivamente il numero (0-4) corrispondente all’alternativa corretta. Non fornire spiegazioni.</p>
        <p>English (translation)
P1. Question with 5 answers (0-4). Write only the number of the correct answer. No other output.</p>
        <p>P2. Simulate the behaviour of a candidate who has studied extensively for the Italian Medical School admission test. For each multiple-choice question with options 0-4, output only the number of the option you believe is correct. Provide no justification. Focus on efficiency and accuracy.</p>
        <p>P3. You are an Italian student with strong skills in Logic and Reasoning, Biology, Chemistry, Mathematics, Physics, and General Culture. You are taking the entrance exam for the Faculty of Medicine. You will be given a multiple-choice question followed by five options numbered 0 to 4. Your task is to output only the number (0-4) corresponding to the correct option. Do not provide any explanation.</p>
        <p>[Figure panels: (a) Biology, (b) Chemistry, (c) General Knowledge, (d) Mathematics &amp; Physics, (e) Logic &amp; Reasoning.]</p>
      </sec>
    </sec>
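<p>The positional-bias statistics discussed in the qualitative analysis reduce to counting, for each model, how often every option slot is predicted. A minimal sketch (names are ours, not the paper's code; labels follow the paper's 0-4 scheme):</p>

```python
from collections import Counter

def slot_distribution(predictions, n_options=5):
    """Percentage of a model's answers falling in each option slot (0-4)."""
    counts = Counter(predictions)
    return [100.0 * counts.get(slot, 0) / len(predictions)
            for slot in range(n_options)]

# A model that always outputs label 2 exhibits a pure positional shortcut.
print(slot_distribution([2, 2, 2, 2]))  # prints [0.0, 0.0, 100.0, 0.0, 0.0]
```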
    <sec id="sec-6">
      <title>C. Heat-maps of Model Agreement</title>
      <p>Across families, models rarely exceed 40% overlap. Still, domain-specific traits emerge: Logic &amp; Reasoning shows high Minerva coherence (≥ 70%), suggesting shared shortcuts, while Mathematics &amp; Physics shows the lowest cross-family overlap, likely due to numerical complexity. These results confirm that agreement varies by domain and should be interpreted accordingly.</p>
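      <p>The overlap percentages behind these heat-maps can be reproduced by comparing prediction vectors pairwise. A minimal sketch, assuming predictions are stored as equal-length lists per model (names and data are ours, for illustration only):</p>

```python
def agreement_matrix(preds_by_model):
    """Percentage of identical predictions for every ordered pair of models."""
    names = sorted(preds_by_model)
    matrix = {}
    for i, a in enumerate(names):
        for b in names[i:]:
            pa, pb = preds_by_model[a], preds_by_model[b]
            same = sum(x == y for x, y in zip(pa, pb))
            matrix[(a, b)] = matrix[(b, a)] = 100.0 * same / len(pa)
    return matrix

m = agreement_matrix({"gemma-2-9b": [0, 1, 2, 3], "minerva-7b": [0, 1, 2, 4]})
print(m[("gemma-2-9b", "minerva-7b")])  # prints 75.0
```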
      <p>Declaration on Generative AI. During the preparation of this work, the author(s) used ChatGPT (OpenAI) to translate text, paraphrase and reword, improve writing style, and check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023). Benchmarking large language models on Italian, 2025. arXiv:2502.02289. URL: https://arxiv.org/abs/2502.02289.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Casola, T. Labruna, A. Lavelli, B. Magnini, Testing ChatGPT for stability and reasoning: a case study using Italian medical specialty tests (2023). [13] A. Bacciu, C. Campagnano, G. Trappolini, F. Silvestri, DanteLLM: Let’s push Italian LLM research forward!, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 4343-4355. URL: https://aclanthology.org/2024.lrec-main.388/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. Altuna, G. Karunakaran, A. Lavelli, B. Magnini, M. Speranza, R. Zanoli, CLinkaRT at EVALITA 2023: Overview of the task on linking a lab result to its test event in the clinical domain, EVALITA (2023).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] G. Puccetti, M. Cassese, A. Esuli, INVALSI - mathematical and language understanding in Italian: A CALAMITA challenge, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 1168-1175. URL: https://aclanthology.org/2024.clicit-1.129/. [14] F. A. Galatolo, M. G. Cimino, Cerbero-7B: A leap forward in language-specific LLMs through enhanced chat corpus generation and evaluation, arXiv preprint arXiv:2311.15698 (2023).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Rinaldi, J. Gili, M. Francis, M. Goffetti, V. Patti, M. Nissim, Mult-IT: Multiple choice questions on multiple topics in Italian: A CALAMITA challenge, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 1184-1201. URL: https://aclanthology.org/2024.clicit-1.131/. [15] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993. [16] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, T. Wolf, Zephyr: Direct distillation of LM alignment, 2023. arXiv:2310.16944.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the abilities of LAnguage models in ITAlian, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 1054-1063. URL: https://aclanthology.org/2024.clicit-1.116/. [17] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971. [18] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al., Gemma: Open models based on Gemini research and technology, arXiv preprint arXiv:2403.08295 (2024).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot performance of language models, in: International Conference on Machine Learning, PMLR, 2021, pp. 12697-12706. [19] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang,</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Sayed</surname>
          </string-name>
          , Mistral 7b,
          <year>2023</year>
          . URL:
          <string-name>
            <surname>https: M. Chen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Xiao</surname>
          </string-name>
          , Adversarial demonstration at- //arxiv.org/abs/2310.06825. arXiv:
          <volume>2310</volume>
          .06825.
          <article-title>tacks on large language models</article-title>
          , arXiv preprint [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
          </string-name>
          , S. CoarXiv:
          <volume>2305</volume>
          .14950 (
          <year>2023</year>
          ). nia, E. Barba,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          , G. Fiameni, R. Nav-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] <string-name><given-names>P.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cao</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Sui</surname></string-name>, <article-title>Large language models are not fair evaluators</article-title>, <source>arXiv preprint arXiv:2305.17926</source> (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] <string-name><given-names>P.</given-names> <surname>Pezeshkpour</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Hruschka</surname></string-name>, <article-title>Large language models sensitivity to the order of options in multiple-choice questions</article-title>, <source>arXiv preprint arXiv:2308.11483</source> (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] <string-name><given-names>B.</given-names> <surname>Magnini</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zanoli</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Resta</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Cimmino</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Albano</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Madeddu</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Patti</surname></string-name>, <article-title>Evalita-LLM: Benchmarking large language models on Italian</article-title>, <source>arXiv preprint arXiv:2502.02289</source> (<year>2025</year>).
        </mixed-citation>
        <mixed-citation>
          [21] <string-name><given-names>C.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Meng</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Huang</surname></string-name>, <article-title>Large language models are not robust multiple choice selectors</article-title>, <source>arXiv preprint arXiv:2309.03882</source> (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [22] <string-name><given-names>B.</given-names> <surname>Upadhayay</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Behzadan</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Karbasi</surname></string-name>, <article-title>Cognitive overload attack: Prompt injection for long context</article-title>, <source>arXiv preprint arXiv:2410.11272</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [23] <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S. S. S.</given-names> <surname>Das</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Zhang</surname></string-name>, <article-title>Verbosity ≠ veracity: Demystify verbosity compensation behavior of large language models</article-title>, <source>arXiv preprint arXiv:2411.07858</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [24] <string-name><given-names>H. W.</given-names> <surname>Chung</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Hou</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Longpre</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Zoph</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Tay</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Fedus</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Dehghani</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Brahma</surname></string-name>, et al., <article-title>Scaling instruction-finetuned language models</article-title>, <source>Journal of Machine Learning Research</source> <volume>25</volume> (<year>2024</year>) <fpage>1</fpage>-<lpage>53</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [25] <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Wen</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Sheng</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>D. D.</given-names> <surname>Zeng</surname></string-name>, <article-title>Unveiling factual recall behaviors of large language models through knowledge neurons</article-title>, <source>arXiv preprint arXiv:2408.03247</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [26] <string-name><given-names>M.</given-names> <surname>Parmar</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Patel</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Varshney</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Nakamura</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Mashetty</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Mitra</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Baral</surname></string-name>, <article-title>LogicBench: Towards systematic evaluation of logical reasoning ability of large language models</article-title>, <source>arXiv preprint arXiv:2404.15522</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [27] <string-name><given-names>J.</given-names> <surname>Ahn</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Verma</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Lou</surname></string-name>, D. Liu, <string-name><given-names>R.</given-names> <surname>Zhang</surname></string-name>, W. Yin, <article-title>Large language models for mathematical reasoning: Progresses and challenges</article-title>, <source>arXiv preprint arXiv:2402.00157</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [28] <string-name><given-names>M.</given-names> <surname>Jin</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Luo</surname></string-name>, S. Cheng, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Hua</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Tang</surname></string-name>, W. Y. Wang, Y. Zhang, <article-title>Disentangling memory and reasoning ability in large language models</article-title>, <source>arXiv preprint arXiv:2411.13504</source> (<year>2024</year>).
        </mixed-citation>
      </ref>
    </ref-list>
    <app-group>
      <app id="appA">
        <p>The study adopts three system prompts that differ systematically in both length and semantic richness, allowing us to examine how sensitive each model is to the amount of contextual information it receives before attempting the task. The three system prompts are presented in Table 3.</p>
      </app>
      <app id="appB">
        <title>B. Concrete answer examples</title>
        <p>To illustrate the kinds of mistakes made by the top-performing model (gemma-2-9B-it, prompt 3), we report two representative items: one from the Mathematics &amp; Physics subset and one from the Logic &amp; Reasoning subset, together with the label and the model's prediction. Each question is shown first in Italian and then in English.</p>
        <p><bold>Mathematics &amp; Physics</bold></p>
        <p>Italian: Quanto vale il rapporto tra il volume e la superficie di un cilindro di raggio 6 cm e altezza 12 cm?</p>
        <p>English: What is the ratio between the volume and the surface area of a cylinder with 6 cm radius and 12 cm height?</p>
        <p>Options: (A) 2 cm (B) 1,5 cm (C) 1 cm (D) 0,5 cm (E) 4 cm</p>
        <p>Correct answer: (A) 2 cm</p>
        <p>gemma-2-9B-it answer: (B) 1,5 cm</p>
      </app>
    </app-group>
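The cylinder item quoted in the appendix can be checked by direct arithmetic: V/S = πr²h / (2πr(r+h)), which for r = 6 cm and h = 12 cm reduces to 432π/216π = 2 cm, matching option (A). A minimal sketch of that check (the function name is ours, purely illustrative, not from the paper):

```python
import math

# Illustrative check of the cylinder item: V = pi*r^2*h, S = 2*pi*r*(r + h).
def cylinder_volume_surface_ratio(radius_cm: float, height_cm: float) -> float:
    volume = math.pi * radius_cm**2 * height_cm
    surface = 2 * math.pi * radius_cm * (radius_cm + height_cm)
    return volume / surface

# r = 6, h = 12: (432*pi) / (216*pi) = 2, so the correct option is (A) 2 cm,
# not the model's answer (B) 1,5 cm.
print(cylinder_volume_surface_ratio(6, 12))
```
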
  </back>
</article>