<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Easy to Complete, Hard to Choose: Investigating LLM Performance on the ProverbIT Benchmark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Enrico Mensa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Zane</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Calogero J. Scozzaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Delsanto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Milani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele P. Radicioni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Turin</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Independent Researcher</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) have transformed computational linguistics and achieved remarkable performance across numerous natural language processing tasks, yet significant gaps persist in understanding how these systems process culturally embedded linguistic expressions. This paper introduces ProverbIT, a novel Italian benchmark comprising 100 multiple-choice questions designed to evaluate LLMs' ability to complete Italian proverbs. We assess 13 frontier models, including Large Reasoning Models (LRMs) and traditional LLMs, across three tasks: proverb completion, multiple-choice selection with correct answers, and multiple-choice selection without correct answers. Our evaluation reveals surprising results: while nearly all models demonstrate knowledge of the proverbs through successful completion tasks, performance drops dramatically when transitioning to multiple-choice formats without correct answers, with even state-of-the-art reasoning models showing substantial degradation. Through detailed Chain-of-Thought analysis of two LRMs, we uncover that models exhibit a strong bias toward selecting literal synonyms and frequently mention correct proverb endings during reasoning without successfully identifying their absence from the given options. These findings suggest that current LLMs rely heavily on memorized patterns rather than deeper semantic understanding of culturally grounded expressions, highlighting important limitations in their reasoning capabilities for figurative language comprehension.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Proverb Completion</kwd>
        <kwd>Chain-of-Thought Analysis</kwd>
        <kwd>Large Reasoning Models</kwd>
        <kwd>Benchmark Evaluation</kwd>
        <kwd>Multiple-Choice Tasks</kwd>
        <kwd>Semantic Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this paper, we investigate the ability of LLMs in handling culturally grounded linguistic expressions. By manually designing alternative endings for the proverbs, we can systematically examine the types of errors LLMs make and identify common failure patterns. Our investigation shows a striking paradox: while nearly all models possess knowledge of the proverbs in our dataset, performance deteriorates dramatically when moving from auto-completion to multiple-choice selection, with even state-of-the-art LRMs exhibiting substantial performance drops.</p>
      <p>The contribution of this paper is threefold: i) we contribute to Italian NLP benchmarks by introducing a novel dataset that addresses the under-representation of Italian in comprehensive language model evaluation resources [5]; ii) we conduct a thorough evaluation across 13 frontier models, including LLMs, LRMs, and smaller local models, providing a comprehensive performance analysis on proverb completion tasks; and iii) we investigate LRM performance through detailed Chain-of-Thought (CoT) analysis, revealing insights into reasoning strategies and cultural language understanding mechanisms in contemporary language models.</p>
      <p>The paper is organized as follows: Section 2 reviews the literature on LLM performance with idioms and proverbs; Section 3 illustrates the ProverbIT dataset, its constructive rationale and its features; Section 4 presents our evaluation of frontier LLMs on the task along with a detailed error and CoT analysis; and Section 5 summarizes the work with a final discussion and an overview of future research directions.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>Standardized benchmarks have been fundamental in evaluating the performance of LLMs across a variety of natural language processing tasks. Early efforts, such as GLUE [6] and SuperGLUE [7], were based on multi-task evaluation frameworks including tasks such as paraphrase detection, grammatical acceptability, and natural language inference. More recently, benchmarking efforts have expanded into other domains, such as mathematics [8, 9], coding [10, 11], and complex logical reasoning tasks [12, 13]. These advances reflect the increasing demand for language models capable of handling a broader range of cognitive challenges.</p>
      <p>Focusing on Italian, dedicated benchmark efforts have emerged to address language-specific issues and reduce reliance on translated tasks, which can introduce cultural bias and translation artifacts. Notable among these are CALAMITA [14], a collaborative and evolving benchmark initiative, and Evalita-LLM [15]. Both focus on tasks originally designed in Italian and include a mix of generative and multiple-choice tasks.</p>
      <p>While these benchmarks cover a broad spectrum of tasks, figurative language phenomena such as idioms and proverbs have received comparatively little attention. Idioms are a well-known source of complexity in natural language understanding [16], as their meanings often cannot be inferred literally and require contextual and cultural knowledge. Fornaciari et al. [3] introduced an expert-curated English dataset for idiom detection, showing that local LLMs struggle to distinguish idiomatic from literal usage. In the context of multilingual approaches, Tedeschi et al. [17] presented the ID10M dataset, a high-quality, automatically generated resource covering ten languages, along with a multilingual Transformer model for idiom identification. Significant differences in LLM performance across languages and figurative types were reported by Khoshtab et al. [18], who evaluated simile and idiom interpretation under various prompt strategies. Notably, CoT prompting was particularly effective for similes in smaller models. Kim et al. [19] presented a dataset of idioms in six languages, each paired with its corresponding meaning. The authors conducted a comprehensive evaluation of LLMs’ ability to process idioms, showing that models rely not only on memorization but also on a hybrid approach that integrates contextual cues and reasoning, suggesting that idiom understanding emerges from an interplay between internal knowledge retrieval and inference. Moreover, their results highlight a performance gap between high-resource and lower-resource languages.</p>
      <p>Idioms have also been an area of interest in the machine translation (MT) literature, where multiple studies have explored how models translate figurative expressions across languages. Lee et al. [20] presented TIDE, a dataset of 512 sentence pairs containing idioms in disambiguating contexts, with one sentence using the idiom literally and the other figuratively. They compared MT systems and language models, finding that the former consistently translate English idioms literally, while the latter are more context-aware, even though performance varies across target languages. One strategy to improve idiom translation, particularly in smaller language models, involves the use of knowledge bases (KBs). Li et al. [21] introduced IDIOMKB, a multilingual idiom KB developed using LLMs, designed to enhance translation quality by providing access to idioms’ figurative meanings. However, this approach does not preserve the cultural and stylistic nuances that make idioms so distinctive. To address this issue, Donthi et al. [22] proposed two alignment-based methods that aim to identify idiomatic counterparts in the target language. Their results, based on human evaluation across multiple language pairs, show improved cross-lingual idiomatic fidelity and better preservation of cultural authenticity.</p>
      <p>The work most closely related to ours is by Liu et al. [23], who focused specifically on proverbs. They introduced the MAPS dataset, designed to evaluate proverb understanding within conversational contexts across six languages. Their evaluation of multilingual LLMs revealed that while many models “know” a limited set of proverbs, memorization does not guarantee understanding or contextual reasoning. Models also struggled with figurative proverbs, particularly when asked to select incorrect answers instead of correct ones. Wang et al. [24] extended the MAPS dataset to evaluate MT models and LLMs on proverb translation. Their experiments showed that LLMs generally outperform traditional MT models, confirming their superior ability to capture idiomatic nuances.</p>
    </sec>
    <sec id="sec-2">
      <title>3. ProverbIT Dataset</title>
      <sec id="sec-2-1">
        <title>3.1. Data Collection and Dataset Creation</title>
        <sec id="sec-2-1-1">
          <title>The ProverbIT1 dataset is composed of 100 multi-choice</title>
          <p>questions, each regarding the completion of a specific
Italian proverb. To create the dataset, we started from
an initial set of 200 common Italian proverbs [25] from
which we selected 100 of the most commonly used. This
process was carried out by three of the authors, which
are all native Italian speakers. Each proverb was then
manually split into its beginning and its ending, with the
point of division determined to maintain the proverb’s
semantic coherence in the initial part while allowing
for a clear, unambiguous completion. For each proverb,
four distinct incorrect alternative endings were manually
created, leveraging the following constructive rationale:
For sake of clarity we provide an example in English for
each of the aforementioned continuations. Completions
for the proverb Actions speak... louder than words could
be:</p>
          <p>A) prouder than swords
B) at higher volume compared to speech
C) quieter than words</p>
          <p>D) when they do
As this example shows, the synonym ending is not built
on the figurative meaning of the proverb, but it is the
literal synonym of the original ending (e.g., at higher
1The full dataset can be downloaded at https://huggingface.co/
datasets/emensa/proverbIT.
understanding within conversational contexts across six volume compared to speech rather than beyond what
languages. Their evaluation of multilingual LLMs re- words can say). This design was adopted to ensure that
vealed that while many models “know” a limited set of models cannot simply rely on surface-level syntactic
patproverbs, memorization does not guarantee understand- terns but must engage in deeper semantic and contextual
ing or contextual reasoning. Models also struggled with reasoning to identify the absence of the correct
compleifgurative proverbs, particularly when asked to select in- tion.
correct answers instead of correct ones. Wang et al. [24]
extended the MAPS dataset to evaluate MT models and 3.2. Prompt
LLMs on proverb translation. Their experiments showed
that LLMs generally outperform traditional MT models, Given each proverb in ProverbIT, we can then fill a
confirming their superior ability to capture idiomatic simple prompt template that can be submitted to the
nuances. models:</p>
          <p>We specify that the proverb must be completed exactly,
and also that there are no typos in the options since
we noticed that models often assume the presence of
• A is an ending that has similar sounds to the user mistakes and modify their responses based on this
original continuation, often with an absurd/non- assumption. Since all provided endings are completely
sensical meaning. invented and thus incorrect, we expect models to always
• B is a non assonant literal synonym of the original answer E) None of the other answers. Finally we provide
ending. an Italian example [with translation] from the actual
• C is the inverse of the original proverb ending, dataset.</p>
          <p>trying to maintain the assonance when possible.
• D is a tautological/trivial ending of the proverb, Example of proverb from the dataset
with no assonance.</p>
        <p>We specify that the proverb must be completed exactly, and also that there are no typos in the options, since we noticed that models often assume the presence of user mistakes and modify their responses based on this assumption. Since all provided endings are completely invented and thus incorrect, we expect models to always answer E) None of the other answers. Finally, we provide an Italian example [with translation] from the actual dataset.</p>
        <p>Example of proverb from the dataset
A buon intenditor,... [To a wise man]
A) ...foche canore [singing seals]
B) ...zero chiacchiere [zero chatter]
C) ...molte parole [many words]
D) ...è chiaro tutto [everything is clear]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>More examples can be found in the Supplementary Materials.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption>
            <p>Models employed in the evaluation, with their full names, providers, and number of parameters.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Full name</th><th>Provider</th><th>Num. Parameters</th></tr>
            </thead>
            <tbody>
              <tr><td>Claude Sonnet 4</td><td>claude-sonnet-4</td><td>Anthropic</td><td>Undisclosed</td></tr>
              <tr><td>Claude Sonnet 4 Thinking</td><td>claude-sonnet-4-thinking</td><td>Anthropic</td><td>Undisclosed</td></tr>
              <tr><td>GPT 4o</td><td>gpt-4o</td><td>OpenAI</td><td>Undisclosed</td></tr>
              <tr><td>GPT o3</td><td>gpt-o3</td><td>OpenAI</td><td>Undisclosed</td></tr>
              <tr><td>DeepSeek V3</td><td>deepseek-chat-v3-0324</td><td>DeepSeek</td><td>671B</td></tr>
              <tr><td>DeepSeek R1</td><td>deepseek-r1-0528</td><td>DeepSeek</td><td>671B</td></tr>
              <tr><td>Gemini 2.5 Flash</td><td>gemini-2.5-flash-preview-05-20</td><td>Google</td><td>Undisclosed</td></tr>
              <tr><td>Gemini 2.5 Pro</td><td>gemini-2.5-pro-preview-06-05</td><td>Google</td><td>Undisclosed</td></tr>
              <tr><td>Qwen 3</td><td>qwen3-235b-a22b</td><td>QwQ</td><td>235B</td></tr>
              <tr><td>Grok 3</td><td>grok-3-beta</td><td>xAI</td><td>Undisclosed</td></tr>
              <tr><td>LLama 4 Maverick</td><td>llama-4-maverick</td><td>Meta</td><td>400B</td></tr>
              <tr><td>Mistral Small 3.1</td><td>mistral-small-3.1-24b-instruct</td><td>Mistral</td><td>24B</td></tr>
              <tr><td>Gemma 3</td><td>gemma-3-27b-it</td><td>Google</td><td>27B</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <p>In this Section, we describe the experimental setup developed for evaluating 13 different frontier models on the ProverbIT benchmark, followed by an error analysis and an in-depth examination of the underlying chain-of-thought processes for two LRM models.</p>
      <sec id="sec-3-0">
        <title>4.1. Experiments</title>
        <p>In addition to evaluating the models on the ProverbIT benchmark introduced in the previous Section, we also perform two ancillary tasks to assess whether the models possess knowledge of the proverbs. We refer to the ProverbIT benchmark as the base task, while the two ancillary tasks are described in the following.</p>
        <p>Completion Task. Instead of a multiple-choice approach, we ask the model to directly complete a proverb given its beginning. This task establishes whether the model is familiar with the requested proverbs. The prompt used for the completion task is as follows:</p>
        <p>Completion Prompt Template (translated)
Complete the proverb exactly:
[Proverb beginning]...
Reply with the ending only, do not add further comments.</p>
        <p>Base + true ending Task. We add to each multiple-choice question a new option that is the true ending of the proverb. By preserving the multiple-choice format but also providing the correct ending, we expect similar results w.r.t. the completion task.</p>
        <sec id="sec-3-0-1">
          <title>4.1.1. Evaluation &amp; Metrics</title>
          <p>In the base and base + true ending tasks, we computed the accuracy, defined as the ratio of correctly chosen options over the multiple choices. Specifically, each prompt was presented to each model three times and the final answer was determined through a majority vote between them. If no majority emerged across the three runs, the response was marked as incorrect.</p>
          <p>For the automatic calculation of the accuracy on the completion task, we compute an edit-distance-based similarity between the provided completion and the correct ending of the proverb (using the implementation from https://docs.python.org/3/library/difflib.html). As with the other tasks, each prompt is submitted three times. If the similarity exceeds a threshold of 0.8 in at least two out of three runs, we consider the answer correct.</p>
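          <p>The following is a minimal Python sketch of this scoring procedure, reading the 0.8 threshold as a difflib SequenceMatcher similarity ratio in [0, 1]; the helper names are ours, not part of the benchmark code.</p>
          <preformat>import difflib
from collections import Counter

def completion_is_correct(runs: list[str], gold_ending: str,
                          threshold: float = 0.8) -> bool:
    """Completion task: correct if at least two of the three sampled
    completions are similar enough to the gold ending."""
    hits = sum(
        difflib.SequenceMatcher(None, r.strip().lower(),
                                gold_ending.lower()).ratio() > threshold
        for r in runs
    )
    return hits >= 2

def majority_vote(answers: list[str]) -> str | None:
    """Multiple-choice tasks: majority vote over three runs;
    None (marked incorrect) if no option reaches two votes."""
    letter, count = Counter(a.strip().upper() for a in answers).most_common(1)[0]
    return letter if count >= 2 else None</preformat>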
          <p>For all tasks, a zero-shot prompting strategy was employed and all requests were sent separately via API, specifically using the OpenRouter unified interface [26]. For all models, the temperature was left at the default OpenRouter value of 1.0, since we countered their non-deterministic nature by employing a majority vote.</p>
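          <p>As an illustration, a single zero-shot request can be sent through OpenRouter’s OpenAI-compatible chat completions endpoint as sketched below; the model identifier format and the commented reasoning-budget field are assumptions to be checked against the current OpenRouter documentation.</p>
          <preformat>import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def query_model(model_id: str, prompt: str) -> str:
    """Send one request; each of the three runs per prompt is a separate call."""
    payload = {
        "model": model_id,  # e.g. "openai/gpt-4o" (identifier format assumed)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # OpenRouter default, left unchanged
        # For reasoning models, a thinking budget can additionally be
        # requested (e.g. a "reasoning" object with a token limit); the
        # exact field is provider-dependent and assumed here.
    }
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json=payload,
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# Three independent runs per prompt feed the majority vote:
# answers = [query_model("openai/gpt-4o", prompt) for _ in range(3)]</preformat>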
        </sec>
      </sec>
      <sec id="sec-3-1">
        <title>4.2. Models</title>
        <p>In our experiments, we employed a diverse set of state-of-the-art models, including both traditional LLMs and LRMs, aiming to cover a wide range of providers. Whenever possible, we selected both a flagship LLM and its corresponding LRM from the same organization, allowing us to directly compare their performance and assess the improvements brought by the reasoning mechanism. The complete list of models and their full names can be found in Table 1.</p>
        <sec id="sec-3-1-1">
          <title>From Anthropic, we evaluated Claude Sonnet 43 and</title>
          <p>its reasoning variant Claude Sonnet 4 Thinking. From</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>OpenAI, we included GPT 4o [27] and GPT o3.4 From</title>
          <p>DeepSeek, we employed DeepSeek V3 [28] and DeepSeek</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>R1 [29]. From Google, we tested Gemini 2.5 Flash5 and</title>
          <p>Gemini 2.5 Pro.6 We also included Qwen 3 [30], a model
optimized for reasoning developed by QwQ, Grok 37
from xAI, and LLama 4 Maverick8 from Meta. We also
included two smaller models suitable for local
deployment, as these are commonly used in privacy-sensitive
contexts and contexts that require less computational
resources. Although privacy concerns are not relevant
for the ProverbIT dataset, these models were included to
ensure comprehensive evaluation coverage. In particular
we tested Mistral Small 3.19 from Mistral and Gemma 3
[31] from Google. Regarding models specifically trained
on Italian, we preliminarily tested the Italian LLM
Minerva [32] but found that it was unable to respond
coherently, often failing to follow the requested response
format (i.e., in providing a letter corresponding to a given
choice).</p>
          <p>Given that some reasoning models require a
mandatory thinking budget while others do not, we set a
reasonable thinking budget of 2000 tokens for o3, Sonnet 4,
and Gemini 2.5 Pro, while DeepSeek R1 and Qwen 3 were
left unlimited. Moreover, the first three models output
only a summarization of their CoTs, while the latter two
provide their complete trace. This makes DeepSeek R1
and Qwen 3 ideal candidates for the CoT analysis that we
performed. We observed that only 22 out of 600 CoTs
from these two models exceeded the 2000-token limit,
half of them resulting in an incorrect answer anyway.
4.3. Results &amp; Discussion
3https://www.anthropic.com/claude/sonnet
4https://openai.com/index/o3-o4-mini-system-card/
5https://storage.googleapis.com/model-cards/documents/gemini2.5-flash-preview .pdf
6https://storage.googleapis.com/model-cards/documents/gemini2.5-pro-preview.pdf
7https://x.ai/news/grok-3
8https://ai.meta.com/blog/llama-4-multimodal-intelligence/
9https://mistral.ai/news/mistral-small-3-1
In this Section we examine the results of the evaluation
and provide a detailed discussion on the errors.</p>
        <p>In Table 2, we present the results recorded on the ProverbIT benchmark and the ancillary tasks. Models are sorted based on their performance on the ProverbIT task: such an ordering highlights a clear separation in performance between thinking and non-thinking models.</p>
        <p>By comparing the performances on the ancillary tasks and on the ProverbIT benchmark, we uncover an unexpected phenomenon: virtually all non-thinking models suffer from steep performance deterioration. For instance, GPT 4o achieves 92% on the base + true ending task but only 64% on ProverbIT. Claude Sonnet 4 loses 47 percentage points, DeepSeek V3 loses 52 percentage points, and Grok 3 drops by 69 percentage points. The most dramatic performance decline occurs with LLama 4 Maverick, which plummets from 75% and 88% on the ancillary tasks to merely 6% on ProverbIT. Notably, Mistral’s performance, given its relatively modest size (24B parameters), suggests that domain-specific optimization, through more focused Italian and broader European-language training [33], may play a significant role in enhancing model efficiency for culturally grounded tasks.</p>
          <p>LRMs are less prone to this performance drop; however,
we still observe significant deterioration of about 10-20
percentage points. These findings suggest that the
transition from pattern completion to discriminative reasoning
fundamentally challenges current language models’
understanding mechanisms. The substantial performance
gaps confirm that models rely heavily on memorized
linguistic patterns rather than genuine semantic
comprehension of proverbs. This deterioration becomes particularly
pronounced when models must evaluate and reject
plausible but incorrect alternatives, highlighting limitations in
their ability to engage in deeper cultural and contextual
reasoning. The relatively better performance of
reasoning models suggests that explicit reasoning processes can
partially compensate for these limitations, though
significant challenges remain in achieving robust figurative
language understanding.</p>
          <p>Detailed error analysis. Table 3 details the
categorization of incorrect responses as a percentage of total errors.</p>
        <p>The results reveal a strong skew toward option B, highlighting a consistent preference among the models for selecting synonyms, even if they are literal and not figurative. This pattern is less evident among local models, whose responses appear more equally distributed, possibly reflecting greater variability or reduced confidence in their outputs. The complete report of each model’s responses is provided in Table 5 in the Supplementary Materials.</p>
        <sec id="sec-3-2-1">
          <title>4.3.1. CoTs Analysis</title>
          <p>For the CoT analysis, we only take into consideration DeepSeek R1 and Qwen 3, as they are the only models that provide a full CoT trace rather than a summarization. As discussed earlier, these models were run with an unlimited thinking budget.</p>
          <p>Since we ran the ProverbIT benchmark three times in order to compute the majority vote for the accuracy, we automatically analyzed a total of 600 CoTs (300 for each model). Table 4 provides a preliminary overview of our analyses. Most prompts produced a non-empty CoT, and from our investigation we discovered two interesting facts:</p>
          <p>[Table 4: overview of the analyzed CoTs for DeepSeek R1 and Qwen 3, reporting the number of analyzed CoTs, empty CoTs, average word counts, CoTs exceeding 2000 words, and the languages used.]</p>
          <p>• Overthinking: Models occasionally exhibit overthinking behavior [34], a documented phenomenon affecting LRMs where they continuously re-evaluate their assessment of the correct answer. This results in CoTs exceeding 4,000 words in length, compared to an average of approximately 700 words for typical responses.
• Language inconsistency in CoTs: Approximately half of DeepSeek’s CoTs are generated in English while the other half appear in Italian, with occasional language switching occurring within individual reasoning traces (automatic language detection was performed via https://pypi.org/project/langdetect/). In contrast, Qwen consistently produces CoTs exclusively in English (except when citing the question options). This multilingual reasoning presents significant interpretability challenges, particularly for tasks involving idiomatic content, as cultural nuances and figurative meanings may be lost or misrepresented when reasoning shifts between languages [35, 36]. We hypothesize that this limitation stems from these models’ training distribution, which prioritizes Chinese and English content with comparatively limited Italian language exposure.</p>
          <p>We analyzed the non-empty CoTs by tracing mentions of correct and incorrect answers within the thinking process. We examined separately the cases where the model responds correctly versus incorrectly. Specifically, the left subfigures of Figures 1 and 2 show the absolute number of mentions of the correct answer (which is always E, None of the others) and of all incorrect answers when the model answers correctly. Conversely, the right subfigures show the absolute number of mentions of the correct answer and of the specific incorrect answer provided when the model responds incorrectly. We additionally plot as a dotted line the absolute number of mentions of the correct completion of the proverb (which was not given in the prompt).</p>
          <p>[Figure 1: absolute number of mentions across the CoT for DeepSeek R1. Left panel: when DeepSeek R1 answers correctly (213 instances), mentions of any of A/B/C/D, mentions of the answer (E), and mentions of the correct proverb ending. Right panel: when DeepSeek R1 answers incorrectly (87 instances), mentions of the answer (either A or B or C or D), mentions of E, and mentions of the correct proverb ending. The y-axis reports the absolute number of mentions.]</p>
          <p>These graphs reveal that both models continuously mention all possible answers throughout their reasoning process, while the spikes toward the end indicate that models reach a decision only in the final lines of their CoTs. However, this decision-making appears tentative, as alternative options remain heavily mentioned alongside the chosen answer, suggesting low confidence in the final selection.</p>
          <p>The dotted lines clearly demonstrate that models are aware of the correct proverb ending and repeatedly reference it throughout their thinking process. These observations highlight a critical disconnect: while models can successfully recall the correct proverb completion, they fail to recognize its absence among the provided choices. This suggests that the challenge lies not in knowledge retrieval but in the discriminative reasoning required to identify when the correct answer is unavailable, revealing fundamental limitations in how current LRMs handle negative reasoning tasks [37].</p>
          <p>Inconsistency between CoTs and answers. As a final finding, we also discovered that model responses were sometimes inconsistent with their corresponding CoT. For instance, out of the 113 incorrect responses from Qwen, 14 are inconsistent, ending with sentences like “The answer is X”, while the actual given answer was not X. Remarkably, in all of these instances, adhering to the CoT-delivered conclusion would have resulted in a correct answer. Similarly, for DeepSeek R1, 6 of the 87 incorrect responses exhibited such a discrepancy, 5 of which (5.7%) would have led to the correct answer. This behavior has been observed in prior work [38].</p>
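          <p>A heuristic check for such CoT/answer mismatches is sketched below; the regular expression is an illustrative assumption, not the exact extraction rule we applied.</p>
          <preformat>import re

def cot_conclusion(cot: str) -> str | None:
    """Heuristically extract the last option letter the CoT commits to,
    e.g. from closing sentences like 'The answer is C.' or 'Quindi scelgo A'."""
    letters = re.findall(r"\b([A-E])\b", cot[-300:])  # look near the end
    return letters[-1] if letters else None

def is_inconsistent(cot: str, final_answer: str) -> bool:
    """Flag responses whose final answer differs from the CoT conclusion."""
    concluded = cot_conclusion(cot)
    return concluded is not None and concluded != final_answer.strip().upper()[:1]</preformat>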
          <p>In the Supplementary Materials, we report two complete CoTs: one showing how reasoning in English leads to a wrong answer, and one showing a mismatch between the CoT conclusion and the final answer.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions &amp; Future Work</title>
      <p>In this work, we introduced ProverbIT, a novel Italian benchmark designed to evaluate Large Language Models’ ability to handle culturally grounded linguistic expressions through proverb completion tasks. Our comprehensive evaluation of 13 frontier models, including both Large Reasoning Models and traditional LLMs, provides significant insights into the limitations of current language understanding systems.</p>
      <p>Our findings demonstrate a relevant gap between models’ knowledge of proverbs and their ability to apply this knowledge in discriminative reasoning tasks. While nearly all evaluated models successfully complete proverbs when presented with direct completion prompts, performance drops dramatically when the same task is reformulated as multiple-choice selection without correct answers available. Even state-of-the-art reasoning models like GPT o3 and Gemini 2.5 Pro experience substantial degradation.</p>
      <p>The Chain-of-Thought analysis of DeepSeek R1 and Qwen 3 further highlights this limitation: both models frequently mention correct proverb endings during their reasoning process yet fail to recognize their absence from the provided options, highlighting fundamental challenges in negative reasoning capabilities. Moreover, we uncovered concerning inconsistencies in reasoning model behavior, including overthinking, language switching during reasoning, and discrepancies between CoT conclusions and final answers.</p>
      <p>Future work should focus on investigating this mismatch between knowledge retrieval and discriminative reasoning more deeply, particularly examining how models handle negative reasoning tasks even in seemingly trivial scenarios where the correct answer is absent from the given options. Additional evaluation methodologies should also be incorporated, including answer randomization techniques as proposed in the literature [39].</p>
      <p>In summary, our results underscore the critical importance of developing language-specific benchmarks that capture cultural and linguistic nuances often lost in English-centric evaluations, showing that current LLMs rely heavily on memorized patterns rather than deeper semantic understanding of culturally grounded expressions.</p>
    </sec>
    <sec id="sec-4">
      <title>Supplementary Materials</title>
      <sec id="sec-4-1">
        <title>Proverb Examples</title>
        <p>In the following, we report a few examples in Italian from
the ProverbIT benchmark. Each example presents the
beginning of a proverb followed by five possible
continuations. Only one option is correct, which is E) Nessuna
delle altre risposte (None of the other answers).</p>
        <p>Example 1
Non è tutto oro... [Not all is gold...]
A) ...quel che lucida [that which polishes]
B) ...ciò che brilla [that which shines]
C) ...ciò che è opaco [that which is dull]
D) ...quel che è metallico [that which is metallic]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 2
Vivi e... [Live and...]
A) ...lascia ridere [let laugh]
B) ...fai decorrere [let elapse]
C) ...giudica tutti [judge everyone]
D) ...morirai più tardi [you will die later]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 3
Il gioco è bello quando... [A game is fun when...]
A) ...cura dopo [it cures afterwards]
B) ...è breve [it is brief]
C) ...dura molto [it lasts long]
D) ...tutti si divertono [everyone has fun]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 4
A caval donato... [Of a gift horse...]
A) ...non si guarda la brocca [one does not look at the jug]
B) ...non si ispezionan le fauci [one does not inspect the jaws]
C) ...si guarda in bocca [one looks in the mouth]
D) ...si mette la sella [one puts on the saddle]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 5
Chi non beve in compagnia... [Who does not drink in company...]
A) ...o è un lardo o è una spa [is either a lard or a spa]
B) ...o è un delinquente o è uno spione [is either a criminal or a snitch]
C) ...non è un ladro né una spia [is neither a thief nor a spy]
D) ...guida la macchina [drives the car]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 6
Chi dorme... [Who sleeps...]
A) ...non piglia cesti [catches no baskets]
B) ...non acchiappa trote [catches no trout]
C) ...piglia pesci [catches fish]
D) ...non è sveglio [is not awake]
E) Nessuna delle altre risposte [None of the other answers]</p>
      </sec>
      <sec id="sec-4-2">
        <title>Complete CoTs</title>
        <p>In the following, we report two examples of CoT. In the
ifrst, the use of English leads to an incorrect answer,
while in the second, an answer mismatch is observed:
although the CoT concludes with option C, the model
selects option B.</p>
        <p>CoT of Qwen
The use of English leads to a wrong answer
Okay, let’s see. The user wants me to complete
the proverb "Lontano dagli occhi,..." by choosing
the correct option from A to E. The options are:
A) ...lontano dalle suore, B) ...distante dal
miocardio, C) ...vicino al cuore, D) ...si vede poco, E)
Nessuna delle altre risposte.</p>
        <p>First, I need to recall the actual Italian proverb.</p>
        <p>The common one I know is "Lontano dagli occhi,
vicino al cuore," which translates to "Out of sight,
out of mind" but literally means "Far from the
eyes, close to the heart." Wait, that doesn’t make
sense in terms of translation, but the structure
is similar to English proverbs where the second
part contrasts or complements the first. So the
original proverb should have a contrasting part.</p>
        <p>In Italian, "vicino al cuore" is the correct
completion, meaning that even if someone is far from
your eyes (not seen), they are still near your
heart (in your thoughts or loved).</p>
        <p>Now, checking the options. Option C is "...vicino
al cuore," which matches what I know. The other
options: A) "lontano dalle suore" (far from the
nuns) - doesn’t make sense in this context. B)
"distante dal miocardio" (distant from the
myocardium) - that’s a medical term, not a proverb.</p>
        <p>D) "si vede poco" (you see little) - logical but
not the traditional saying. E) would only be
correct if none of the options are right, but since
C seems correct, E isn’t the choice here. So the
answer must be C.</p>
        <p>C.</p>
        <p>CoT of DeepSeek
Answer Mismatch w.r.t. the CoT
Okay, ho bisogno di completare correttamente
il proverbio "Non è tutto oro..." scegliendo tra
le opzioni date. Le opzioni sono: A) ...quel che
lucida B) ...ciò che brilla C) ...ciò che è opaco
D) ...quel che è metallico E) Nessuna delle altre
risposte
Prima, mi ricordo come va il proverbio completo.</p>
        <p>So che il proverbio classico è "Non è tutto oro quel che luccica". Tuttavia, nessuna delle opzioni offre esattamente "luccica".</p>
        <p>Guardando le opzioni: A) "quel che lucida" -
"Lucida" è simile a "luccica", dato che entrambi
significano brillare o splendere. B) "ciò che brilla"
Anche questo significa brillare, quindi è molto
vicino al significato originale. C) "ciò che è opaco"
- Opaco è l’opposto di brillante, quindi non ha
senso in questo contesto. D) "quel che è
metallico" - Mentre l’oro è un metallo, il proverbio
non si riferisce semplicemente ai metalli ma a
ciò che appare prezioso ma non lo è.</p>
        <p>L’essenza del proverbio è che non tutto ciò che sembra prezioso o attraente (come l’oro che brilla) è effettivamente di valore. Quindi le opzioni A e B catturano entrambe questo significato.</p>
        <p>Confrontando A e B: "quel che lucida" usa
"lucida", che è un sinonimo più diretto di "luccica"
nel proverbio originale. "Lucida" significa
lucidare o brillare, quindi ha lo stesso senso. Invece,
"ciò che brilla" (B) è più generico per brillare, ma
comunque corretto nel contesto.</p>
        <p>Tuttavia, il proverbio originale italiano è
tipicamente "Non è tutto oro quel che luccica", quindi
"quel che lucida" (A) è la versione più fedele
perché "lucida" funziona come sostituto di "luccica".</p>
        <p>L’opzione E ("Nessuna delle altre risposte") non
è necessaria perché A o B sono accettabili. Ma
devo vedere quale è la scelta migliore.</p>
        <p>Analizzo: Tecnicamente, la versione corretta è
con "luccica", ma poiché non è tra le opzioni,
"lucida" è un’alternativa valida e comune.
Inoltre, molte fonti a volte usano varianti come
"brilla", ma "lucida" è più vicino in termini di
espressione.</p>
        <p>Ho sentito usare sia "luccica" che "lucida"
informalmente, quindi A mi sembra corretta. B è
anche corretta nel significato, ma non è la
formulazione tradizionale. Quindi scelgo A come
risposta.</p>
        <p>E) None of the others</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used Gemini (Google), Other, and Claude in order
to: Grammar and spelling check and Peer review simulation. After using these tool(s)/service(s), the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>2024 Conference of the North American Chap-</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>tics: Human Language Technologies</surname>
          </string-name>
          (Volume
          <volume>2</volume>
          :
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>guistics</surname>
          </string-name>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>550</fpage>
          -
          <lpage>564</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          URL: https://aclanthology.org/
          <year>2024</year>
          .naacl-short.
          <volume>46</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>doi:10</source>
          .18653/v1/
          <year>2024</year>
          .naacl-short.
          <volume>46</volume>
          . [36]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <surname>F. M.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          noli (Eds.),
          <source>Proceedings of the 10th Italian</source>
          Confer-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>ence on Computational Linguistics (CLiC-it</article-title>
          <year>2024</year>
          ),
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>CEUR Workshop Proceedings</surname>
          </string-name>
          , Pisa, Italy,
          <year>2024</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          781-
          <fpage>795</fpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .clicit-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          1.85/. [37]
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Salido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , G. Marco, None of the oth-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>tion benchmarks</article-title>
          ,
          <source>arXiv preprint arXiv:2502.12896</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          (
          <year>2025</year>
          ). [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radhakrishnan</surname>
          </string-name>
          , J. Uesato,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>arXiv:2505.05410</source>
          (
          <year>2025</year>
          ). [39]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Weber-Genzel</surname>
          </string-name>
          , P. Röttger,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>preprint arXiv:2402.14499</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>