<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Easy to Complete, Hard to Choose: Investigating LLM Performance on the ProverbIT Benchmark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Enrico Mensa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Zane</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Calogero J. Scozzaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Delsanto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Milani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele P. Radicioni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Turin</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Independent Researcher</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) have transformed computational linguistics and achieved remarkable performance across numerous natural language processing tasks, yet significant gaps persist in understanding how these systems process culturally embedded linguistic expressions. This paper introduces ProverbIT, a novel Italian benchmark comprising 100 multiple-choice questions designed to evaluate LLMs' ability to complete Italian proverbs. We assess 13 frontier models, including Large Reasoning Models (LRMs) and traditional LLMs, across three tasks: proverb completion, multiple-choice selection with correct answers, and multiple-choice selection without correct answers. Our evaluation reveals surprising results: while nearly all models demonstrate knowledge of the proverbs through successful completion tasks, performance drops dramatically when transitioning to multiple-choice formats without correct answers, with even state-of-the-art reasoning models showing substantial degradation. Through detailed Chain-of-Thought analysis of two LRMs, we uncover that models exhibit a strong bias toward selecting literal synonyms and frequently mention correct proverb endings during reasoning without successfully identifying their absence from the given options. These findings suggest that current LLMs rely heavily on memorized patterns rather than deeper semantic understanding of culturally grounded expressions, highlighting important limitations in their reasoning capabilities for figurative language comprehension.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Proverb Completion</kwd>
        <kwd>Chain-of-Thought Analysis</kwd>
        <kwd>Large Reasoning Models</kwd>
        <kwd>Benchmark Evaluation</kwd>
        <kwd>Multiple-Choice Tasks</kwd>
        <kwd>Semantic Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this paper, we investigate the ability of LLMs in handling culturally grounded linguistic expressions. By manually designing alternative endings for the proverbs, we can systematically examine the types of errors LLMs make and identify common failure patterns. Our investigation shows a striking paradox: while nearly all models possess knowledge of the proverbs in our dataset, performance deteriorates dramatically when moving from auto-completion to multiple-choice selection, with even state-of-the-art LRMs exhibiting substantial performance drops.</p>
      <p>The contribution of this paper is threefold: i) we contribute to Italian NLP benchmarks by introducing a novel dataset that addresses the under-representation of Italian in comprehensive language model evaluation resources [5]; ii) we conduct a thorough evaluation across 13 frontier models, including LLMs, LRMs, and smaller local models, providing a comprehensive performance analysis on proverb completion tasks; and iii) we investigate LRM performance through detailed Chain-of-Thought (CoT) analysis, revealing insights into reasoning strategies and cultural language understanding mechanisms in contemporary language models.</p>
      <p>The paper is organized as follows: Section 2 reviews the literature on LLM performance with idioms and proverbs; Section 3 illustrates the ProverbIT dataset, its constructive rationale and its features; Section 4 presents our evaluation of frontier LLMs on the task along with a detailed error and CoT analysis; and Section 5 summarizes the work with a final discussion and an overview of future research directions.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>Standardized benchmarks have been fundamental in evaluating the performance of LLMs across a variety of natural language processing tasks. Early efforts, such as GLUE [6] and SuperGLUE [7], were based on multi-task evaluation frameworks including tasks such as paraphrase detection, grammatical acceptability, and natural language inference. More recently, benchmarking efforts have expanded into other domains, such as mathematics [8, 9], coding [10, 11], and complex logical reasoning tasks [12, 13]. These advances reflect the increasing demand for language models capable of handling a broader range of cognitive challenges.</p>
      <p>Focusing on Italian, dedicated benchmark efforts have emerged to address language-specific issues and reduce reliance on translated tasks, which can introduce cultural bias and translation artifacts. Notable among these are CALAMITA [14], a collaborative and evolving benchmark initiative, and Evalita-LLM [15]. Both focus on tasks originally designed in Italian and include a mix of generative and multiple-choice tasks.</p>
      <p>While these benchmarks cover a broad spectrum of tasks, figurative language phenomena such as idioms and proverbs have received comparatively little attention. Idioms are a well-known source of complexity in natural language understanding [16], as their meanings often cannot be inferred literally and require contextual and cultural knowledge. Fornaciari et al. [3] introduced an expert-curated English dataset for idiom detection, showing that local LLMs struggle to distinguish idiomatic from literal usage. In the context of multilingual approaches, Tedeschi et al. [17] presented the ID10M dataset, a high-quality, automatically generated resource covering ten languages, along with a multilingual Transformer model for idiom identification. Significant differences in LLM performance across languages and figurative types were reported by Khoshtab et al. [18], who evaluated simile and idiom interpretation under various prompt strategies. Notably, CoT prompting was particularly effective for similes in smaller models. Kim et al. [19] presented a dataset of idioms in six languages, each paired with its corresponding meaning. The authors conducted a comprehensive evaluation of LLMs’ ability to process idioms, showing that models rely not only on memorization but also on a hybrid approach that integrates contextual cues and reasoning, suggesting that idiom understanding emerges from an interplay between internal knowledge retrieval and inference. Moreover, their results highlight a performance gap between high-resource and lower-resource languages.</p>
      <p>Idioms have also been an area of interest in the machine translation (MT) literature, where multiple studies have explored how models translate figurative expressions across languages. Lee et al. [20] presented TIDE, a dataset of 512 sentence pairs containing idioms in disambiguating contexts, with one sentence using the idiom literally and the other figuratively. They compared MT systems and language models, finding that the former consistently translate English idioms literally, while the latter are more context-aware, even though performance varies across target languages. One strategy to improve idiom translation, particularly in smaller language models, involves the use of knowledge bases (KBs). Li et al. [21] introduced IDIOMKB, a multilingual idiom KB developed using LLMs, designed to enhance translation quality by providing access to idioms’ figurative meanings. However, this approach does not preserve the cultural and stylistic nuances that make idioms so distinctive. To address this issue, Donthi et al. [22] proposed two alignment-based methods that aim to identify idiomatic counterparts in the target language. Their results, based on human evaluation across multiple language pairs, show improved cross-lingual idiomatic fidelity and better preservation of cultural authenticity.</p>
      <p>The work most closely related to ours is by Liu et al. [23], who focused specifically on proverbs. They introduced the MAPS dataset, designed to evaluate proverb understanding within conversational contexts across six languages. Their evaluation of multilingual LLMs revealed that while many models “know” a limited set of proverbs, memorization does not guarantee understanding or contextual reasoning. Models also struggled with figurative proverbs, particularly when asked to select incorrect answers instead of correct ones. Wang et al. [24] extended the MAPS dataset to evaluate MT models and LLMs on proverb translation. Their experiments showed that LLMs generally outperform traditional MT models, confirming their superior ability to capture idiomatic nuances.</p>
    </sec>
    <sec id="sec-2">
      <title>3. ProverbIT Dataset</title>
      <sec id="sec-2-1">
        <title>3.1. Data Collection and Dataset Creation</title>
        <sec id="sec-2-1-1">
          <title>The ProverbIT1 dataset is composed of 100 multi-choice</title>
          <p>questions, each regarding the completion of a specific
Italian proverb. To create the dataset, we started from
an initial set of 200 common Italian proverbs [25] from
which we selected 100 of the most commonly used. This
process was carried out by three of the authors, which
are all native Italian speakers. Each proverb was then
manually split into its beginning and its ending, with the
point of division determined to maintain the proverb’s
semantic coherence in the initial part while allowing
for a clear, unambiguous completion. For each proverb,
four distinct incorrect alternative endings were manually
created, leveraging the following constructive rationale:
For sake of clarity we provide an example in English for
each of the aforementioned continuations. Completions
for the proverb Actions speak... louder than words could
be:</p>
          <p>A) prouder than swords
B) at higher volume compared to speech
C) quieter than words</p>
          <p>D) when they do
As this example shows, the synonym ending is not built
on the figurative meaning of the proverb, but it is the
literal synonym of the original ending (e.g., at higher
1The full dataset can be downloaded at https://huggingface.co/
datasets/emensa/proverbIT.
understanding within conversational contexts across six volume compared to speech rather than beyond what
languages. Their evaluation of multilingual LLMs re- words can say). This design was adopted to ensure that
vealed that while many models “know” a limited set of models cannot simply rely on surface-level syntactic
patproverbs, memorization does not guarantee understand- terns but must engage in deeper semantic and contextual
ing or contextual reasoning. Models also struggled with reasoning to identify the absence of the correct
compleifgurative proverbs, particularly when asked to select in- tion.
correct answers instead of correct ones. Wang et al. [24]
extended the MAPS dataset to evaluate MT models and 3.2. Prompt
LLMs on proverb translation. Their experiments showed
that LLMs generally outperform traditional MT models, Given each proverb in ProverbIT, we can then fill a
confirming their superior ability to capture idiomatic simple prompt template that can be submitted to the
nuances. models:</p>
          <p>We specify that the proverb must be completed exactly,
and also that there are no typos in the options since
we noticed that models often assume the presence of
• A is an ending that has similar sounds to the user mistakes and modify their responses based on this
original continuation, often with an absurd/non- assumption. Since all provided endings are completely
sensical meaning. invented and thus incorrect, we expect models to always
• B is a non assonant literal synonym of the original answer E) None of the other answers. Finally we provide
ending. an Italian example [with translation] from the actual
• C is the inverse of the original proverb ending, dataset.</p>
          <p>trying to maintain the assonance when possible.
• D is a tautological/trivial ending of the proverb, Example of proverb from the dataset
with no assonance.</p>
        <p>We specify that the proverb must be completed exactly, and also that there are no typos in the options, since we noticed that models often assume the presence of user mistakes and modify their responses based on this assumption. Since all provided endings are completely invented and thus incorrect, we expect models to always answer E) None of the other answers. Finally, we provide an Italian example [with translation] from the actual dataset.</p>
        <p>Example of proverb from the dataset
A buon intenditor,... [To a wise man]
A) ...foche canore [singing seals]
B) ...zero chiacchiere [zero chatter]
C) ...molte parole [many words]
D) ...è chiaro tutto [everything is clear]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>More examples can be found in the Supplementary Materials.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption>
            <p>Models employed in the evaluation, with their full names, providers, and number of parameters.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Full name</th><th>Provider</th><th>Num. Parameters</th></tr>
            </thead>
            <tbody>
              <tr><td>Claude Sonnet 4</td><td>claude-sonnet-4</td><td>Anthropic</td><td>Undisclosed</td></tr>
              <tr><td>Claude Sonnet 4 Thinking</td><td>claude-sonnet-4-thinking</td><td>Anthropic</td><td>Undisclosed</td></tr>
              <tr><td>GPT 4o</td><td>gpt-4o</td><td>OpenAI</td><td>Undisclosed</td></tr>
              <tr><td>GPT o3</td><td>gpt-o3</td><td>OpenAI</td><td>Undisclosed</td></tr>
              <tr><td>DeepSeek V3</td><td>deepseek-chat-v3-0324</td><td>DeepSeek</td><td>671B</td></tr>
              <tr><td>DeepSeek R1</td><td>deepseek-r1-0528</td><td>DeepSeek</td><td>671B</td></tr>
              <tr><td>Gemini 2.5 Flash</td><td>gemini-2.5-flash-preview-05-20</td><td>Google</td><td>Undisclosed</td></tr>
              <tr><td>Gemini 2.5 Pro</td><td>gemini-2.5-pro-preview-06-05</td><td>Google</td><td>Undisclosed</td></tr>
              <tr><td>Qwen 3</td><td>qwen3-235b-a22b</td><td>QwQ</td><td>235B</td></tr>
              <tr><td>Grok 3</td><td>grok-3-beta</td><td>xAI</td><td>Undisclosed</td></tr>
              <tr><td>LLama 4 Maverick</td><td>llama-4-maverick</td><td>Meta</td><td>400B</td></tr>
              <tr><td>Mistral Small 3.1</td><td>mistral-small-3.1-24b-instruct</td><td>Mistral</td><td>24B</td></tr>
              <tr><td>Gemma 3</td><td>gemma-3-27b-it</td><td>Google</td><td>27B</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <p>In this Section, we describe the experimental setup developed for evaluating 13 different frontier models on the ProverbIT benchmark, followed by an error analysis and an in-depth examination of the underlying chain-of-thought processes for two LRM models.</p>
      <sec id="sec-3-0">
        <title>4.1. Experiments</title>
        <p>In addition to evaluating the models on the ProverbIT benchmark introduced in the previous Section, we also perform two ancillary tasks to assess whether the models possess knowledge of the proverbs. We refer to the ProverbIT benchmark as the base task, while the two ancillary tasks are described in the following.</p>
        <p>Completion Task. Instead of a multiple-choice approach, we ask the model to directly complete a proverb given its beginning. This task establishes whether the model is familiar with the requested proverbs. The prompt used for the completion task is as follows:</p>
        <p>Completion Prompt Template (translated)
Complete the proverb exactly:
[Proverb beginning]...
Reply with the ending only, do not add further comments.</p>
        <p>Base + true ending Task. We add to each multiple-choice question a new option that is the true ending of the proverb. By preserving the multiple-choice format but also providing the correct ending, we expect similar results w.r.t. the completion task.</p>
        <sec id="sec-3-0-1">
          <title>4.1.1. Evaluation &amp; Metrics</title>
          <p>In the base and base + true ending tasks, we computed the accuracy, defined as the ratio of correctly chosen options over the multiple choices. Specifically, each prompt was presented to each model three times and the final answer was determined through a majority vote between them. If no majority emerged across the three runs, the response was marked as incorrect.</p>
          <p>For the automatic calculation of the accuracy on the completion task, we compute an edit-distance-based similarity between the provided completion and the correct ending of the proverb (using the implementation from https://docs.python.org/3/library/difflib.html). As with the other tasks, each prompt is submitted three times. If the similarity exceeds a threshold of 0.8 in at least two out of three runs, we consider the answer correct.</p>
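          <p>The following is a minimal Python sketch of this scoring procedure, reading the 0.8 threshold as a difflib SequenceMatcher similarity ratio in [0, 1]; the helper names are ours, not part of the benchmark code.</p>
          <preformat>import difflib
from collections import Counter

def completion_is_correct(runs: list[str], gold_ending: str,
                          threshold: float = 0.8) -> bool:
    """Completion task: correct if at least two of the three sampled
    completions are similar enough to the gold ending."""
    hits = sum(
        difflib.SequenceMatcher(None, r.strip().lower(),
                                gold_ending.lower()).ratio() > threshold
        for r in runs
    )
    return hits >= 2

def majority_vote(answers: list[str]) -> str | None:
    """Multiple-choice tasks: majority vote over three runs;
    None (marked incorrect) if no option reaches two votes."""
    letter, count = Counter(a.strip().upper() for a in answers).most_common(1)[0]
    return letter if count >= 2 else None</preformat>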
          <p>For all tasks, a zero-shot prompting strategy was employed and all requests were sent separately via API, specifically using the OpenRouter unified interface [26]. For all models, the temperature was left at the default OpenRouter value of 1.0, since we countered their non-deterministic nature by employing a majority vote.</p>
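          <p>As an illustration, a single zero-shot request can be sent through OpenRouter’s OpenAI-compatible chat completions endpoint as sketched below; the model identifier format and the commented reasoning-budget field are assumptions to be checked against the current OpenRouter documentation.</p>
          <preformat>import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def query_model(model_id: str, prompt: str) -> str:
    """Send one request; each of the three runs per prompt is a separate call."""
    payload = {
        "model": model_id,  # e.g. "openai/gpt-4o" (identifier format assumed)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # OpenRouter default, left unchanged
        # For reasoning models, a thinking budget can additionally be
        # requested (e.g. a "reasoning" object with a token limit); the
        # exact field is provider-dependent and assumed here.
    }
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json=payload,
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# Three independent runs per prompt feed the majority vote:
# answers = [query_model("openai/gpt-4o", prompt) for _ in range(3)]</preformat>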
        </sec>
      </sec>
      <sec id="sec-3-1">
        <title>4.2. Models</title>
        <p>In our experiments, we employed a diverse set of state-of-the-art models, including both traditional LLMs and LRMs, aiming to cover a wide range of providers. Whenever possible, we selected both a flagship LLM and its corresponding LRM from the same organization, allowing us to directly compare their performance and assess the improvements brought by the reasoning mechanism. The complete list of models and their full names can be found in Table 1.</p>
        <sec id="sec-3-1-1">
          <title>From Anthropic, we evaluated Claude Sonnet 43 and</title>
          <p>its reasoning variant Claude Sonnet 4 Thinking. From</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>OpenAI, we included GPT 4o [27] and GPT o3.4 From</title>
          <p>DeepSeek, we employed DeepSeek V3 [28] and DeepSeek</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>R1 [29]. From Google, we tested Gemini 2.5 Flash5 and</title>
          <p>Gemini 2.5 Pro.6 We also included Qwen 3 [30], a model
optimized for reasoning developed by QwQ, Grok 37
from xAI, and LLama 4 Maverick8 from Meta. We also
included two smaller models suitable for local
deployment, as these are commonly used in privacy-sensitive
contexts and contexts that require less computational
resources. Although privacy concerns are not relevant
for the ProverbIT dataset, these models were included to
ensure comprehensive evaluation coverage. In particular
we tested Mistral Small 3.19 from Mistral and Gemma 3
[31] from Google. Regarding models specifically trained
on Italian, we preliminarily tested the Italian LLM
Minerva [32] but found that it was unable to respond
coherently, often failing to follow the requested response
format (i.e., in providing a letter corresponding to a given
choice).</p>
          <p>Given that some reasoning models require a
mandatory thinking budget while others do not, we set a
reasonable thinking budget of 2000 tokens for o3, Sonnet 4,
and Gemini 2.5 Pro, while DeepSeek R1 and Qwen 3 were
left unlimited. Moreover, the first three models output
only a summarization of their CoTs, while the latter two
provide their complete trace. This makes DeepSeek R1
and Qwen 3 ideal candidates for the CoT analysis that we
performed. We observed that only 22 out of 600 CoTs
from these two models exceeded the 2000-token limit,
half of them resulting in an incorrect answer anyway.
4.3. Results &amp; Discussion
3https://www.anthropic.com/claude/sonnet
4https://openai.com/index/o3-o4-mini-system-card/
5https://storage.googleapis.com/model-cards/documents/gemini2.5-flash-preview .pdf
6https://storage.googleapis.com/model-cards/documents/gemini2.5-pro-preview.pdf
7https://x.ai/news/grok-3
8https://ai.meta.com/blog/llama-4-multimodal-intelligence/
9https://mistral.ai/news/mistral-small-3-1
In this Section we examine the results of the evaluation
and provide a detailed discussion on the errors.</p>
        <p>In Table 2, we present the results recorded on the ProverbIT benchmark and the ancillary tasks. Models are sorted based on their performance on the ProverbIT task: such an ordering highlights a clear separation in performance between thinking and non-thinking models.</p>
        <p>By comparing the performances on the ancillary tasks and on the ProverbIT benchmark, we uncover an unexpected phenomenon: virtually all non-thinking models suffer from steep performance deterioration. For instance, GPT 4o achieves 92% on the base + true ending task but only 64% on ProverbIT. Claude Sonnet 4 loses 47 percentage points, DeepSeek V3 loses 52 percentage points, and Grok 3 drops by 69 percentage points. The most dramatic performance decline occurs with LLama 4 Maverick, which plummets from 75% and 88% on the ancillary tasks to merely 6% on ProverbIT. Notably, Mistral’s performance, given its relatively modest size (24B parameters), suggests that domain-specific optimization, through more focused Italian and broader European-language training [33], may play a significant role in enhancing model efficiency for culturally grounded tasks.</p>
          <p>LRMs are less prone to this performance drop; however,
we still observe significant deterioration of about 10-20
percentage points. These findings suggest that the
transition from pattern completion to discriminative reasoning
fundamentally challenges current language models’
understanding mechanisms. The substantial performance
gaps confirm that models rely heavily on memorized
linguistic patterns rather than genuine semantic
comprehension of proverbs. This deterioration becomes particularly
pronounced when models must evaluate and reject
plausible but incorrect alternatives, highlighting limitations in
their ability to engage in deeper cultural and contextual
reasoning. The relatively better performance of
reasoning models suggests that explicit reasoning processes can
partially compensate for these limitations, though
significant challenges remain in achieving robust figurative
language understanding.</p>
          <p>Detailed error analysis. Table 3 details the
categorization of incorrect responses as a percentage of total errors.</p>
        <p>The results reveal a strong skew toward option B, highlighting a consistent preference among the models for selecting synonyms, even if they are literal and not figurative. This pattern is less evident among local models, whose responses appear more equally distributed, possibly reflecting greater variability or reduced confidence in their outputs. The complete report of each model’s responses is provided in Table 5 in the Supplementary Materials.</p>
        <sec id="sec-3-2-1">
          <title>4.3.1. CoTs Analysis</title>
          <p>For the CoT analysis, we only take into consideration DeepSeek R1 and Qwen 3, as they are the only models that provide a full CoT trace rather than a summarization. As discussed earlier, these models were run with an unlimited thinking budget.</p>
          <p>Since we ran the ProverbIT benchmark three times in order to compute the majority vote for the accuracy, we automatically analyzed a total of 600 CoTs (300 for each model). Table 4 provides a preliminary overview of our analyses. Most prompts produced a non-empty CoT, and from our investigation we discovered two interesting facts:</p>
          <p>[Table 4: overview of the analyzed CoTs for DeepSeek R1 and Qwen 3, reporting the number of analyzed CoTs, empty CoTs, average word counts, CoTs exceeding 2000 words, and the languages used.]</p>
          <p>• Overthinking: Models occasionally exhibit overthinking behavior [34], a documented phenomenon affecting LRMs where they continuously re-evaluate their assessment of the correct answer. This results in CoTs exceeding 4,000 words in length, compared to an average of approximately 700 words for typical responses.
• Language inconsistency in CoTs: Approximately half of DeepSeek’s CoTs are generated in English while the other half appear in Italian, with occasional language switching occurring within individual reasoning traces (automatic language detection was performed via https://pypi.org/project/langdetect/). In contrast, Qwen consistently produces CoTs exclusively in English (except when citing the question options). This multilingual reasoning presents significant interpretability challenges, particularly for tasks involving idiomatic content, as cultural nuances and figurative meanings may be lost or misrepresented when reasoning shifts between languages [35, 36]. We hypothesize that this limitation stems from these models’ training distribution, which prioritizes Chinese and English content with comparatively limited Italian language exposure.</p>
          <p>We analyzed the non-empty CoTs by tracing mentions of correct and incorrect answers within the thinking process. We examined separately the cases where the model responds correctly versus incorrectly. Specifically, the left subfigures of Figures 1 and 2 show the absolute number of mentions of the correct answer (which is always E, None of the others) and of all incorrect answers when the model answers correctly. Conversely, the right subfigures show the absolute number of mentions of the correct answer and of the specific incorrect answer provided when the model responds incorrectly. We additionally plot as a dotted line the absolute number of mentions of the correct completion of the proverb (which was not given in the prompt).</p>
          <p>[Figure 1: absolute number of mentions across the CoT for DeepSeek R1. Left panel: when DeepSeek R1 answers correctly (213 instances), mentions of any of A/B/C/D, mentions of the answer (E), and mentions of the correct proverb ending. Right panel: when DeepSeek R1 answers incorrectly (87 instances), mentions of the answer (either A or B or C or D), mentions of E, and mentions of the correct proverb ending. The y-axis reports the absolute number of mentions.]</p>
          <p>These graphs reveal that both models continuously mention all possible answers throughout their reasoning process, while the spikes toward the end indicate that models reach a decision only in the final lines of their CoTs. However, this decision-making appears tentative, as alternative options remain heavily mentioned alongside the chosen answer, suggesting low confidence in the final selection.</p>
          <p>The dotted lines clearly demonstrate that models are aware of the correct proverb ending and repeatedly reference it throughout their thinking process. These observations highlight a critical disconnect: while models can successfully recall the correct proverb completion, they fail to recognize its absence among the provided choices. This suggests that the challenge lies not in knowledge retrieval but in the discriminative reasoning required to identify when the correct answer is unavailable, revealing fundamental limitations in how current LRMs handle negative reasoning tasks [37].</p>
          <p>Inconsistency between CoTs and answers. As a final finding, we also discovered that model responses were sometimes inconsistent with their corresponding CoT. For instance, out of the 113 incorrect responses from Qwen, 14 are inconsistent, ending with sentences like “The answer is X”, while the actual given answer was not X. Remarkably, in all of these instances, adhering to the CoT-delivered conclusion would have resulted in a correct answer. Similarly, for DeepSeek R1, 6 of the 87 incorrect responses exhibited such a discrepancy, 5 of which (5.7%) would have led to the correct answer. This behavior has been observed in prior work [38].</p>
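          <p>A heuristic check for such CoT/answer mismatches is sketched below; the regular expression is an illustrative assumption, not the exact extraction rule we applied.</p>
          <preformat>import re

def cot_conclusion(cot: str) -> str | None:
    """Heuristically extract the last option letter the CoT commits to,
    e.g. from closing sentences like 'The answer is C.' or 'Quindi scelgo A'."""
    letters = re.findall(r"\b([A-E])\b", cot[-300:])  # look near the end
    return letters[-1] if letters else None

def is_inconsistent(cot: str, final_answer: str) -> bool:
    """Flag responses whose final answer differs from the CoT conclusion."""
    concluded = cot_conclusion(cot)
    return concluded is not None and concluded != final_answer.strip().upper()[:1]</preformat>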
          <p>In the Supplementary Materials, we report two complete CoTs: one showing how reasoning in English leads to a wrong answer, and one showing a mismatch between the CoT conclusion and the final answer.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions &amp; Future Work</title>
      <p>In this work, we introduced ProverbIT, a novel Italian benchmark designed to evaluate Large Language Models’ ability to handle culturally grounded linguistic expressions through proverb completion tasks. Our comprehensive evaluation of 13 frontier models, including both Large Reasoning Models and traditional LLMs, provides significant insights into the limitations of current language understanding systems.</p>
      <p>Our findings demonstrate a relevant gap between models’ knowledge of proverbs and their ability to apply this knowledge in discriminative reasoning tasks. While nearly all evaluated models successfully complete proverbs when presented with direct completion prompts, performance drops dramatically when the same task is reformulated as multiple-choice selection without correct answers available. Even state-of-the-art reasoning models like GPT o3 and Gemini 2.5 Pro experience substantial degradation.</p>
      <p>The Chain-of-Thought analysis of DeepSeek R1 and Qwen 3 further highlights this limitation: both models frequently mention correct proverb endings during their reasoning process yet fail to recognize their absence from the provided options, highlighting fundamental challenges in negative reasoning capabilities. Moreover, we uncovered concerning inconsistencies in reasoning model behavior, including overthinking, language switching during reasoning, and discrepancies between CoT conclusions and final answers.</p>
      <p>Future work should focus on investigating this mismatch between knowledge retrieval and discriminative reasoning more deeply, particularly examining how models handle negative reasoning tasks even in seemingly trivial scenarios where the correct answer is absent from the given options. Additional evaluation methodologies should also be incorporated, including answer randomization techniques as proposed in the literature [39].</p>
      <p>In summary, our results underscore the critical importance of developing language-specific benchmarks that capture cultural and linguistic nuances often lost in English-centric evaluations, showing that current LLMs rely heavily on memorized patterns rather than deeper semantic understanding of culturally grounded expressions.</p>
    </sec>
    <sec id="sec-4">
      <title>Supplementary Materials</title>
      <sec id="sec-4-1">
        <title>Proverb Examples</title>
        <p>In the following, we report a few examples in Italian from
the ProverbIT benchmark. Each example presents the
beginning of a proverb followed by five possible
continuations. Only one option is correct, which is E) Nessuna
delle altre risposte (None of the other answers).</p>
        <p>Example 1
Non è tutto oro... [Not all is gold...]
A) ...quel che lucida [that which polishes]
B) ...ciò che brilla [that which shines]
C) ...ciò che è opaco [that which is dull]
D) ...quel che è metallico [that which is metallic]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 2
Vivi e... [Live and...]
A) ...lascia ridere [let laugh]
B) ...fai decorrere [let elapse]
C) ...giudica tutti [judge everyone]
D) ...morirai più tardi [you will die later]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 3
Il gioco è bello quando... [A game is fun when...]
A) ...cura dopo [it cures afterwards]
B) ...è breve [it is brief]
C) ...dura molto [it lasts long]
D) ...tutti si divertono [everyone has fun]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 4
A caval donato... [Of a gift horse...]
A) ...non si guarda la brocca [one does not look at the jug]
B) ...non si ispezionan le fauci [one does not inspect the jaws]
C) ...si guarda in bocca [one looks in the mouth]
D) ...si mette la sella [one puts on the saddle]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 5
Chi non beve in compagnia... [Who does not drink in company...]
A) ...o è un lardo o è una spa [is either a lard or a spa]
B) ...o è un delinquente o è uno spione [is either a criminal or a snitch]
C) ...non è un ladro né una spia [is neither a thief nor a spy]
D) ...guida la macchina [drives the car]
E) Nessuna delle altre risposte [None of the other answers]</p>
        <p>Example 6
Chi dorme... [Who sleeps...]
A) ...non piglia cesti [catches no baskets]
B) ...non acchiappa trote [catches no trout]
C) ...piglia pesci [catches fish]
D) ...non è sveglio [is not awake]
E) Nessuna delle altre risposte [None of the other answers]</p>
      </sec>
      <sec id="sec-4-2">
        <title>Complete CoTs</title>
        <p>In the following, we report two examples of CoT. In the
ifrst, the use of English leads to an incorrect answer,
while in the second, an answer mismatch is observed:
although the CoT concludes with option C, the model
selects option B.</p>
        <p>CoT of Qwen
The use of English leads to a wrong answer
Okay, let’s see. The user wants me to complete
the proverb "Lontano dagli occhi,..." by choosing
the correct option from A to E. The options are:
A) ...lontano dalle suore, B) ...distante dal
miocardio, C) ...vicino al cuore, D) ...si vede poco, E)
Nessuna delle altre risposte.</p>
        <p>First, I need to recall the actual Italian proverb.</p>
        <p>The common one I know is "Lontano dagli occhi,
vicino al cuore," which translates to "Out of sight,
out of mind" but literally means "Far from the
eyes, close to the heart." Wait, that doesn’t make
sense in terms of translation, but the structure
is similar to English proverbs where the second
part contrasts or complements the first. So the
original proverb should have a contrasting part.</p>
        <p>In Italian, "vicino al cuore" is the correct
completion, meaning that even if someone is far from
your eyes (not seen), they are still near your
heart (in your thoughts or loved).</p>
        <p>Now, checking the options. Option C is "...vicino
al cuore," which matches what I know. The other
options: A) "lontano dalle suore" (far from the
nuns) - doesn’t make sense in this context. B)
"distante dal miocardio" (distant from the
myocardium) - that’s a medical term, not a proverb.</p>
        <p>D) "si vede poco" (you see little) - logical but
not the traditional saying. E) would only be
correct if none of the options are right, but since
C seems correct, E isn’t the choice here. So the
answer must be C.</p>
        <p>C.</p>
        <p>CoT of DeepSeek
Answer Mismatch w.r.t. the CoT
Okay, ho bisogno di completare correttamente
il proverbio "Non è tutto oro..." scegliendo tra
le opzioni date. Le opzioni sono: A) ...quel che
lucida B) ...ciò che brilla C) ...ciò che è opaco
D) ...quel che è metallico E) Nessuna delle altre
risposte
Prima, mi ricordo come va il proverbio completo.</p>
        <p>So che il proverbio classico è "Non è tutto oro quel che luccica". Tuttavia, nessuna delle opzioni offre esattamente "luccica".</p>
        <p>Guardando le opzioni: A) "quel che lucida" -
"Lucida" è simile a "luccica", dato che entrambi
significano brillare o splendere. B) "ciò che brilla"
Anche questo significa brillare, quindi è molto
vicino al significato originale. C) "ciò che è opaco"
- Opaco è l’opposto di brillante, quindi non ha
senso in questo contesto. D) "quel che è
metallico" - Mentre l’oro è un metallo, il proverbio
non si riferisce semplicemente ai metalli ma a
ciò che appare prezioso ma non lo è.</p>
        <p>L’essenza del proverbio è che non tutto ciò che sembra prezioso o attraente (come l’oro che brilla) è effettivamente di valore. Quindi le opzioni A e B catturano entrambe questo significato.</p>
        <p>Confrontando A e B: "quel che lucida" usa
"lucida", che è un sinonimo più diretto di "luccica"
nel proverbio originale. "Lucida" significa
lucidare o brillare, quindi ha lo stesso senso. Invece,
"ciò che brilla" (B) è più generico per brillare, ma
comunque corretto nel contesto.</p>
        <p>Tuttavia, il proverbio originale italiano è
tipicamente "Non è tutto oro quel che luccica", quindi
"quel che lucida" (A) è la versione più fedele
perché "lucida" funziona come sostituto di "luccica".</p>
        <p>L’opzione E ("Nessuna delle altre risposte") non
è necessaria perché A o B sono accettabili. Ma
devo vedere quale è la scelta migliore.</p>
        <p>Analizzo: Tecnicamente, la versione corretta è
con "luccica", ma poiché non è tra le opzioni,
"lucida" è un’alternativa valida e comune.
Inoltre, molte fonti a volte usano varianti come
"brilla", ma "lucida" è più vicino in termini di
espressione.</p>
        <p>Ho sentito usare sia "luccica" che "lucida"
informalmente, quindi A mi sembra corretta. B è
anche corretta nel significato, ma non è la
formulazione tradizionale. Quindi scelgo A come
risposta.</p>
        <p>E) None of the others</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used Gemini (Google), Other, and Claude in order
to: Grammar and spelling check and Peer review simulation. After using these tool(s)/service(s), the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>2024 Conference of the North American Chap-</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>tics: Human Language Technologies</surname>
          </string-name>
          (Volume
          <volume>2</volume>
          :
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>guistics</surname>
          </string-name>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>550</fpage>
          -
          <lpage>564</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          URL: https://aclanthology.org/
          <year>2024</year>
          .naacl-short.
          <volume>46</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>doi:10</source>
          .18653/v1/
          <year>2024</year>
          .naacl-short.
          <volume>46</volume>
          . [36]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <surname>F. M.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          noli (Eds.),
          <source>Proceedings of the 10th Italian</source>
          Confer-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>ence on Computational Linguistics (CLiC-it</article-title>
          <year>2024</year>
          ),
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>CEUR Workshop Proceedings</surname>
          </string-name>
          , Pisa, Italy,
          <year>2024</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          781-
          <fpage>795</fpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .clicit-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          1.85/. [37]
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Salido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , G. Marco, None of the oth-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>tion benchmarks</article-title>
          ,
          <source>arXiv preprint arXiv:2502.12896</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          (
          <year>2025</year>
          ). [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Benton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radhakrishnan</surname>
          </string-name>
          , J. Uesato,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>arXiv:2505.05410</source>
          (
          <year>2025</year>
          ). [39]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Weber-Genzel</surname>
          </string-name>
          , P. Röttger,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>preprint arXiv:2402.14499</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>