<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language Models and the Magic of Metaphor: A Comparative Evaluation with Human Judgments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone Mazzoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alice Suozzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca E. Lebani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Centre for Living Technology (ECLT)</institution>
          ,
          <addr-line>Ca' Bottacin, Dorsoduro 3911, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>QuaCLing Lab, Dipartimento di Studi Linguistici e Culturali Comparati, Università Ca' Foscari Venezia</institution>
          ,
          <addr-line>Dorsoduro 1075, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study evaluates whether Italian-trained Large Language Models (LLMs) can interpret metaphors by comparing their performance to both human judgments and human-produced interpretations. Using three datasets containing metaphors, human interpretations, and implausible alternatives, we assess model performance via log-likelihood scores. Results show that LLMs partially replicate human understanding and are influenced by expression conventionality and linguistic context.</p>
      </abstract>
      <kwd-group>
        <kwd>Metaphor Interpretation</kwd>
        <kwd>Linguistic Evaluation</kwd>
        <kwd>Benchmark</kwd>
        <kwd>Italian</kwd>
        <kwd>Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy. Corresponding author: Simone.mazzoli@unive.it (S. Mazzoli); Alice.suozzi@unive.it (A. Suozzi); Gianluca.lebani@unive.it (G. E. Lebani). https://www.unive.it/data/persone/29007635 (S. Mazzoli); https://www.unive.it/data/persone/24102251 (A. Suozzi); https://www.unive.it/data/persone/21257857 (G. E. Lebani). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Metaphor is counted among the violations of the principle of compositionality, according to which the meaning of a linguistic expression can be determined based on the meaning of its individual parts and their syntactic structure [1]. It is configured as a syntactically well-formed sentence that is semantically incongruent when interpreted literally, based on the lexically-encoded meanings of its components. Its definitions have undergone numerous variations, ranging from the idea of simple lexical substitution of a literal term to that of a constitutive principle of the human conceptual system [2]. This is because, although there is general agreement that an interaction occurs between the two concepts evoked by the metaphor in determining the meaning of the metaphorical expression, a comprehensive formalization of the nature of this interaction has yet to be achieved. In fact, understanding metaphors requires the integration of linguistic, contextual, and cultural knowledge, thus representing a challenge not only for humans but also for Large Language Models (LLMs).</p>
      <p>LLMs have seen significant growth in recent years, demonstrating excellent performance across a wide range of interpretation and language production tasks. Their ability to understand and generate textual information has revolutionized many areas of natural language processing and numerous other fields. Since their introduction, a central question has been whether these models construct plausible representations of meaning or merely memorize patterns of form [3], as captured by the well-known stochastic parrots metaphor [4]. Given their success, there has been growing interest in the development of LLMs optimized for contexts in which languages other than English are predominant. Although multilingual models or those primarily trained on English are capable of processing and generating text in Italian, they are often considered less capable of capturing the nuances and specific characteristics of the language [5]. The recent introduction of LLMs trained from scratch on Italian data, together with models subsequently adapted through optimization processes for a specific language, makes it particularly interesting to verify whether their ability to understand metaphors can approach that of humans.</p>
      <p>In light of this, this study aims to examine the extent to which interpretations and related inferences produced by humans in response to metaphorical stimuli are favored by LLMs, as opposed to implausible interpretations that are either meaningless or convey the opposite of the intended meaning. A systematic preference for human-generated interpretations would suggest that the semantic representations of LLMs are sufficiently robust to produce accurate interpretations and replicate human inferential processes. More broadly, this would imply that the distributional information in text, which underpins the internal representations of these models [6], is sufficient to construct a semantic and common-sense knowledge framework capable of generating valid inferences about figurative language.</p>
      <p>Another promising line of research at the intersection of psycholinguistics and computational linguistics explores the cognitive plausibility of LLMs, that is, the extent to which metrics derived from these models can predict human performance on cognitive tasks. This project takes a step in that direction by collecting human judgments on the conventionality of linguistic stimuli and the adequacy of sentence-level context for comprehending expressions. It then investigates the correlation between these human ratings and LLM performance, with the aim of evaluating the models' sensitivity to such aspects.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Metaphor interpretation tasks can be grouped into three
categories [7]: property extraction, word-level
paraphrasing, and explanation matching. Property
extraction involves identifying shared attributes between the
metaphor’s Topic and Vehicle (e.g., Love is a tide → Love
is unstoppable), inspired by comparison-based theories
such as the Salience Imbalance Theory [8, 9] and the
Career of Metaphor Theory [10]. Word-level paraphrasing
replaces the metaphorical term with a literal counterpart
(e.g., She devoured the novels → She read the novels very
quickly), though this is limited when metaphors include
multiple figurative terms or when idioms are involved.
Explanation matching pairs metaphors with
dictionary-like glosses (e.g., A red-letter day → A day of significance),
but struggles with extended metaphors.</p>
      <p>Previous works have leveraged such tasks to assess the
models' ability to interpret metaphors. This project
fits within current research efforts aimed at testing the
semantic capabilities of large language models in
processing metaphors, combining several innovative aspects
inspired by the following studies.</p>
      <p>Pedinotti et al. [11] tested BERT on a dataset of 100
metaphors across four syntactic types. BERT
successfully distinguished between metaphorical, literal, and
nonsensical variants based on pseudo-log-likelihood.
Embedding analysis showed alignment with metaphorical
senses, suggesting that BERT encodes metaphor-relevant
features. Following this example in the organization
of stimuli, the present study ensures that metaphorical
expressions are balanced across fine-grained syntactic
groups. This design choice addresses an often overlooked
aspect in related work, which tends to rely on examples
with limited structural variation or narrow contextual
constraints. Furthermore, as in the aforementioned work,
the stimuli and the models tested are in Italian, ofering
a perspective on metaphor that difers from the more
commonly adopted anglocentric approach.</p>
      <p>Tong et al. [12] developed the MUNCH dataset, which
included 10,000 metaphorical sentence paraphrases and
1,500 triplets (metaphor, correct paraphrase, incorrect
paraphrase). They proposed two tasks: paraphrase
selection and paraphrase generation. GPT-3.5 outperformed
other models but often diverged from human responses,
highlighting challenges in capturing metaphorical
nuance. A notable strength of this work was its attempt to
accommodate the presence of multiple correct responses
produced by humans, which served as an effective
strategy to address the variability and intrinsic originality of
linguistic expression. Similarly, the present study aims
to reflect, as much as possible, the originality of speakers
in generating the stimuli on which models are evaluated.
To this end, multiple correct interpretations are collected
and systematically compared against incorrect ones, so
that subjectivity and individuality in metaphor
interpretation are explicitly taken into account. Moreover,
particular attention was paid to the ecological validity of
the stimuli: metaphorical expressions were directly
extracted from a linguistic corpus, with minimal alterations
to the original excerpts. Correct interpretations used for
evaluation were produced by human annotators.</p>
      <p>With a more explicit focus on the relationship between
metaphor and its interpretation, Liu et al. [13] introduced
Fig-QA, a Winograd-style task that requires models to
pair metaphoric expressions with their appropriate
literal reformulations. Incorrect pairings may involve
either mismatched metaphors or literal paraphrases that
convey the opposite meaning of the original metaphor.
GPT-3 performed best in zero-shot settings, though still
below human level. Fine-tuned models like RoBERTa
approached human accuracy, particularly when inferring
literal meaning from figurative language. In Liu et al.’s
setup, choosing the correct metaphor-meaning pair was
equivalent to assigning a higher probability to that pair,
which is the same principle used in the present study.
Each metaphor in their dataset was paired with both
a correct and an opposing interpretation, forming the
positive and negative instances, respectively. Similarly,
in this study, a distinction is drawn between plausible
interpretations, which are formulated by humans, and
implausible ones, represented by two distractors
carefully constructed according to two distinct semantic rules.
This approach prevents inflated accuracy due to models
consistently rejecting only one type of distractor, thus
supporting a more balanced and accurate assessment of
their interpretative abilities.</p>
    </sec>
    <sec id="sec-4">
      <title>3. The Magic of Metaphor: our Study</title>
      <sec id="sec-4-1">
        <title>3.1. Dataset</title>
        <p>As previously mentioned, the linguistic data used in this study include metaphors, human-generated interpretations and ratings, as well as strings functioning as distractors. The following section describes the methods employed for data collection.</p>
        <sec id="sec-4-1-1">
          <title>3.1.1. Metaphors</title>
          <p>The metaphors included in the dataset were manually extracted from the official records of the Italian Parliament, specifically from debates in the Chamber of Deputies during the 16th, 17th, and 18th legislatures (covering a time span from 2008 to 2022)<sup>1</sup>. These records, consisting of stenographic transcripts and committee summaries, were consulted to identify metaphorical expressions, with only minimal edits<sup>2</sup>. Selected text segments include variable amounts of syntactic context (e.g., coordination and subordination) to preserve interpretability of the metaphor.</p>
          <p>A political discourse corpus was selected over literary or general-purpose corpora for two main reasons. First, although poetic texts contain rich and frequent figurative language, poetic metaphors often involve extended networks of interrelated expressions, making them hard to isolate for individual analysis. In contrast, metaphors in political language are typically employed to emphasize conceptual content and are more concise due to the oral nature of parliamentary discourse. These characteristics make them easier to isolate, interpret, and analyze without compromising semantic coherence.</p>
          <p>Second, political speech allows for more efficient metaphor identification and clearer estimation of figurative-to-literal usage ratios. For example, the word scheletro 'skeleton' is more likely to appear figuratively (e.g., scheletro normativo) in political language than in medical contexts, where it retains a purely literal meaning. A specialized corpus thus offers a clearer view of metaphor usage patterns than a general corpus, where both uses may be equally distributed.</p>
          <p>Metaphors were annotated using the Metaphor Identification Procedure (MIP) by the Pragglejaz Group [14]. MIP operates at the word level and requires annotators to compare the contextual meaning of a lexical unit with a more basic, concrete, and historically prior meaning. A word is tagged as metaphorical if its contextual meaning contrasts with its basic meaning but can still be understood via it.</p>
          <p>To ensure syntactic and lexical variety, the dataset was balanced across seven groups, defined by three key variables, as detailed in Table 1: (1) pattern, or the syntactic relation between the metaphorical term and its context marker; (2) valency, or the number of syntactic arguments of the metaphorical verb; and (3) metaphorical element class, indicating whether the metaphor is expressed by a noun, verb, or adjective. Subscript indices were used to distinguish items when two elements shared the same lexical class. An example of a metaphor from each group is provided in Table 7 in Appendix A.</p>
          <p>Table 1: Balanced groups in the metaphor dataset. Columns: Pattern; Valency; Metaphorical Element (PoS); Group Size (n = 140). Rows: 1 di 2 / None / Noun1 / 20; ∼ Adj / None / Noun / 20; ∼ Adj / None / Adjective / 20; 1 = 2 / None / Noun2 / 20; ∼ / Intransitive / Verb / 20; ∼ / Transitive / Verb / 20; ∼ / Transitive / Verb and Noun / 20.</p>
          <p>The final dataset contains 140 metaphorical items, systematically balanced across syntactic patterns, valency, and lexical class of the metaphorical term, thereby offering a robust foundation for experimental and computational studies on metaphor interpretation.</p>
          <p><sup>1</sup>Official records consulted from the website of the Italian Chamber of Deputies: https://www.camera.it/leg18/221. <sup>2</sup>The metaphor collection process involved using a database search tool to identify lexical units in parliamentary debates by querying word roots. Each occurrence whose metaphorical nature was confirmed was subsequently added to our database.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>3.1.2. Human Interpretations and Ratings</title>
          <p>We collected metaphor interpretations through a questionnaire structured into four sections: informed consent, demographic data, completion instructions (in both video and text format) and the experimental section containing the metaphors. Each questionnaire included 14 metaphors, two for each balancing group, presented in random order. A total of 10 different questionnaires were created to cover the dataset of 140 metaphors.</p>
          <p>Participants were presented with sentence prompts that followed a fixed syntactic structure and pragmatic function, deliberately designed by the researchers to ensure consistency and reduce interpretive bias stemming from linguistic variation (see Tab. 8 in Appendix A). For each metaphor, participants were asked to write one or more possible completions based on the provided standardized sentence frame. The layout of the questionnaire as viewed by the participants is provided in Appendix B. A total of 121 Italian-speaking adults (mean age 32.8 years, SD 13.6) participated in the experiment. Only one participant reported a different native language, and their responses were excluded from the analysis.</p>
          <p>The responses were corrected for grammatical consistency where necessary, including verb agreement, merging of prepositions and articles, and the addition of copulas. Grammatically incorrect interpretations were discarded. In total, 2,540 interpretations were collected, of which 2,117 were unique<sup>3</sup>. The distribution of interpretations per metaphor was described using descriptive statistics: mean (18.14), median (17), standard deviation (4.57), minimum (10) and maximum (31).</p>
          <p><sup>3</sup>This means that 0.83% of all collected interpretations consist of duplicates, that is, identical interpretations provided by different participants in response to metaphors that tend to elicit higher agreement.</p>
          <p>In addition, the conventionality of each metaphor was evaluated on a scale of 1 to 5, rating how frequently the participant hears the expression used with the same meaning. The adequacy of the context was also evaluated on the same scale, measuring whether the provided sentence context was sufficient for understanding the metaphor.</p>
          <p>The rating collection described above allowed us to obtain an average conventionality score for each metaphor. This score reflects the degree of conventionality or novelty of the metaphor perceived by the participants. To illustrate, we report one metaphor rated as novel (e.g., (1), with an average score of 2.40) and one rated as conventional (e.g., (2), with an average score of 4.86):</p>
          <p>(1) La Repubblica italiana con questo Governo sta diventando lo zampirone per l'impresa. 'The Italian Republic, with this Government, is becoming like a mosquito coil for businesses.'</p>
          <p>(2) È un dramma determinato a sua volta dall'esplosione demografica dell'Africa subsahariana. 'It is a crisis caused in turn by the demographic explosion in sub-Saharan Africa.'</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>3.1.3. Distractors</title>
          <p>To create implausible interpretations for the collected metaphors (i.e., distractors), inspiration was drawn from the APL Medea test [15], a standardized tool designed to assess pragmatic skills in children aged 5 to 14. One of its subtests presents a figurative metaphor, and the child must choose the image that best represents it among one correct and three distractors. These include a literal interpretation, a semantically related image, and one showing elements of the sentence without integrating them meaningfully.</p>
          <p>In this study, a similar approach was used: two distractors were created for each of the 140 metaphors, totaling 280 distractors. They were based on alternative completions of the sentences presented to human participants (see Tab. 8), following two specific criteria: (i) Literal Distractors (LD) are plausible only if the metaphorical word is taken literally. For instance:</p>
          <p>(3) Dei numeri aridi sono dei numeri che sono privi di umidità. 'Dry numbers are numbers that are devoid of moisture.'</p>
          <p>(4) Dicendo elefante burocratico si intende qualcosa che ha una lunga proboscide come un elefante. 'By saying bureaucratic elephant, one means something that has a long trunk, like an elephant.'</p>
          <p>These distractors use predicates or attributes that belong solely to the metaphor's Vehicle and not the intended Topic. (ii) Opposite Metaphorical Distractors (OMD) express the opposite meaning of the most frequently given human interpretation. For example:</p>
          <p>(5) Si intende che il risultato è molto importante come una briciola. 'It is meant that the result is very important, like a crumb.'</p>
          <p>(6) Dicendo cassaforte di eccellenze si intende qualcosa che contiene cose di poco valore come una cassaforte. 'By saying safe of excellences, it is meant something that contains things of little value, like a safe.'</p>
          <p>In (5), molto importante contradicts the typical interpretation of briciola (small, insignificant). Similarly, in (6), cose di poco valore is the opposite of preziose, which was the dominant human interpretation of the metaphorical cassaforte.</p>
        </sec>
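        <p>The per-metaphor descriptive statistics reported in Section 3.1.2 (mean, median, standard deviation, minimum and maximum of the number of interpretations per metaphor) can be recomputed with Python's standard library. A minimal sketch follows; the counts and the helper name describe are invented for illustration and are not the authors' code:</p>

```python
import statistics

def describe(counts):
    """Summarize how many interpretations each metaphor received."""
    # counts: one integer per metaphor (interpretations collected for it)
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "sd": statistics.stdev(counts),  # sample standard deviation
        "min": min(counts),
        "max": max(counts),
    }

# Invented counts for four metaphors, only to illustrate the call;
# the paper reports mean 18.14, median 17, sd 4.57, range 10-31.
stats = describe([10, 17, 20, 31])
```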
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Models</title>
        <p>We evaluated six autoregressive models based on three different architectures (LLaMA, GPT-2, Mistral), trained on Italian data using two distinct approaches: language adaptation of an existing model (LLaMAntino-2-7b [16]) and training from scratch (GePpeTto [17] and Minerva [18]). Information about the models' architectures can be found in Table 2, while their training data are summarized in Table 3.</p>
        <p>We also include several baselines for comparison. The first baseline is the accuracy level expected from random selection among interpretations (0.33). Additionally, we test two simple models based on input string length: Longest String, which always selects the interpretation with the highest number of characters, and Shortest String, which chooses the interpretation with the fewest characters. Furthermore, we adopted a model based on the Gulpease index, a readability metric designed to assess the complexity of Italian texts. The index considers the number of sentences, letters, and words in a given text segment [19]. This model consistently selects the interpretation with the highest Gulpease score.</p>
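        <p>The three string-based baselines can be sketched in a few lines of Python. This is a hypothetical re-implementation, not the authors' code: the Gulpease formula (89 + (300 × sentences − 10 × letters) / words) is the standard one, but the sentence and word counting below is deliberately naive:</p>

```python
import re

def gulpease(text: str) -> float:
    """Gulpease readability index for Italian text (higher = easier)."""
    words = re.findall(r"\w+", text)          # naive word tokenizer
    letters = sum(len(w) for w in words)      # letters inside words only
    sentences = len(re.findall(r"[.!?]+", text)) or 1
    return 89 + (300 * sentences - 10 * letters) / max(1, len(words))

def longest_string(candidates):
    """Baseline: always pick the candidate with the most characters."""
    return max(candidates, key=len)

def shortest_string(candidates):
    """Baseline: always pick the candidate with the fewest characters."""
    return min(candidates, key=len)

def gulpease_baseline(candidates):
    """Baseline: pick the candidate with the highest Gulpease score."""
    return max(candidates, key=gulpease)
```

Each baseline receives the three candidate strings (human interpretation and the two distractors) and returns its choice, mirroring how the accuracy of the length- and readability-based strategies is computed.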
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Data analysis</title>
        <p>This study uses log-likelihood as a measure comparable to human preference, already employed in studies on grammaticality and semantic plausibility judgments [20, 21, 22], assuming that a model capable of understanding metaphorical expressions assigns a higher probability to human-generated interpretations than to the two distractors. Autoregressive language models define a probability distribution over subsequent tokens conditioned on the sequence of prior tokens. Consequently, the probability of an entire sentence can be obtained by computing the product of the conditional probabilities of each token at its respective time step:</p>
        <p>P̃(w1 … wn) = P(w1) · ∏ (t = 2 … n) P(wt | w1 … wt−1)   (1)</p>
        <p>We consider a metaphor m from the dataset of 140 metaphors, a set of interpretations of m produced by participants denoted as I, a literal distractor LD, and an opposite metaphorical distractor OMD. For each interpretation i belonging to I, the log-likelihoods of the strings i*, LD*, and OMD* are extracted, where * indicates that the metaphor is concatenated before each string<sup>4</sup>. Accuracy is calculated by taking the ratio of the number of cases in which the string i* receives a log-likelihood greater than or equal to the highest probability among the two distractors, and the cardinality of I:</p>
        <p>ACC = (1 / |I|) · Σ (i ∈ I) 1{ P̃(i*) ≥ max[ P̃(LD*), P̃(OMD*) ] }   (2)</p>
        <p>where 1(x) = 1 if x is true, and 0 otherwise. The comparison among the three strings, as illustrated by Equation 2, was therefore carried out for all interpretations provided by human participants along with their corresponding distractors.</p>
        <p><sup>4</sup>The existence of a significant difference between the proportions of strings (interpretations and distractors) preferred by the models, comparing the two conditions, presented in isolation versus preceded by the metaphor, was confirmed through chi-square tests, demonstrating the effectiveness of this manipulation and ensuring the soundness of the experimental paradigm.</p>
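        <p>The chain-rule sentence probability and the accuracy criterion described in this section can be made concrete in a short sketch. The toy conditional model and the function names below are illustrative assumptions only; in the study the log-likelihoods come from the evaluated LLMs:</p>

```python
import math

def sentence_log_prob(tokens, cond_log_prob):
    """Chain rule in log space: log P(w1..wn) = sum over t of
    log P(w_t | w_1 .. w_{t-1}); t = 0 gives the unconditioned P(w1)."""
    return sum(cond_log_prob(tokens[:t], tokens[t]) for t in range(len(tokens)))

def accuracy(triples):
    """Fraction of human interpretations whose log-likelihood is >= the
    best of the two distractors. Each triple holds the log-likelihoods
    (ll_interpretation, ll_LD, ll_OMD) for one interpretation in I,
    all scored with the metaphor prepended."""
    hits = sum(1 for ll_i, ll_ld, ll_omd in triples if ll_i >= max(ll_ld, ll_omd))
    return hits / len(triples)

# Toy model: every token equally likely over a 10-word vocabulary.
uniform = lambda prefix, token: math.log(1 / 10)
```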
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Results</title>
        <p>We report in Table 4 the accuracy values achieved by the models<sup>5</sup>, highlighting an improvement for the larger models, with the exception of LLaMAntino-2-7b, which achieves higher accuracy only compared to GePpeTto.</p>
        <p>A chi-square test revealed that all models exhibit distributions that are significantly different from those expected for the four baselines. As shown in Figure 1, there is a trend within the Minerva family models to disfavor OMDs, and this trend is directly proportional to the size of the model. This makes it necessary to test whether, in cases where this type of distractor does not receive a higher probability, the choice between human interpretations and LDs is due to chance or to one of the simple strategies represented by the baselines.</p>
        <p>To analyze this hypothesis, an additional chi-square test was conducted, excluding OMDs from the observations. The results allow us to reject the hypothesis that Minerva-350M randomly chooses between human interpretations and LDs (χ²(1) = 14.618, p &lt; .001); however, this is not possible for any other model in the same family. The same hypothesis can also be rejected for LLaMAntino-2-7b (χ²(1) = 11.132, p &lt; .001) and for GePpeTto (χ²(1) = 4.713, p &lt; .05). Yet, only for Minerva-350M and GePpeTto is it true that human interpretations are non-randomly favored, whereas LLaMAntino-2-7b, in contrast, shows a stronger preference for LDs.</p>
        <p>In addition to the inability to reject the hypothesis of random choice between human interpretations and LDs, for Minerva-7B it was not possible to reject the hypothesis that the model always chooses the longer string between LDs and OMDs. The opposite is true for the smaller Minerva-3B model, whose results differ significantly from the expected distribution of preferences between the two distractors if it follows the "longer string" strategy (χ²(1) = 18.833, p &lt; .001).</p>
        <p><sup>5</sup>An additional metric, weighted accuracy, was computed using the full set of 2,540 interpretations, including repeated responses from multiple participants. This metric captures the model's ability to assign higher probabilities to more frequently produced interpretations. Weighted accuracy increased by 0.02 points for all LLMs except GePpeTto, which improved by 0.01, suggesting that retaining repeated interpretations has minimal impact on model comparisons.</p>
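        <p>Goodness-of-fit tests of the kind reported above can be reproduced with Pearson's chi-square statistic. A minimal sketch with invented counts follows (the counts are not from the study; 3.841 is the standard critical value for df = 1 at α = .05):</p>

```python
def chi_square(observed, expected):
    """Pearson's goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented example: do choices between human interpretations and LDs
# (cases where the OMD won are excluded) depart from a 50/50 chance split?
observed = [700, 500]            # human interpretations vs. LDs preferred
expected = [600.0, 600.0]        # chance expectation over 1,200 cases
stat = chi_square(observed, expected)
significant = stat > 3.841       # critical value for df = 1, alpha = .05
```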
        <sec id="sec-4-4-2">
          <title>Correlation analyses</title>
          <p>The correlation analysis in Table 5 shows a positive relationship between metaphor conventionality and model accuracy, confirming that models tend to achieve better performance on more conventional metaphors. However, the strength of this correlation varies across models. Minerva-350M shows the highest correlation. Other Minerva models follow a similar trend, with correlation values gradually decreasing as model size increases, from Minerva-1B to Minerva-7B. GePpeTto shows the lowest and non-significant correlation, whereas LLaMAntino-2-7b shows a weak but significant correlation, in line with the larger Minerva models.</p>
          <p>The correlation analysis in Table 6 shows a positive relationship between contextual appropriateness and model accuracy, although the strength of this correlation is very low or nearly negligible for some models.</p>
          <p>Table 6: Correlation between model accuracy and context adequacy (Pearson's r; * p &lt; .05, ** p &lt; .01, *** p &lt; .001). GePpeTto: .055; Minerva-350M-base-v1.0: .255 **; Minerva-1B-base-v1.0: .191 *; Minerva-3B-base-v1.0: .213 *; Minerva-7B-base-v1.0: .160; LLaMAntino-2-7b-hf-ITA: .079.</p>
          <p>Minerva-350M exhibits the highest correlation, suggesting that this model benefits the most from more appropriate context in determining correct interpretations. Minerva-1B and Minerva-3B show significant correlations, indicating a positive but weaker effect compared to Minerva-350M. Interestingly, the correlation observed for the larger model (3B) exceeds that of the smaller one (1B), representing an exception to the previously noted trend in which larger models tend to be less sensitive to variables derived from human judgments. Minerva-7B does not reach the threshold for significance, suggesting that in larger models, the relationship between contextual relevance and accuracy may be less relevant. The same holds for GePpeTto and LLaMAntino-2-7B, with negligible correlations.</p>
          <p>The correlation between average conventionality and model accuracy offers a solid foundation for investigating how preferences are distributed across the three string types. It enables an analysis of how increasing conventionality affects the likelihood assigned to human interpretations, to OMDs, and to LDs.</p>
          <p>Figure 2 shows the trends in the percentages of sentences selected by the models, broken down by average conventionality. The chart illustrates how the share of strings receiving the highest probability varies with the conventionality of the metaphors. Whereas a positive correlation between human interpretation proportions (i.e., accuracy) and metaphor conventionality has been previously observed across all models (albeit non-significant for GePpeTto), a one-tailed test for negative correlation revealed a slight negative correlation between average conventionality and the proportion of LDs that received the highest probability across all models: GePpeTto (r = −.176, p &lt; .05), Minerva-350M (r = −.184, p &lt; .05), Minerva-1B (r = −.188, p &lt; .05), Minerva-3B (r = −.189, p &lt; .05), Minerva-7B (r = −.189, p &lt; .05), and LLaMAntino-2-7b (r = −.168, p &lt; .05).</p>
          <p>Similar analyses were conducted to examine how the average contextual adequacy of metaphors relates to the distribution of preferences across the three interpretation options. Figure 3 illustrates the proportions of interpretations that received the highest probability as contextual adequacy varies. A one-tailed test for negative correlation between contextual adequacy and the proportion of LDs with the highest probability revealed a significant relationship in both Minerva-1B (r = −.142, p &lt; .05) and Minerva-3B (r = −.171, p &lt; .05). Both models also show a positive correlation between contextual adequacy and the proportion of human interpretations, suggesting that these interpretations may gain preference at the expense of LDs, with minimal interference from OMDs. In Minerva-350M, while the proportion of human interpretations positively correlates with contextual adequacy (r = .255, p &lt; .01), no significant negative correlation was found for either distractor type.</p>
          <p>For further analysis, we report the accuracy of the models grouped by the syntactic pattern of the metaphors (see Fig. 4). Broadly speaking, the lowest performance was found in the group featuring a metaphorical intransitive verb combined with a literal subject. In contrast, the highest accuracy was achieved on metaphors that included both a metaphorical verb and a metaphorical direct object. These trends provide evidence that specific syntactic configurations either disadvantage or support the models' ability to understand metaphors.</p>
          <p>4. Discussion</p>
          <p>Results highlight distinct preference patterns among language models [...] textual information. Within the Minerva family, smaller models, such as Minerva-350M, appear more sensitive to these variables, whereas the sensitivity of larger models gradually decreases. This may indicate that larger models are relatively less dependent on perceptual, stimulus-specific variables than smaller ones, likely due to their greater generalization capabilities. Specifically, considering the results of the positive correlation test between average conventionality and model accuracy, it emerges that for most models, as the</p>
guage models when choosing between human interpre- metaphors become more conventional, human
interpretations and distractors. Notably, Minerva-350M and GeP- tations are favored while LDs are gradually penalized.
peTto show a statistically significant preference for hu- GePpeTto, however, does not follow this first trend, but
man interpretations over LDs, while LLaMAntino-2-7b only the second. This suggests that, when LDs are
exfavors LDs. Larger models in the Minerva family tend to cluded by this model, human interpretations and OMDs
disfavor OMDs, with some exhibiting behavior consistent exhibit a similar increasing trend, yet they are not equally
with simple baseline strategies. probable: human interpretations are generally assigned</p>
          <p>Moreover, model performance is influenced by the con- higher probabilities.
ventionality of the metaphor and the adequacy of con- The results regarding the correlation with the
adequacy of the sentential context in supporting the com- These results provide a nuanced picture of the
curprehension of the metaphorical expression show that, in rent capabilities and limitations of Italian-specific LLMs
larger models like Minerva-1B and Minerva-3B, higher in metaphor interpretation. They also underscore the
contextual adequacy is associated with a reduced pref- importance of linguistic diversity in model training and
erence for literal distractors, and a corresponding in- evaluation. Future work may benefit from expanding
crease in the selection of human interpretations. In con- the range of figurative phenomena studied and refining
trast, Minerva-350M shows a diferent pattern: while distractor generation to probe more deeply into
modthe proportion of human interpretations positively cor- els’ semantic representations. Additionally, collecting
relates with contextual adequacy, neither distractor type a broader set of psychometric judgments could provide
shows a significantly correlated decrease: when human- valuable insight into how these human factors correlate
generated interpretations are not selected, both distractor with model performance.
types contribute equally to the highest-probability
outcome.</p>
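<p>The one-tailed correlation tests reported in this section can be reproduced in outline as follows. This is a minimal sketch using SciPy's spearmanr on simulated data; the variable names, rating scale, and effect size are illustrative stand-ins, not the study's actual ratings or model outputs.</p>

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Illustrative data: 140 items with mean conventionality ratings (1-7 scale).
conventionality = rng.uniform(1, 7, size=140)

# Simulated binary indicator: 1 if the literal distractor (LD) received the
# highest log-likelihood for the item; made less likely as conventionality grows.
p_ld = np.clip(0.6 - 0.08 * conventionality, 0.05, 0.95)
ld_preferred = rng.binomial(1, p_ld)

# One-tailed test for a NEGATIVE correlation: the alternative hypothesis is rho < 0.
rho, p_value = spearmanr(conventionality, ld_preferred, alternative="less")
print(f"rho = {rho:.3f}, one-tailed p = {p_value:.4f}")
```

<p>The same call with alternative="greater" gives the one-tailed test for a positive correlation used for the human-interpretation proportions.</p>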
          <p>Furthermore, the observed performance differences across syntactic patterns may reflect underlying biases in the training data. One possible explanation for the poor results on the metaphorical intransitive verb ∼ literal subject constructions is the over-representation in the training data of literal constructions similar to the LDs, such as example (7).</p>
          <p>(7) Dicendo dormire si intende riposare.</p>
          <p>‘By saying sleep, one means to rest’</p>
          <p>This over-representation may lead the model to favor literal readings, assigning higher probabilities to LDs. Conversely, the higher accuracy on the metaphorical verb ∼ metaphorical direct object constructions may be due to their idiomatic nature and the presence in the training data of explanations that closely resemble human interpretations:</p>
          <p>(8) Dicendo fare lo struzzo si intende nascondersi.</p>
          <p>‘By saying burying one’s head in the sand, one means to hide.’</p>
          <p>These findings collectively underscore the importance of syntactic and idiomatic features in metaphor comprehension, while also pointing to potential limitations in training data diversity.</p>
        </sec>
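<p>The selection mechanism assumed throughout, in which each candidate string is scored by its log-likelihood and the highest-scoring candidate is taken as the model's "choice", can be sketched as follows. To stay self-contained, the sketch substitutes a toy add-one-smoothed bigram model for the causal LLMs actually evaluated (Minerva, GePpeTto, LLaMAntino); the corpus and the three candidate readings are invented for illustration.</p>

```python
import math
from collections import Counter

# Toy stand-in for an LLM: a word bigram model with add-one smoothing,
# "trained" on a tiny invented Italian corpus (illustrative only).
corpus = (
    "dicendo evaporare si intende sparire "
    "dicendo evaporare si intende svanire rapidamente "
    "il reddito puo sparire da un giorno all altro"
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = set(corpus)

def log_prob(sentence: str) -> float:
    """Summed log-likelihood of a sentence under the toy bigram model."""
    words = sentence.split()
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        num = bigrams[(prev, cur)] + 1              # add-one smoothing
        den = unigrams[prev] + len(vocab)
        total += math.log(num / den)
    return total

# Three invented candidate readings for one metaphor: the "choice" is the
# candidate assigned the highest log-likelihood.
candidates = {
    "human": "dicendo evaporare si intende sparire",
    "LD": "dicendo evaporare si intende bollire",
    "OMD": "dicendo evaporare si intende cantare",
}
scores = {label: log_prob(s) for label, s in candidates.items()}
choice = max(scores, key=scores.get)
print(scores, choice)
```

<p>With an actual causal LLM, log_prob would instead sum the model's next-token log-probabilities over the candidate string; the argmax step is unchanged.</p>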
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <p>This study explored the capacity of Italian-trained Large Language Models to interpret metaphorical expressions,
evaluating their performance based on their ability to
choose between human-produced interpretations and
systematically designed distractors. Our findings
indicate that, while no model fully replicates human-level
metaphor comprehension, smaller models, particularly
Minerva-350M and GePpeTto, demonstrate a statistically
significant preference for human-generated
interpretations over distractors.</p>
        <p>The observed correlations suggest that distributional
semantic representations, though not yet equivalent to
human inferential processes, are capable of capturing
figurative meaning, particularly for conventional
expressions.</p>
        <p>This study has several limitations. First, the dataset
includes only 140 metaphors, which may constrain the
generalizability of the results. Second, all metaphors were
drawn from parliamentary discourse, limiting coverage
of metaphor use in other domains. Third, conventionality
was assessed through subjective ratings, which reflect
perceived rather than actual frequency of use and should
therefore be considered only a proxy for true
conventionality. Finally, limited access to the models’ training
corpora prevents clear conclusions about whether model
performance reflects genuine interpretive ability or
memorization of previously seen patterns.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Appendix A</title>
      <p>L’Italia ha bisogno di una politica estera trasparente, matura, lungimirante e programmatica.
‘Italy needs a transparent, mature, forward-looking, and strategic foreign policy’
Venezia è una perla che racchiude in se stessa quella che è l’identità del popolo veneto.
‘Venice is a pearl that embodies the identity of the Venetian people’
Il sostegno è necessario a chi oggi ha visto evaporare, da un giorno all’altro, il suo reddito.
‘Support is needed for those who saw their income evaporate overnight’
La disgustosa tappa odierna, di fatto, narcotizza il Parlamento.
‘Today’s disgraceful stage effectively narcotizes the Parliament’
Questa regione affonda le sue radici in una cultura profonda, in un senso civico importante.
‘This region sinks its roots into a deep culture and a strong civic spirit’
Interpretation to be completed
Dicendo giungla di burocrazia si intende qualcosa che . . . come una giungla
‘By saying jungle of bureaucracy, one means something that . . . like a jungle’
Dicendo elefante burocratico si intende qualcosa che . . . come un elefante
‘By saying bureaucratic elephant, one means something that . . . like an elephant’
Una politica estera trasparente è una politica estera che . . .
‘A transparent foreign policy is a foreign policy that . . . ’
Si intende che Venezia . . . come una perla
‘One means that Venice . . . like a pearl’
Dicendo evaporare si intende . . .
‘By saying evaporate, one means . . . ’
Dicendo narcotizzare il Parlamento si intende . . . il Parlamento
‘By saying narcotize the Parliament, one means . . . the Parliament’
Dicendo affondare le radici si intende . . .</p>
      <p>‘By saying sink the roots, one means . . . ’</p>
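<p>The elicitation frames listed above follow a small set of fixed fill-in templates. A minimal sketch of how such prompts can be instantiated programmatically; the template keys, slot names, and fillers below are invented for illustration and are not the study's actual stimuli.</p>

```python
# Appendix-style elicitation templates with named slots (illustrative).
TEMPLATES = {
    "verbal": "Dicendo {expression} si intende {interpretation}",
    "nominal": "Si intende che {topic} {interpretation} come {vehicle}",
}

def build_item(kind: str, **slots: str) -> str:
    """Fill one template; raises KeyError if a required slot is missing."""
    return TEMPLATES[kind].format(**slots)

prompt = build_item("verbal",
                    expression="evaporare",
                    interpretation="sparire rapidamente")
print(prompt)  # -> Dicendo evaporare si intende sparire rapidamente
```

<p>Each completed template then serves as one candidate string to be scored against its distractor counterparts.</p>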
      <p>Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Pustejovsky, O. Batiukova, The Lexicon, Cambridge University Press, Cambridge, 2019.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. Lakoff, M. Johnson, Metaphors We Live By, University of Chicago Press, Chicago and London, 1980.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Mitchell, D. C. Krakauer, The debate over understanding in AI's large language models, Proceedings of the National Academy of Sciences 120 (2023). doi:10.1073/pnas.2215907120.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 610-623. doi:10.1145/3442188.3445922.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Bacciu, C. Campagnano, G. Trappolini, F. Silvestri, DanteLLM: Let's Push Italian LLM Research Forward!, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, 2024, pp. 4343-4355. URL: https://aclanthology.org/2024.lrec-main.388.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Lenci, M. Sahlgren, Distributional Semantics, Studies in Natural Language Processing, Cambridge University Press, 2023.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. Ge, R. Mao, E. Cambria, A survey on computational metaphor processing techniques: From identification, interpretation, generation to application, Artificial Intelligence Review 56 (2023) 1829-1895. doi:10.1007/s10462-023-10564-7.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Ortony, Beyond literal similarity, Psychological Review 86 (1979) 161-180. doi:10.1037/0033-295X.86.3.161.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Ortony (Ed.), Metaphor and Thought, 2 ed., Cambridge University Press, 1993. doi:10.1017/CBO9781139173865.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. F. Bowdle, D. Gentner, The career of metaphor, Psychological Review 112 (2005) 193-216. doi:10.1037/0033-295X.112.1.193.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] P. Pedinotti, E. D. Palma, L. Cerini, A. Lenci, A howling success or a working sea? Testing what BERT knows about metaphors, in: Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2021, pp. 192-204. doi:10.18653/v1/2021.blackboxnlp-1.13.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] X. Tong, R. Choenni, M. Lewis, E. Shutova, Metaphor understanding challenge dataset for LLMs, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 3517-3536. doi:10.48550/arXiv.2403.11810.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] E. Liu, C. Cui, K. Zheng, G. Neubig, Testing the ability of language models to interpret figurative language, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 4437-4452. doi:10.18653/v1/2022.naacl-main.330.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Pragglejaz Group, MIP: A method for identifying metaphorically used words in discourse, Metaphor and Symbol 22 (2007) 1-39. doi:10.1080/10926480709336752.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] L. M. LoRusso, APL-Medea - Abilità Pragmatiche Nel Linguaggio, Giunti - OS Organizzazioni Speciali, Firenze, 2009.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language, arXiv preprint (2023). doi:10.48550/arXiv.2312.09993. arXiv:2312.09993.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. De Mattei, M. Cafagna, F. Dell'Orletta, M. Nissim, M. Guerini, GePpeTto carves Italian into a language model, arXiv preprint (2020). doi:10.48550/arXiv.2004.14253. arXiv:2004.14253.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Orlando, L. Moroni, P.-L. H. Cabot, E. Barba, S. Conia, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data, in: Proceedings of the 10th Italian Conference on Computational Linguistics, 2024, pp. 707-719. URL: https://aclanthology.org/2024.clicit-1.77.pdf.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] P. Lucisano, M. E. Piemontese, Gulpease: una formula per la predizione della leggibilità di testi in lingua italiana, Scuola e Città (1988) 110-124.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] R. Marvin, T. Linzen, Targeted syntactic evaluation of language models, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1192-1202. doi:10.18653/v1/D18-1151.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] C. Kauf, E. Chersoni, A. Lenci, E. Fedorenko, A. A. Ivanova, Log probabilities are a reliable estimate of semantic plausibility in base and instruction-tuned language models, in: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2024, pp. 263-277. doi:10.18653/v1/2024.blackboxnlp-1.18.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] C. Kauf, A. A. Ivanova, G. Rambelli, E. Chersoni, J. S. She, Z. Chowdhury, E. Fedorenko, A. Lenci, Event knowledge in large language models: The gap between the impossible and the unlikely, Cognitive Science 47 (2023) e13386. doi:10.1111/cogs.13386.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>