<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language Models and the Magic of Metaphor: A Comparative Evaluation with Human Judgments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone Mazzoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alice Suozzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca E. Lebani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Centre for Living Technology (ECLT)</institution>
          ,
          <addr-line>Ca' Bottacin, Dorsoduro 3911, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>QuaCLing Lab, Dipartimento di Studi Linguistici e Culturali Comparati, Università Ca' Foscari Venezia</institution>
          ,
          <addr-line>Dorsoduro 1075, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study evaluates whether Italian-trained Large Language Models (LLMs) can interpret metaphors by comparing their performance to both human judgments and human-produced interpretations. Using three datasets containing metaphors, human interpretations, and implausible alternatives, we assess model performance via log-likelihood scores. Results show that LLMs partially replicate human understanding and are influenced by expression conventionality and linguistic context.</p>
      </abstract>
      <kwd-group>
        <kwd>Metaphor Interpretation</kwd>
        <kwd>Linguistic Evaluation</kwd>
        <kwd>Benchmark</kwd>
        <kwd>Italian</kwd>
        <kwd>Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy. Corresponding author: Simone.mazzoli@unive.it (S. Mazzoli); Alice.suozzi@unive.it (A. Suozzi); Gianluca.lebani@unive.it (G. E. Lebani). https://www.unive.it/data/persone/29007635 (S. Mazzoli); https://www.unive.it/data/persone/24102251 (A. Suozzi); https://www.unive.it/data/persone/21257857 (G. E. Lebani). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Metaphor is counted among the violations of the principle of compositionality, according to which the meaning of a linguistic expression can be determined based on the meaning of its individual parts and their syntactic structure [1]. It is configured as a syntactically well-formed sentence that is semantically incongruent when interpreted literally, based on the lexically-encoded meanings of its components. Its definitions have undergone numerous variations, ranging from the idea of simple lexical substitution of a literal term to that of a constitutive principle of the human conceptual system [2]. This is because, although there is general agreement that an interaction occurs between the two concepts evoked by the metaphor in determining the meaning of the metaphorical expression, a comprehensive formalization of the nature of this interaction has yet to be achieved. In fact, understanding metaphors requires the integration of linguistic, contextual, and cultural knowledge, thus representing a challenge not only for humans but also for Large Language Models (LLMs).</p>
      <p>LLMs have seen significant growth in recent years, demonstrating excellent performance across a wide range of interpretation and language production tasks. Their ability to understand and generate textual information has revolutionized many areas of natural language processing and numerous other fields. Since their introduction, a central question has been whether these models construct plausible representations of meaning or merely memorize patterns of form [3], as captured by the well-known stochastic parrots metaphor [4]. Given their success, there has been growing interest in the development of LLMs optimized for contexts in which languages other than English are predominant. Although multilingual models or those primarily trained on English are capable of processing and generating text in Italian, they are often considered less capable of capturing the nuances and specific characteristics of the language [5]. The recent introduction of LLMs trained from scratch on Italian data, together with models subsequently adapted through optimization processes for a specific language, makes it particularly interesting to verify whether their ability to understand metaphors can approach that of humans.</p>
      <p>In light of this, this study aims to examine the extent to which interpretations and related inferences produced by humans in response to metaphorical stimuli are favored by LLMs, as opposed to implausible interpretations that are either meaningless or convey the opposite of the intended meaning. A systematic preference for human-generated interpretations would suggest that the semantic representations of LLMs are sufficiently robust to produce accurate interpretations and replicate human inferential processes. More broadly, this would imply that the distributional information in text, which underpins the internal representations of these models [6], is sufficient to construct a semantic and common-sense knowledge framework capable of generating valid inferences about figurative language.</p>
      <p>Another promising line of research at the intersection of psycholinguistics and computational linguistics explores the cognitive plausibility of LLMs, that is, the extent to which metrics derived from these models can predict human performance on cognitive tasks. This project takes a step in that direction by collecting human judgments on the conventionality of linguistic stimuli and the adequacy of sentence-level context for comprehending expressions. It then investigates the correlation between these human ratings and LLM performance, with the aim of evaluating the models' sensitivity to such aspects.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Metaphor interpretation tasks can be grouped into three
categories [7]: property extraction, word-level
paraphrasing, and explanation matching. Property
extraction involves identifying shared attributes between the
metaphor’s Topic and Vehicle (e.g., Love is a tide → Love
is unstoppable), inspired by comparison-based theories
such as the Salience Imbalance Theory [8, 9] and the
Career of Metaphor Theory [10]. Word-level paraphrasing
replaces the metaphorical term with a literal counterpart
(e.g., She devoured the novels → She read the novels very
quickly), though this is limited when metaphors include
multiple figurative terms or when idioms are involved.
Explanation matching pairs metaphors with
dictionary-like glosses (e.g., A red-letter day → A day of significance),
but struggles with extended metaphors.</p>
      <p>Previous works have leveraged such tasks to assess the
models' ability to interpret metaphors. This project
fits within current research efforts aimed at testing the
semantic capabilities of large language models in
processing metaphors, combining several innovative aspects
inspired by the following studies.</p>
      <p>Pedinotti et al. [11] tested BERT on a dataset of 100
metaphors across four syntactic types. BERT
successfully distinguished between metaphorical, literal, and
nonsensical variants based on pseudo-log-likelihood.
Embedding analysis showed alignment with metaphorical
senses, suggesting that BERT encodes metaphor-relevant
features. Following this example in the organization
of stimuli, the present study ensures that metaphorical
expressions are balanced across fine-grained syntactic
groups. This design choice addresses an often overlooked
aspect in related work, which tends to rely on examples
with limited structural variation or narrow contextual
constraints. Furthermore, as in the aforementioned work,
the stimuli and the models tested are in Italian, ofering
a perspective on metaphor that difers from the more
commonly adopted anglocentric approach.</p>
      <p>Tong et al. [12] developed the MUNCH dataset, which
included 10,000 metaphorical sentence paraphrases and
1,500 triplets (metaphor, correct paraphrase, incorrect
paraphrase). They proposed two tasks: paraphrase
selection and paraphrase generation. GPT-3.5 outperformed
other models but often diverged from human responses,
highlighting challenges in capturing metaphorical
nuance. A notable strength of this work was its attempt to
accommodate the presence of multiple correct responses
produced by humans, which served as an effective
strategy to address the variability and intrinsic originality of
linguistic expression. Similarly, the present study aims
to reflect, as much as possible, the originality of speakers
in generating the stimuli on which models are evaluated.
To this end, multiple correct interpretations are collected
and systematically compared against incorrect ones, so
that subjectivity and individuality in metaphor
interpretation are explicitly taken into account. Moreover,
particular attention was paid to the ecological validity of
the stimuli: metaphorical expressions were directly
extracted from a linguistic corpus, with minimal alterations
to the original excerpts. Correct interpretations used for
evaluation were produced by human annotators.</p>
      <p>With a more explicit focus on the relationship between
metaphor and its interpretation, Liu et al. [13] introduced
Fig-QA, a Winograd-style task that requires models to
pair metaphoric expressions with their appropriate
literal reformulations. Incorrect pairings may involve
either mismatched metaphors or literal paraphrases that
convey the opposite meaning of the original metaphor.
GPT-3 performed best in zero-shot settings, though still
below human level. Fine-tuned models like RoBERTa
approached human accuracy, particularly when inferring
literal meaning from figurative language. In Liu et al.’s
setup, choosing the correct metaphor-meaning pair was
equivalent to assigning a higher probability to that pair,
which is the same principle used in the present study.
Each metaphor in their dataset was paired with both
a correct and an opposing interpretation, forming the
positive and negative instances, respectively. Similarly,
in this study, a distinction is drawn between plausible
interpretations, which are formulated by humans, and
implausible ones, represented by two distractors
carefully constructed according to two distinct semantic rules.
This approach prevents inflated accuracy due to models
consistently rejecting only one type of distractor, thus
supporting a more balanced and accurate assessment of
their interpretative abilities.</p>
    </sec>
    <sec id="sec-4">
      <title>3. The Magic of Metaphor: our Study</title>
      <sec id="sec-4-1">
        <title>3.1. Dataset</title>
        <p>As previously mentioned, the linguistic data used in this study include metaphors, human-generated interpretations and ratings, as well as strings functioning as distractors. The following section describes the methods employed for data collection.</p>
        <sec id="sec-4-1-1">
          <title>3.1.1. Metaphors</title>
          <p>The metaphors included in the dataset were manually extracted from the official records of the Italian Parliament, specifically from debates in the Chamber of Deputies during the 16th, 17th, and 18th legislatures (covering a time span from 2008 to 2022)<sup>1</sup>. These records, consisting of stenographic transcripts and committee summaries, were consulted to identify metaphorical expressions, with only minimal edits<sup>2</sup>. Selected text segments include variable amounts of syntactic context (e.g., coordination and subordination) to preserve interpretability of the metaphor.</p>
          <p>A political discourse corpus was selected over literary or general-purpose corpora for two main reasons. First, although poetic texts contain rich and frequent figurative language, poetic metaphors often involve extended networks of interrelated expressions, making them hard to isolate for individual analysis. In contrast, metaphors in political language are typically employed to emphasize conceptual content and are more concise due to the oral nature of parliamentary discourse. These characteristics make them easier to isolate, interpret, and analyze without compromising semantic coherence.</p>
          <p>Second, political speech allows for more efficient metaphor identification and clearer estimation of figurative-to-literal usage ratios. For example, the word scheletro 'skeleton' is more likely to appear figuratively (e.g., scheletro normativo) in political language than in medical contexts, where it retains a purely literal meaning. A specialized corpus thus offers a clearer view of metaphor usage patterns than a general corpus, where both uses may be equally distributed.</p>
          <p>Metaphors were annotated using the Metaphor Identification Procedure (MIP) by the Pragglejaz Group [14]. MIP operates at the word level and requires annotators to compare the contextual meaning of a lexical unit with a more basic, concrete, and historically prior meaning. A word is tagged as metaphorical if its contextual meaning contrasts with its basic meaning but can still be understood via it.</p>
          <p>To ensure syntactic and lexical variety, the dataset was balanced across seven groups, defined by three key variables, as detailed in Table 1: (1) pattern, or the syntactic relation between the metaphorical term and its context marker; (2) valency, or the number of syntactic arguments of the metaphorical verb; and (3) metaphorical element class, indicating whether the metaphor is expressed by a noun, verb, or adjective. Subscript indices were used to distinguish items when two elements shared the same lexical class. An example of a metaphor from each group is provided in Table 7 in Appendix A.</p>
          <p>Table 1: Balanced groups in the metaphor dataset. Columns: Pattern; Valency; Metaphorical Element (PoS); Group Size (n = 140). Rows: 1 di 2 / None / Noun1 / 20; ∼ Adj / None / Noun / 20; ∼ Adj / None / Adjective / 20; 1 = 2 / None / Noun2 / 20; ∼ / Intransitive / Verb / 20; ∼ / Transitive / Verb / 20; ∼ / Transitive / Verb and Noun / 20.</p>
          <p>The final dataset contains 140 metaphorical items, systematically balanced across syntactic patterns, valency, and lexical class of the metaphorical term, thereby offering a robust foundation for experimental and computational studies on metaphor interpretation.</p>
          <p><sup>1</sup>Official records consulted from the website of the Italian Chamber of Deputies: https://www.camera.it/leg18/221. <sup>2</sup>The metaphor collection process involved using a database search tool to identify lexical units in parliamentary debates by querying word roots. Each occurrence whose metaphorical nature was confirmed was subsequently added to our database.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>3.1.2. Human Interpretations and Ratings</title>
          <p>We collected metaphor interpretations through a questionnaire structured into four sections: informed consent, demographic data, completion instructions (in both video and text format) and the experimental section containing the metaphors. Each questionnaire included 14 metaphors, two for each balancing group, presented in random order. A total of 10 different questionnaires were created to cover the dataset of 140 metaphors.</p>
          <p>Participants were presented with sentence prompts that followed a fixed syntactic structure and pragmatic function, deliberately designed by the researchers to ensure consistency and reduce interpretive bias stemming from linguistic variation (see Tab. 8 in Appendix A). For each metaphor, participants were asked to write one or more possible completions based on the provided standardized sentence frame. The layout of the questionnaire as viewed by the participants is provided in Appendix B. A total of 121 Italian-speaking adults (mean age 32.8 years, SD 13.6) participated in the experiment. Only one participant reported a different native language, and their responses were excluded from the analysis.</p>
          <p>The responses were corrected for grammatical consistency where necessary, including verb agreement, merging of prepositions and articles, and the addition of copulas. Grammatically incorrect interpretations were discarded. In total, 2,540 interpretations were collected, of which 2,117 were unique<sup>3</sup>. The distribution of interpretations per metaphor was described using descriptive statistics: mean (18.14), median (17), standard deviation (4.57), minimum (10) and maximum (31).</p>
          <p><sup>3</sup>This means that 0.83% of all collected interpretations consist of duplicates, that is, identical interpretations provided by different participants in response to metaphors that tend to elicit higher agreement.</p>
          <p>In addition, the conventionality of each metaphor was evaluated on a scale of 1 to 5, rating how frequently the participant hears the expression used with the same meaning. The adequacy of the context was also evaluated on the same scale, measuring whether the provided sentence context was sufficient for understanding the metaphor.</p>
          <p>The rating collection described above allowed us to obtain an average conventionality score for each metaphor. This score reflects the degree of conventionality or novelty of the metaphor perceived by the participants. To illustrate, we report one metaphor rated as novel (e.g., (1), with an average score of 2.40) and one rated as conventional (e.g., (2), with an average score of 4.86):</p>
          <p>(1) La Repubblica italiana con questo Governo sta diventando lo zampirone per l'impresa. 'The Italian Republic, with this Government, is becoming like a mosquito coil for businesses.'</p>
          <p>(2) È un dramma determinato a sua volta dall'esplosione demografica dell'Africa subsahariana. 'It is a crisis caused in turn by the demographic explosion in sub-Saharan Africa.'</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>3.1.3. Distractors</title>
          <p>To create implausible interpretations for the collected metaphors (i.e., distractors), inspiration was drawn from the APL Medea test [15], a standardized tool designed to assess pragmatic skills in children aged 5 to 14. One of its subtests presents a figurative metaphor, and the child must choose the image that best represents it among one correct and three distractors. These include a literal interpretation, a semantically related image, and one showing elements of the sentence without integrating them meaningfully.</p>
          <p>In this study, a similar approach was used: two distractors were created for each of the 140 metaphors, totaling 280 distractors. They were based on alternative completions of the sentences presented to human participants (see Tab. 8), following two specific criteria: (i) Literal Distractors (LD) are plausible only if the metaphorical word is taken literally. For instance:</p>
          <p>(3) Dei numeri aridi sono dei numeri che sono privi di umidità. 'Dry numbers are numbers that are devoid of moisture.'</p>
          <p>(4) Dicendo elefante burocratico si intende qualcosa che ha una lunga proboscide come un elefante. 'By saying bureaucratic elephant, one means something that has a long trunk, like an elephant.'</p>
          <p>These distractors use predicates or attributes that belong solely to the metaphor's Vehicle and not the intended Topic. (ii) Opposite Metaphorical Distractors (OMD) express the opposite meaning of the most frequently given human interpretation. For example:</p>
          <p>(5) Si intende che il risultato è molto importante come una briciola. 'It is meant that the result is very important, like a crumb.'</p>
          <p>(6) Dicendo cassaforte di eccellenze si intende qualcosa che contiene cose di poco valore come una cassaforte. 'By saying safe of excellences, it is meant something that contains things of little value, like a safe.'</p>
          <p>In (5), molto importante contradicts the typical interpretation of briciola (small, insignificant). Similarly, in (6), cose di poco valore is the opposite of preziose, which was the dominant human interpretation of the metaphorical cassaforte.</p>
        </sec>
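        <p>The per-metaphor descriptive statistics reported in Section 3.1.2 (mean, median, standard deviation, minimum and maximum of the number of interpretations per metaphor) can be recomputed with Python's standard library. A minimal sketch follows; the counts and the helper name describe are invented for illustration and are not the authors' code:</p>

```python
import statistics

def describe(counts):
    """Summarize how many interpretations each metaphor received."""
    # counts: one integer per metaphor (interpretations collected for it)
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "sd": statistics.stdev(counts),  # sample standard deviation
        "min": min(counts),
        "max": max(counts),
    }

# Invented counts for four metaphors, only to illustrate the call;
# the paper reports mean 18.14, median 17, sd 4.57, range 10-31.
stats = describe([10, 17, 20, 31])
```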
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Models</title>
        <p>We evaluated six autoregressive models based on three different architectures (LLaMA, GPT-2, Mistral), trained on Italian data using two distinct approaches: language adaptation of an existing model (LLaMAntino-2-7b [16]) and training from scratch (GePpeTto [17] and Minerva [18]). Information about the models' architectures can be found in Table 2, while their training data are summarized in Table 3.</p>
        <p>We also include several baselines for comparison. The first baseline is the accuracy level expected from random selection among interpretations (0.33). Additionally, we test two simple models based on input string length: Longest String, which always selects the interpretation with the highest number of characters, and Shortest String, which chooses the interpretation with the fewest characters. Furthermore, we adopted a model based on the Gulpease index, a readability metric designed to assess the complexity of Italian texts. The index considers the number of sentences, letters, and words in a given text segment [19]. This model consistently selects the interpretation with the highest Gulpease score.</p>
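        <p>The three string-based baselines can be sketched in a few lines of Python. This is a hypothetical re-implementation, not the authors' code: the Gulpease formula (89 + (300 × sentences − 10 × letters) / words) is the standard one, but the sentence and word counting below is deliberately naive:</p>

```python
import re

def gulpease(text: str) -> float:
    """Gulpease readability index for Italian text (higher = easier)."""
    words = re.findall(r"\w+", text)          # naive word tokenizer
    letters = sum(len(w) for w in words)      # letters inside words only
    sentences = len(re.findall(r"[.!?]+", text)) or 1
    return 89 + (300 * sentences - 10 * letters) / max(1, len(words))

def longest_string(candidates):
    """Baseline: always pick the candidate with the most characters."""
    return max(candidates, key=len)

def shortest_string(candidates):
    """Baseline: always pick the candidate with the fewest characters."""
    return min(candidates, key=len)

def gulpease_baseline(candidates):
    """Baseline: pick the candidate with the highest Gulpease score."""
    return max(candidates, key=gulpease)
```

Each baseline receives the three candidate strings (human interpretation and the two distractors) and returns its choice, mirroring how the accuracy of the length- and readability-based strategies is computed.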
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Data analysis</title>
        <p>This study uses log-likelihood as a measure comparable to human preference, already employed in studies on grammaticality and semantic plausibility judgments [20, 21, 22], assuming that a model capable of understanding metaphorical expressions assigns a higher probability to human-generated interpretations than to the two distractors. Autoregressive language models define a probability distribution over subsequent tokens conditioned on the sequence of prior tokens. Consequently, the probability of an entire sentence can be obtained by computing the product of the conditional probabilities of each token at its respective time step:</p>
        <p>P̃(w1 … wn) = P(w1) · ∏ (t = 2 … n) P(wt | w1 … wt−1)   (1)</p>
        <p>We consider a metaphor m from the dataset of 140 metaphors, a set of interpretations of m produced by participants denoted as I, a literal distractor LD, and an opposite metaphorical distractor OMD. For each interpretation i belonging to I, the log-likelihoods of the strings i*, LD*, and OMD* are extracted, where * indicates that the metaphor is concatenated before each string<sup>4</sup>. Accuracy is calculated by taking the ratio of the number of cases in which the string i* receives a log-likelihood greater than or equal to the highest probability among the two distractors, and the cardinality of I:</p>
        <p>ACC = (1 / |I|) · Σ (i ∈ I) 1{ P̃(i*) ≥ max[ P̃(LD*), P̃(OMD*) ] }   (2)</p>
        <p>where 1(x) = 1 if x is true, and 0 otherwise. The comparison among the three strings, as illustrated by Equation 2, was therefore carried out for all interpretations provided by human participants along with their corresponding distractors.</p>
        <p><sup>4</sup>The existence of a significant difference between the proportions of strings (interpretations and distractors) preferred by the models, comparing the two conditions, presented in isolation versus preceded by the metaphor, was confirmed through chi-square tests, demonstrating the effectiveness of this manipulation and ensuring the soundness of the experimental paradigm.</p>
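        <p>The chain-rule sentence probability and the accuracy criterion described in this section can be made concrete in a short sketch. The toy conditional model and the function names below are illustrative assumptions only; in the study the log-likelihoods come from the evaluated LLMs:</p>

```python
import math

def sentence_log_prob(tokens, cond_log_prob):
    """Chain rule in log space: log P(w1..wn) = sum over t of
    log P(w_t | w_1 .. w_{t-1}); t = 0 gives the unconditioned P(w1)."""
    return sum(cond_log_prob(tokens[:t], tokens[t]) for t in range(len(tokens)))

def accuracy(triples):
    """Fraction of human interpretations whose log-likelihood is >= the
    best of the two distractors. Each triple holds the log-likelihoods
    (ll_interpretation, ll_LD, ll_OMD) for one interpretation in I,
    all scored with the metaphor prepended."""
    hits = sum(1 for ll_i, ll_ld, ll_omd in triples if ll_i >= max(ll_ld, ll_omd))
    return hits / len(triples)

# Toy model: every token equally likely over a 10-word vocabulary.
uniform = lambda prefix, token: math.log(1 / 10)
```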
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Results</title>
        <p>We report in Table 4 the accuracy values achieved by the models<sup>5</sup>, highlighting an improvement for the larger models, with the exception of LLaMAntino-2-7b, which achieves higher accuracy only compared to GePpeTto.</p>
        <p>A chi-square test revealed that all models exhibit distributions that are significantly different from those expected for the four baselines. As shown in Figure 1, there is a trend within the Minerva family models to disfavor OMDs, and this trend is directly proportional to the size of the model. This makes it necessary to test whether, in cases where this type of distractor does not receive a higher probability, the choice between human interpretations and LDs is due to chance or to one of the simple strategies represented by the baselines.</p>
        <p>To analyze this hypothesis, an additional chi-square test was conducted, excluding OMDs from the observations. The results allow us to reject the hypothesis that Minerva-350M randomly chooses between human interpretations and LDs (χ²(1) = 14.618, p &lt; .001); however, this is not possible for any other model in the same family. The same hypothesis can also be rejected for LLaMAntino-2-7b (χ²(1) = 11.132, p &lt; .001) and for GePpeTto (χ²(1) = 4.713, p &lt; .05). Yet, only for Minerva-350M and GePpeTto is it true that human interpretations are non-randomly favored, whereas LLaMAntino-2-7b, in contrast, shows a stronger preference for LDs.</p>
        <p>In addition to the inability to reject the hypothesis of random choice between human interpretations and LDs, for Minerva-7B it was not possible to reject the hypothesis that the model always chooses the longer string between LDs and OMDs. The opposite is true for the smaller Minerva-3B model, whose results differ significantly from the expected distribution of preferences between the two distractors if it follows the "longer string" strategy (χ²(1) = 18.833, p &lt; .001).</p>
        <p><sup>5</sup>An additional metric, weighted accuracy, was computed using the full set of 2,540 interpretations, including repeated responses from multiple participants. This metric captures the model's ability to assign higher probabilities to more frequently produced interpretations. Weighted accuracy increased by 0.02 points for all LLMs except GePpeTto, which improved by 0.01, suggesting that retaining repeated interpretations has minimal impact on model comparisons.</p>
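        <p>Goodness-of-fit tests of the kind reported above can be reproduced with Pearson's chi-square statistic. A minimal sketch with invented counts follows (the counts are not from the study; 3.841 is the standard critical value for df = 1 at α = .05):</p>

```python
def chi_square(observed, expected):
    """Pearson's goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented example: do choices between human interpretations and LDs
# (cases where the OMD won are excluded) depart from a 50/50 chance split?
observed = [700, 500]            # human interpretations vs. LDs preferred
expected = [600.0, 600.0]        # chance expectation over 1,200 cases
stat = chi_square(observed, expected)
significant = stat > 3.841       # critical value for df = 1, alpha = .05
```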
        <sec id="sec-4-4-2">
          <title>Correlation analyses</title>
          <p>The correlation analysis in Table 5 shows a positive relationship between metaphor conventionality and model accuracy, confirming that models tend to achieve better performance on more conventional metaphors. However, the strength of this correlation varies across models. Minerva-350M shows the highest correlation. Other Minerva models follow a similar trend, with correlation values gradually decreasing as model size increases, from Minerva-1B to Minerva-7B. GePpeTto shows the lowest and non-significant correlation, whereas LLaMAntino-2-7b shows a weak but significant correlation, in line with the larger Minerva models.</p>
          <p>The correlation analysis in Table 6 shows a positive relationship between contextual appropriateness and model accuracy, although the strength of this correlation is very low or nearly negligible for some models.</p>
          <p>Table 6: Correlation between model accuracy and context adequacy (Pearson's r; * p &lt; .05, ** p &lt; .01, *** p &lt; .001). GePpeTto: .055; Minerva-350M-base-v1.0: .255 **; Minerva-1B-base-v1.0: .191 *; Minerva-3B-base-v1.0: .213 *; Minerva-7B-base-v1.0: .160; LLaMAntino-2-7b-hf-ITA: .079.</p>
          <p>Minerva-350M exhibits the highest correlation, suggesting that this model benefits the most from more appropriate context in determining correct interpretations. Minerva-1B and Minerva-3B show significant correlations, indicating a positive but weaker effect compared to Minerva-350M. Interestingly, the correlation observed for the larger model (3B) exceeds that of the smaller one (1B), representing an exception to the previously noted trend in which larger models tend to be less sensitive to variables derived from human judgments. Minerva-7B does not reach the threshold for significance, suggesting that in larger models, the relationship between contextual relevance and accuracy may be less relevant. The same holds for GePpeTto and LLaMAntino-2-7B, with negligible correlations.</p>
          <p>The correlation between average conventionality and model accuracy offers a solid foundation for investigating how preferences are distributed across the three string types. It enables an analysis of how increasing conventionality affects the likelihood assigned to human interpretations, to OMDs, and to LDs.</p>
          <p>Figure 2 shows the trends in the percentages of sentences selected by the models, broken down by average conventionality. The chart illustrates how the share of strings receiving the highest probability varies with the conventionality of the metaphors. Whereas a positive correlation between human interpretation proportions (i.e., accuracy) and metaphor conventionality has been previously observed across all models (albeit non-significant for GePpeTto), a one-tailed test for negative correlation revealed a slight negative correlation between average conventionality and the proportion of LDs that received the highest probability across all models: GePpeTto (r = −.176, p &lt; .05), Minerva-350M (r = −.184, p &lt; .05), Minerva-1B (r = −.188, p &lt; .05), Minerva-3B (r = −.189, p &lt; .05), Minerva-7B (r = −.189, p &lt; .05), and LLaMAntino-2-7b (r = −.168, p &lt; .05).</p>
          <p>Similar analyses were conducted to examine how the average contextual adequacy of metaphors relates to the distribution of preferences across the three interpretation options. Figure 3 illustrates the proportions of interpretations that received the highest probability as contextual adequacy varies. A one-tailed test for negative correlation between contextual adequacy and the proportion of LDs with the highest probability revealed a significant relationship in both Minerva-1B (r = −.142, p &lt; .05) and Minerva-3B (r = −.171, p &lt; .05). Both models also show a positive correlation between contextual adequacy and the proportion of human interpretations, suggesting that these interpretations may gain preference at the expense of LDs, with minimal interference from OMDs. In Minerva-350M, while the proportion of human interpretations positively correlates with contextual adequacy (r = .255, p &lt; .01), no significant negative correlation was found for either distractor type.</p>
          <p>For further analysis, we report the accuracy of the models grouped by the syntactic pattern of the metaphors (see Fig. 4). Broadly speaking, the lowest performance was found in the group featuring a metaphorical intransitive verb combined with a literal subject. In contrast, the highest accuracy was achieved on metaphors that included both a metaphorical verb and a metaphorical direct object. These trends provide evidence that specific syntactic configurations either disadvantage or support the models' ability to understand metaphors.</p>
          <p>4. Discussion</p>
          <p>Results highlight distinct preference patterns among language models [...] textual information. Within the Minerva family, smaller models, such as Minerva-350M, appear more sensitive to these variables, whereas the sensitivity of larger models gradually decreases. This may indicate that larger models are relatively less dependent on perceptual, stimulus-specific variables than smaller ones, likely due to their greater generalization capabilities. Specifically, considering the results of the positive correlation test between average conventionality and model accuracy, it emerges that for most models, as the</p>
guage models when choosing between human interpre- metaphors become more conventional, human
interpretations and distractors. Notably, Minerva-350M and GeP- tations are favored while LDs are gradually penalized.
peTto show a statistically significant preference for hu- GePpeTto, however, does not follow this first trend, but
man interpretations over LDs, while LLaMAntino-2-7b only the second. This suggests that, when LDs are
exfavors LDs. Larger models in the Minerva family tend to cluded by this model, human interpretations and OMDs
disfavor OMDs, with some exhibiting behavior consistent exhibit a similar increasing trend, yet they are not equally
with simple baseline strategies. probable: human interpretations are generally assigned</p>
          <p>Moreover, model performance is influenced by the con- higher probabilities.
ventionality of the metaphor and the adequacy of con- The results regarding the correlation with the
adequacy of the sentential context in supporting the com- These results provide a nuanced picture of the
curprehension of the metaphorical expression show that, in rent capabilities and limitations of Italian-specific LLMs
larger models like Minerva-1B and Minerva-3B, higher in metaphor interpretation. They also underscore the
contextual adequacy is associated with a reduced pref- importance of linguistic diversity in model training and
erence for literal distractors, and a corresponding in- evaluation. Future work may benefit from expanding
crease in the selection of human interpretations. In con- the range of figurative phenomena studied and refining
trast, Minerva-350M shows a diferent pattern: while distractor generation to probe more deeply into
modthe proportion of human interpretations positively cor- els’ semantic representations. Additionally, collecting
relates with contextual adequacy, neither distractor type a broader set of psychometric judgments could provide
shows a significantly correlated decrease: when human- valuable insight into how these human factors correlate
generated interpretations are not selected, both distractor with model performance.
types contribute equally to the highest-probability
outcome.</p>
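<p>The one-tailed correlation tests reported in this section can be reproduced in outline as follows. This is a minimal sketch using SciPy's spearmanr on simulated data; the variable names, rating scale, and effect size are illustrative stand-ins, not the study's actual ratings or model outputs.</p>

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Illustrative data: 140 items with mean conventionality ratings (1-7 scale).
conventionality = rng.uniform(1, 7, size=140)

# Simulated binary indicator: 1 if the literal distractor (LD) received the
# highest log-likelihood for the item; made less likely as conventionality grows.
p_ld = np.clip(0.6 - 0.08 * conventionality, 0.05, 0.95)
ld_preferred = rng.binomial(1, p_ld)

# One-tailed test for a NEGATIVE correlation: the alternative hypothesis is rho < 0.
rho, p_value = spearmanr(conventionality, ld_preferred, alternative="less")
print(f"rho = {rho:.3f}, one-tailed p = {p_value:.4f}")
```

<p>The same call with alternative="greater" gives the one-tailed test for a positive correlation used for the human-interpretation proportions.</p>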
          <p>Furthermore, the observed performance differences across syntactic patterns may reflect underlying biases in the training data. One possible explanation for the poor results on the metaphorical intransitive verb ∼ literal subject constructions is the over-representation in the training data of literal constructions similar to the LDs, such as example (7).</p>
          <p>(7) Dicendo dormire si intende riposare.</p>
          <p>‘By saying sleep, one means to rest’</p>
          <p>This over-representation may lead the model to favor literal readings, assigning higher probabilities to LDs. Conversely, the higher accuracy on the metaphorical verb ∼ metaphorical direct object constructions may be due to their idiomatic nature and the presence in the training data of explanations that closely resemble human interpretations:</p>
          <p>(8) Dicendo fare lo struzzo si intende nascondersi.</p>
          <p>‘By saying burying one’s head in the sand, one means to hide.’</p>
          <p>These findings collectively underscore the importance of syntactic and idiomatic features in metaphor comprehension, while also pointing to potential limitations in training data diversity.</p>
        </sec>
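<p>The selection mechanism assumed throughout, in which each candidate string is scored by its log-likelihood and the highest-scoring candidate is taken as the model's "choice", can be sketched as follows. To stay self-contained, the sketch substitutes a toy add-one-smoothed bigram model for the causal LLMs actually evaluated (Minerva, GePpeTto, LLaMAntino); the corpus and the three candidate readings are invented for illustration.</p>

```python
import math
from collections import Counter

# Toy stand-in for an LLM: a word bigram model with add-one smoothing,
# "trained" on a tiny invented Italian corpus (illustrative only).
corpus = (
    "dicendo evaporare si intende sparire "
    "dicendo evaporare si intende svanire rapidamente "
    "il reddito puo sparire da un giorno all altro"
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = set(corpus)

def log_prob(sentence: str) -> float:
    """Summed log-likelihood of a sentence under the toy bigram model."""
    words = sentence.split()
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        num = bigrams[(prev, cur)] + 1              # add-one smoothing
        den = unigrams[prev] + len(vocab)
        total += math.log(num / den)
    return total

# Three invented candidate readings for one metaphor: the "choice" is the
# candidate assigned the highest log-likelihood.
candidates = {
    "human": "dicendo evaporare si intende sparire",
    "LD": "dicendo evaporare si intende bollire",
    "OMD": "dicendo evaporare si intende cantare",
}
scores = {label: log_prob(s) for label, s in candidates.items()}
choice = max(scores, key=scores.get)
print(scores, choice)
```

<p>With an actual causal LLM, log_prob would instead sum the model's next-token log-probabilities over the candidate string; the argmax step is unchanged.</p>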
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <p>This study explored the capacity of Italian-trained Large Language Models to interpret metaphorical expressions,
evaluating their performance based on their ability to
choose between human-produced interpretations and
systematically designed distractors. Our findings
indicate that, while no model fully replicates human-level
metaphor comprehension, smaller models, particularly
Minerva-350M and GePpeTto, demonstrate a statistically
significant preference for human-generated
interpretations over distractors.</p>
        <p>The observed correlations suggest that distributional
semantic representations, though not yet equivalent to
human inferential processes, are capable of capturing
figurative meaning, particularly for conventional
expressions.</p>
        <p>This study has several limitations. First, the dataset
includes only 140 metaphors, which may constrain the
generalizability of the results. Second, all metaphors were
drawn from parliamentary discourse, limiting coverage
of metaphor use in other domains. Third, conventionality
was assessed through subjective ratings, which reflect
perceived rather than actual frequency of use and should
therefore be considered only a proxy for true
conventionality. Finally, limited access to the models’ training
corpora prevents clear conclusions about whether model
performance reflects genuine interpretive ability or
memorization of previously seen patterns.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Appendix A</title>
      <p>L’Italia ha bisogno di una politica estera trasparente, matura, lungimirante e programmatica.
‘Italy needs a transparent, mature, forward-looking, and strategic foreign policy’
Venezia è una perla che racchiude in se stessa quella che è l’identità del popolo veneto.
‘Venice is a pearl that embodies the identity of the Venetian people’
Il sostegno è necessario a chi oggi ha visto evaporare, da un giorno all’altro, il suo reddito.
‘Support is needed for those who saw their income evaporate overnight’
La disgustosa tappa odierna, di fatto, narcotizza il Parlamento.
‘Today’s disgraceful stage effectively narcotizes the Parliament’
Questa regione affonda le sue radici in una cultura profonda, in un senso civico importante.
‘This region sinks its roots into a deep culture and a strong civic spirit’
Interpretation to be completed
Dicendo giungla di burocrazia si intende qualcosa che . . . come una giungla
‘By saying jungle of bureaucracy, one means something that . . . like a jungle’
Dicendo elefante burocratico si intende qualcosa che . . . come un elefante
‘By saying bureaucratic elephant, one means something that . . . like an elephant’
Una politica estera trasparente è una politica estera che . . .
‘A transparent foreign policy is a foreign policy that . . . ’
Si intende che Venezia . . . come una perla
‘One means that Venice . . . like a pearl’
Dicendo evaporare si intende . . .
‘By saying evaporate, one means . . . ’
Dicendo narcotizzare il Parlamento si intende . . . il Parlamento
‘By saying narcotize the Parliament, one means . . . the Parliament’
Dicendo affondare le radici si intende . . .</p>
      <p>‘By saying sink the roots, one means . . . ’</p>
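<p>The elicitation frames listed above follow a small set of fixed fill-in templates. A minimal sketch of how such prompts can be instantiated programmatically; the template keys, slot names, and fillers below are invented for illustration and are not the study's actual stimuli.</p>

```python
# Appendix-style elicitation templates with named slots (illustrative).
TEMPLATES = {
    "verbal": "Dicendo {expression} si intende {interpretation}",
    "nominal": "Si intende che {topic} {interpretation} come {vehicle}",
}

def build_item(kind: str, **slots: str) -> str:
    """Fill one template; raises KeyError if a required slot is missing."""
    return TEMPLATES[kind].format(**slots)

prompt = build_item("verbal",
                    expression="evaporare",
                    interpretation="sparire rapidamente")
print(prompt)  # -> Dicendo evaporare si intende sparire rapidamente
```

<p>Each completed template then serves as one candidate string to be scored against its distractor counterparts.</p>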
      <p>Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Pustejovsky, O. Batiukova, The Lexicon, Cambridge University Press, Cambridge, 2019.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. Lakoff, M. Johnson, Metaphors We Live By, University of Chicago Press, Chicago and London, 1980.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Mitchell, D. C. Krakauer, The debate over understanding in AI's large language models, Proceedings of the National Academy of Sciences 120 (2023). doi:10.1073/pnas.2215907120.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 610-623. doi:10.1145/3442188.3445922.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Bacciu, C. Campagnano, G. Trappolini, F. Silvestri, DanteLLM: Let's Push Italian LLM Research Forward!, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, 2024, pp. 4343-4355. URL: https://aclanthology.org/2024.lrec-main.388.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Lenci, M. Sahlgren, Distributional Semantics, Studies in Natural Language Processing, Cambridge University Press, 2023.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. Ge, R. Mao, E. Cambria, A survey on computational metaphor processing techniques: From identification, interpretation, generation to application, Artificial Intelligence Review 56 (2023) 1829-1895. doi:10.1007/s10462-023-10564-7.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Ortony, Beyond literal similarity, Psychological Review 86 (1979) 161-180. doi:10.1037/0033-295X.86.3.161.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Ortony (Ed.), Metaphor and Thought, 2 ed., Cambridge University Press, 1993. doi:10.1017/CBO9781139173865.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. F. Bowdle, D. Gentner, The career of metaphor, Psychological Review 112 (2005) 193-216. doi:10.1037/0033-295X.112.1.193.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] P. Pedinotti, E. D. Palma, L. Cerini, A. Lenci, A howling success or a working sea? Testing what BERT knows about metaphors, in: Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2021, pp. 192-204. doi:10.18653/v1/2021.blackboxnlp-1.13.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] X. Tong, R. Choenni, M. Lewis, E. Shutova, Metaphor understanding challenge dataset for LLMs, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 3517-3536. doi:10.48550/arXiv.2403.11810.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] E. Liu, C. Cui, K. Zheng, G. Neubig, Testing the ability of language models to interpret figurative language, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 4437-4452. doi:10.18653/v1/2022.naacl-main.330.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Pragglejaz Group, MIP: A method for identifying metaphorically used words in discourse, Metaphor and Symbol 22 (2007) 1-39. doi:10.1080/10926480709336752.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] L. M. LoRusso, APL-Medea - Abilità Pragmatiche Nel Linguaggio, Giunti - OS Organizzazioni Speciali, Firenze, 2009.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language, arXiv preprint (2023). doi:10.48550/arXiv.2312.09993. arXiv:2312.09993.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. De Mattei, M. Cafagna, F. Dell'Orletta, M. Nissim, M. Guerini, GePpeTto carves Italian into a language model, arXiv preprint (2020). doi:10.48550/arXiv.2004.14253. arXiv:2004.14253.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Orlando, L. Moroni, P.-L. H. Cabot, E. Barba, S. Conia, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data, in: Proceedings of the 10th Italian Conference on Computational Linguistics, 2024, pp. 707-719. URL: https://aclanthology.org/2024.clicit-1.77.pdf.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] P. Lucisano, M. E. Piemontese, Gulpease: una formula per la predizione della leggibilità di testi in lingua italiana, Scuola e Città (1988) 110-124.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] R. Marvin, T. Linzen, Targeted syntactic evaluation of language models, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1192-1202. doi:10.18653/v1/D18-1151.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] C. Kauf, E. Chersoni, A. Lenci, E. Fedorenko, A. A. Ivanova, Log probabilities are a reliable estimate of semantic plausibility in base and instruction-tuned language models, in: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2024, pp. 263-277. doi:10.18653/v1/2024.blackboxnlp-1.18.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] C. Kauf, A. A. Ivanova, G. Rambelli, E. Chersoni, J. S. She, Z. Chowdhury, E. Fedorenko, A. Lenci, Event knowledge in large language models: The gap between the impossible and the unlikely, Cognitive Science 47 (2023) e13386. doi:10.1111/cogs.13386.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>