<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multimodal Chain-of-Thought Prompting for Metaphor Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sofia</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lugli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Strapparava</string-name>
          <email>strappa@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces an exploratory approach in the field of metaphorical and visual reasoning by proposing the Multimodal Chain-of-Thought Prompting for Metaphor Generation task, aimed at generating metaphorical linguistic expressions from non-metaphorical images using the multimodal LLaVA 1.5 model and the two-step approach of multimodal chain-of-thought prompting. The generated metaphors were evaluated in two ways: automatically, using BERTscore, and by five human workers on Amazon Mechanical Turk. Concerning the automatic evaluation, each generated metaphorical expression was paired with a corresponding human metaphorical expression. The overall BERTscore was the following: precision = 0.41, recall = 0.43, and F1 = 0.42, suggesting that the generated and human metaphors might not have captured the same semantic meaning. The human evaluation showed the model's ability to generate metaphorical expressions, as 92% of them were classified as metaphors by the majority of the workers. Additionally, the evaluation revealed interesting patterns in the metaphoricity, familiarity and appeal scores across the generated metaphors: as the metaphoricity and appeal scores increased, the familiarity score decreased, suggesting that the model exhibited a certain degree of creativity, as it also generated novel or unconventional metaphorical expressions. It is important to acknowledge that this work is exploratory in nature and has certain limitations.</p>
      </abstract>
      <kwd-group>
        <kwd>metaphor generation</kwd>
        <kwd>large language models</kwd>
        <kwd>pragmatics</kwd>
        <kwd>creativity</kwd>
        <kwd>multimodality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The scope of this paper is to introduce an alternative approach for generating metaphorical linguistic expressions from non-metaphorical images. To this end, we employed the multimodal model LLaVA 1.5 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and adopted a two-step approach known as multimodal chain-of-thought prompting to facilitate metaphor generation. The metaphors generated by the model were evaluated through BERTscore [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and by human workers on Amazon Mechanical Turk. The results show the model's ability to generate metaphorical expressions, with 92% of the generated expressions being classified as metaphors. Additionally, the evaluation revealed interesting patterns in terms of the metaphoricity, familiarity and appeal scores of the generated expressions. Interestingly, as the metaphoricity score increases, the familiarity score decreases while the appeal score increases. This suggests that the model was able to create novel or uncommon metaphorical expressions which may differ from the more conventional metaphors, with which the evaluators might have been more familiar. Despite being less familiar, the metaphorical expressions were preferred over the non-metaphorical ones. It is important to acknowledge that this is an exploratory work, which aims to offer a different approach to multimodal metaphor generation. As such, it is essential to point out the presence of some limitations, in particular concerning the choice of the visual inputs and the constraints of the human evaluation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>For most people, metaphor is merely a rhetorical device</title>
        <p>
          restricted to poetic language; however, according to the
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Conceptual Metaphor Theory (CMT) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] metaphor is
perAttribution 4.0 International (CC BY 4.0).
2.2. Related Works
vasive in everyday language, playing a significant role in been less research in computational modelling of visual
communication, cognition and decision making. More and multimodal metaphors, in particular works
accountprecisely, we talk about conceptual metaphor and linguis- ing for metaphor localization, understanding and
generatic metaphor. Conceptual metaphors consist of systematic tion [
          <xref ref-type="bibr" rid="ref4 ref5">26, 27, 5, 4</xref>
          ]. In particular, [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] introduced MetaCLUE,
sets of mappings across conceptual domains, whereby a a collection of vision tasks on visual metaphor which
target domain, which is usually a more abstract and com- enables comprehensive evaluation and development of
plex concept, is partly structured in terms of a diferent visual metaphor research. Concerning metaphor
genersource domain, which usually defines a more concrete ation, [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] proposed a task that involves generating an
and common concept. Conceptual metaphors are then image that efectively conveys the metaphorical message
reflected in our everyday language by a wide variety provided as the text prompt; however, the generated
imof linguistic metaphors. For instance, ARGUMENT IS ages perform poorly compared to real images in
conveyWAR is a conceptual metaphor, where ARGUMENT is ing metaphorical messages. Additionally, [27] proposed
the target domain and WAR is the source domain; exam- an alternative task for generating visual metaphors from
ples of its linguistic metaphors are e.g. Your claims are linguistic metaphors using Chain-of-Thought prompting,
indefensible. He attacked every weak point in my argu- showing improvements in the quality of visual metaphors
ment. You disagree? Okay, shoot! [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Some of these generated by difusion-based text-to-image models.
Nevmetaphorical mappings can be defined as conventional ertheless, the common aspect across these studies is that
metaphors, as they are so deep-rooted in our everyday the metaphorical quality was already present either in
thought and language that they might have become the the textual or in the visual input employed.
Interestdominant way of framing a specific concept, and they ingly, [28] and [29] dealt with literal images and textual
represent the commonsense [10]; while other metaphor- metaphors; however their tasks focused on association
ical mappings, i.e. novel metaphors, are more creative, between the text and images, rather than on metaphor
and they are not (yet) used in everyday discourse, but generation. Therefore, this paper aims to propose an
may become conventionalized if frequently used. alternative approach involving generating
metaphorical linguistic expressions from non-metaphorical images,
which lack inherent metaphorical qualities.
        </p>
        <p>Over the past years, NLP research has been focusing on 2.3. Chain-of-Thought Prompting
literal and lower-level linguistic information, while
humans excels at high-level semantic task, involving also The advent of large language models has inevitably
the use of figurative language [ 11]. Moreover, statistical changed the NLP field [ 30], in particular they opened the
corpus analysis [12] indicates that in corpora, metaphors prospect to the new paradigm of ”prompt-based learning”
occur in approximately one-third of the sentence. There- [31]. [30] introduced the concept of chain-of-thought
fore, metaphor gradually became an important topic in (CoT) prompting, which improves the ability of large
computational linguistics and NLP. Numerous studies language models to perform complex reasoning tasks by
have been conducted to investigate metaphors, result- employing intermediate reasoning steps. They combined
ing in three main sub-tasks: metaphor identification this approach with few-shot prompting (Few-shot-CoT),
[11, 13, 14, 15], metaphor interpretation [16, 17, 18], and which enables the language model to generate chains
metaphor generation [19, 20, 21]. of thought when examples of those are provided.
An</p>
        <p>
          As human meaning representations rely not only on other approach, known as Zero-shot-CoT [32] consists
linguistic exposure, but also on perceptual system and in adding the simple prompt Let’s think step by step to
sensory-motor experience, [
          <xref ref-type="bibr" rid="ref2">2, 22</xref>
          ]; and as metaphors are the original prompt. The advantage of this method is
not merely a matter of language but also of thought that it eliminates the need for hand-crafted few-shot
and action [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], it became relevant to study metaphors examples, resulting in greater versatility. Recently, [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
through diferent modalities. In NLP, the shift towards introduced a multimodal chain-of-thought prompting
apmultimodality happened once computational approaches proach (Multimodal-CoT), which incorporates language
started adding sensory and contextual features which led (text) and vision (images) modalities into a two-stage
to a better performance in metaphor processing [23, 24]. framework. The rationale generation and answer
inferBecause of the grounded nature of metaphors, metaphors ence are separated in two diferent steps, allowing the
can occur in diferent modalities: visual and multimodal answer inference to benefit from well-generated
ratiometaphors are typically used in mass media communica- nales that are based on multimodal information.
tion (e.g., advertising, newspaper) [25]. Visual metaphors
are monomodal and expressed through vision, whereas
multimodal metaphors are expressed at least through two
modalities. Compared to textual metaphors, there has
        </p>
      </sec>
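        <p>
          To make the difference between these prompting styles concrete, the following is a minimal sketch (our illustration, not code from the cited works) of how a Zero-shot-CoT prompt, a Few-shot-CoT prompt, and a two-stage Multimodal-CoT-style call could be assembled; the model interface is hypothetical:
        </p>
        <preformat>
# Illustrative only: prompt construction for the CoT variants described above.
question = "If a shirt costs $20 and is discounted by 25%, what is the price?"

# Zero-shot-CoT: append the trigger phrase to elicit intermediate reasoning.
zero_shot_cot = question + "\nLet's think step by step."

# Few-shot-CoT: prepend worked examples that include their reasoning chains.
example = (
    "Q: A pen costs $2 and a notebook costs $3. What do both cost?\n"
    "A: The pen is $2 and the notebook is $3, so 2 + 3 = 5. The answer is $5.\n"
)
few_shot_cot = example + "Q: " + question + "\nA:"

# Multimodal-CoT separates rationale generation from answer inference:
# stage 1 produces a rationale from (text, image); stage 2 infers the
# answer from (text, rationale, image).
def multimodal_cot(model, text, image):
    rationale = model.generate(text + "\nRationale:", image)
    return model.generate(text + "\n" + rationale + "\nAnswer:", image)
        </preformat>
        <p>
          The two-step procedure adopted in Section 3 follows this rationale-then-answer pattern, with image captions playing the role of the rationale.
        </p>
      </sec>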
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <sec id="sec-3-1">
        <title>All the data used and the complete results obtained are publicly available at the following repository: https:// github.com/SofiaLugli/Multi_COT_meta_gen.git.</title>
        <p>3.1. Model</p>
      </sec>
      <sec id="sec-3-2">
        <title>For the purpose of this study, we employed the new mul</title>
        <p>
          timodal model LLaVA 1.5 (Large Language and Vision
Assistant) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] which is the next iteration of LLaVA [33],
considered as the first attempt to use language-only
GPT4 to generate multimodal language-image
instructionfollowing data. LLaVA 1.5 is a end-to-end trained large
language model combining a pre-trained
CLIP-ViT-L336px visual encoder with an MLP projection [34] and
large language model Vicuna [35] for general purpose
visual and language understanding. The model achieved
new SoTA performance across 11 benchmarks, thanks
to new academic-task-oriented VQA data with simple
response formatting prompts. One of the main reason
for choosing this model is its impressive multimodal chat
abilities; additionally, it is worth noting it is the first
opensource project to GPT-V alternative. More precisely, we
used the llava-v1.5 13B-4bit and the parameters were set
as follows: temperature=0.2, max_new_tokens=1024.1
3.2. Dataset Collection
        </p>
      </sec>
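        <p>
          As an illustration only, a roughly equivalent setup with the Hugging Face port of the model might look as follows; the llava-hf/llava-1.5-13b-hf checkpoint, the 4-bit quantization configuration, and the generate() helper are our assumptions, not details from the paper:
        </p>
        <preformat>
# Hedged sketch: loading LLaVA 1.5 (13B, 4-bit) with the stated decoding
# parameters. The model ID and quantization setup are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
    device_map="auto",
)

def generate(image: Image.Image, prompt: str) -> str:
    """Run one LLaVA 1.5 turn on (image, prompt) with the paper's settings."""
    text = "USER: &lt;image&gt;\n" + prompt + " ASSISTANT:"
    inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=0.2, max_new_tokens=1024)
    decoded = processor.decode(out[0], skip_special_tokens=True)
    return decoded.split("ASSISTANT:")[-1].strip()
        </preformat>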
      <sec id="sec-3-3">
        <title>1https://github.com/haotian-liu/LLaVA 2https://metaphor.icsi.berkeley.edu</title>
      </sec>
      <sec id="sec-3-4">
        <title>In order to select the metaphors for our research, we</title>
        <p>
          retrieved 300 conceptual metaphors from the MetaNet In this section, we will provide an explanation of the task
Metaphor Wiki, 2 a comprehensive repository of concep- at hand. We propose an alternative approach for
multitual metaphors based on years of research on the Con- modal metaphor generation by using both language and
ceptual Metaphor Theory. These metaphors follow the non-metaphorical visual inputs. Our approach is based
standard format, where a target domain is compared to a on the multimodal CoT prompting technique [
          <xref ref-type="bibr" rid="ref8">8, 36</xref>
          ].
source domain, e.g., ACHIEVING POWER IS MOVING Our approach follows a two-step process, as shown in
UPWARDS, CANCER IS A JOURNEY, ENVIRONMEN- Fig.1. Firstly, the model is fed with the non-metaphorical
TAL HARM IS PHYSICAL INJURY. To ensure an efective image containing both the images of the target and
visual representation for the metaphors, we collected two source domains. The model’s task is to generate captions
images for each metaphor: one representing the target describing each of these images. We provide the prompt:
domain and the other representing the source domain. The image contains 2 separated images: one
Given the fact that ”LLaVA-1.5 is not yet capable of pro- image at the top and one image at the bottom.
cessing multiple images” [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], for each metaphor, the two First, caption the image at the top, and then
images corresponding to the two domains have been caption the image at the bottom. Remember:
pasted together in one image with the target domain the images are unrelated to each other and so
image at the top and the source domain image at the bot- are the captions. Once the content of the picture has
tom. The images were sourced from Google Image and been generated, it is then used as input for the second
they vary in style, ranging from realistic to cartoon-like prompt, which involves generating metaphorical
exprespictures. sions based on the source and target domains. For this,
we employ the following prompt: Context: Metaphors
consist of mappings between the source domain
and the target domain.The source domain is
the conceptual domain from which we draw the
metaphorical expression, while the target
domain is the conceptual domain that we try
5
4
3
2
1
        </p>
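        <p>
          A minimal sketch of this pasting step is shown below; the resizing to a common width and the file names are our assumptions, as the paper does not specify these details:
        </p>
        <preformat>
# Hedged sketch: stacking the target-domain image on top of the
# source-domain image, as described above.
from PIL import Image

def stack_images(target_path: str, source_path: str, out_path: str, width: int = 512) -> None:
    """Paste the target-domain image above the source-domain image."""
    target = Image.open(target_path).convert("RGB")
    source = Image.open(source_path).convert("RGB")
    # Resize both images to the same width, preserving the aspect ratio.
    target = target.resize((width, int(target.height * width / target.width)))
    source = source.resize((width, int(source.height * width / source.width)))
    combined = Image.new("RGB", (width, target.height + source.height), "white")
    combined.paste(target, (0, 0))              # target domain at the top
    combined.paste(source, (0, target.height))  # source domain at the bottom
    combined.save(out_path)

# Hypothetical file names for ENVIRONMENTAL HARM IS PHYSICAL INJURY.
stack_images("environmental_harm.jpg", "physical_injury.jpg", "combined.jpg")
        </preformat>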
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task Description</title>
        <p>
          In this section, we provide an explanation of the task at hand. We propose an alternative approach for multimodal metaphor generation by using both language and non-metaphorical visual inputs. Our approach is based on the multimodal CoT prompting technique [
          <xref ref-type="bibr" rid="ref8">8, 36</xref>
          ] and follows a two-step process, as shown in Fig. 1. Firstly, the model is fed with the non-metaphorical image containing both the images of the target and source domains. The model's task is to generate captions describing each of these images. We provide the prompt: The image contains 2 separated images: one image at the top and one image at the bottom. First, caption the image at the top, and then caption the image at the bottom. Remember: the images are unrelated to each other and so are the captions. Once the content of the picture has been generated, it is then used as input for the second prompt, which involves generating metaphorical expressions based on the source and target domains. For this, we employ the following prompt: Context: Metaphors consist of mappings between the source domain and the target domain. The source domain is the conceptual domain from which we draw the metaphorical expression, while the target domain is the conceptual domain that we try to understand. Task: Create one metaphorical linguistic expression using the source domain and the target domain represented in the pictures.
        </p>
        <p>
          For instance, Fig. 1 provides a visual representation of the task in the case of the conceptual metaphor ENVIRONMENTAL HARM IS PHYSICAL INJURY. In this example, the model was able to successfully generate two distinct captions for the target domain image and the source domain image. Subsequently, given the second prompt, the model was able to generate a corresponding metaphorical expression, such as wounded environment. Additionally, the model provided a correct explanation of the newly generated metaphor. To prove the utility of the method, the task was also performed on a subset of the dataset without CoT prompting: only the second, metaphor-generating prompt was used, without the preceding image-captioning prompt. The results were less satisfactory. For instance, for the conceptual metaphor ENVIRONMENTAL HARM IS PHYSICAL INJURY, the model generated the expression The sun shines brightly over the barren landscape, illuminating the industrial complex like a beacon of hope. This output, compared to the metaphor generated through CoT prompting (wounded environment), does not involve a metaphor and fails to consider the images of both source and target domains.
        </p>
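        <p>
          Putting the two prompts together, the two-step procedure can be sketched as follows. This is a minimal illustration reusing the hypothetical generate() helper from Section 3.1; exactly how the captions are passed into the second step is our assumption, as the paper only states that they are used as input for the second prompt:
        </p>
        <preformat>
# Hedged sketch of the two-step multimodal CoT procedure described above.
CAPTION_PROMPT = (
    "The image contains 2 separated images: one image at the top and one image "
    "at the bottom. First, caption the image at the top, and then caption the "
    "image at the bottom. Remember: the images are unrelated to each other and "
    "so are the captions."
)
METAPHOR_PROMPT = (
    "Context: Metaphors consist of mappings between the source domain and the "
    "target domain. The source domain is the conceptual domain from which we "
    "draw the metaphorical expression, while the target domain is the "
    "conceptual domain that we try to understand. Task: Create one metaphorical "
    "linguistic expression using the source domain and the target domain "
    "represented in the pictures."
)

def two_step_metaphor(image):
    # Step 1: caption the stacked target-domain and source-domain images.
    captions = generate(image, CAPTION_PROMPT)
    # Step 2: generate the metaphor, conditioning on the captions (assumed
    # here to be prepended to the second prompt).
    return generate(image, captions + "\n\n" + METAPHOR_PROMPT)
        </preformat>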
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation setup</title>
        <p>
          The evaluation of the generated metaphorical expressions has been conducted in two ways: through BERTscore and by five human workers through Amazon Mechanical Turk.
        </p>
        <p>
          Concerning the automatic metaphor evaluation through BERTscore [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], each generated metaphorical expression (candidate) was paired with a corresponding human metaphorical expression (reference) retrieved from MetaNet, which provides real-world examples of linguistic metaphors sourced from various contexts (e.g., newspapers, books, etc.). However, MetaNet does not provide examples for all the metaphors in its repository, so 75 metaphors lacking example references were excluded from this evaluation. Compared to traditional, commonly used evaluation metrics [37, 38, 39], which rely on n-gram counts, BERTscore [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] computes token similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection [40]. It then calculates Recall and Precision, which are combined into an F1 score.
        </p>
        <p>
          Concerning the human evaluation, each generated expression was evaluated by five Amazon Mechanical Turk workers from English-speaking countries (Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States). The workers were required to have an approval rate greater than 95% on 1000 prior approved HITs; their reward was $0.12 per task. To ensure the quality of the evaluation, the workers were given background knowledge regarding the Conceptual Metaphor Theory, as well as positive and negative examples for the task. The workers had to choose whether the generated linguistic expression (e.g., Wounded environment) could be accepted as a linguistic metaphor for its corresponding conceptual metaphor (e.g., ENVIRONMENTAL HARM IS PHYSICAL INJURY) with the following Yes or No question: Can the linguistic expression be considered as a linguistic metaphor for the provided conceptual metaphor?. Additionally, they were asked two other yes/no questions regarding the familiarity and appeal of the expressions: Have you encountered this linguistic expression before? and Is this linguistic expression appealing to you?. To consider an expression as metaphorical, it had to be evaluated as such by at least three out of the five workers. It is worth noting that the workers were not told that the metaphors were not human-generated, in order to prevent any potential bias.
        </p>
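        <p>
          As an illustration, the BERTscore computation can be reproduced with the bert-score package. This is a sketch under our assumptions; in particular, the rescale_with_baseline flag is our guess, chosen because the reported score ranges are well below the raw BERTscore values typically observed:
        </p>
        <preformat>
# Hedged sketch: scoring generated metaphors (candidates) against MetaNet
# examples (references) with BERTscore (pip install bert-score).
from bert_score import score

candidates = ["feeling down in the dumps"]  # generated for SAD IS DOWN
references = ["I'm feeling down"]           # MetaNet example

P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"precision={P.mean():.2f}, recall={R.mean():.2f}, F1={F1.mean():.2f}")
        </preformat>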
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        In this section, we present the results derived from the automatic and the human evaluation. Regarding the automatic evaluation, it is important to note that, overall, the BERTscore between the generated and the human metaphors was low; the average scores were the following: precision = 0.41, recall = 0.43, and F1 = 0.42. The highest score was achieved for the metaphor SAD IS DOWN, where the generated metaphor feeling down in the dumps and the real-world example I'm feeling down achieved precision = 0.67, recall = 0.84, and F1 = 0.74. The low BERTscore suggests that there is a discrepancy between the model's generations and the human examples, which may indicate that the generated metaphors do not capture the same semantic meaning as the human-generated ones. Additionally, this might be due to the difference in contexts: human-generated metaphors often reference real-world examples, including real people and events, whereas the generated metaphors tend to be more generic and less nuanced compared to the human-generated ones. Moreover, another reason behind the low BERTscore is that, while robust, the metric might still have limitations in capturing the subtle and nuanced differences and similarities in metaphorical language, which are typically subjective and context-dependent.
      </p>
      <p>
        Concerning the human evaluation by five MTurk workers, it was conducted on three criteria: metaphoricity, familiarity and appeal of the generated linguistic expressions. First of all, the expressions obtained a metaphoricity mean score of 3.8, which means that, on average, the generated expressions were considered as metaphorical by the majority of the workers. A total of 92% of the linguistic expressions were evaluated as metaphors by at least three workers. Among these, 92 expressions were unanimously recognized as metaphors by all five evaluators, for instance Wounded environment, generated for the conceptual metaphor ENVIRONMENTAL HARM IS PHYSICAL INJURY. Additional examples of the generated expressions and their corresponding metaphoricity agreement scores can be found in Table 1, while the complete results are available in our repository. Furthermore, 108 expressions were considered as metaphors by four workers and 76 expressions by three workers. Out of the 300 metaphors, only 24 generated expressions were not evaluated as metaphors, as they were recognized as metaphors by either two (21 expressions) or only one worker (3 expressions). It is worth noting that none of the expressions was evaluated as a non-metaphor by all of the workers. These results can be considered as positive, suggesting that LLaVA 1.5 successfully generated metaphorical expressions from non-metaphorical visual inputs.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Examples of generated linguistic expressions, their corresponding conceptual metaphors, and the metaphoricity agreement scores (number of workers, out of five, who judged the expression to be a metaphor).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Agreement</th><th>Generated expression</th><th>Conceptual metaphor</th></tr>
          </thead>
          <tbody>
            <tr><td>5</td><td>Wounded environment</td><td>ENVIRONMENTAL HARM IS PHYSICAL INJURY</td></tr>
            <tr><td>5</td><td>House of thoughts</td><td>MIND IS A BUILDING</td></tr>
            <tr><td>5</td><td>She is wearing a bandage on her heart</td><td>PSYCHOLOGICAL HARM IS PHYSICAL INJURY</td></tr>
            <tr><td>4</td><td>Climbing the stairs of success</td><td>ACHIEVING POWER IS MOVING UPWARDS</td></tr>
            <tr><td>4</td><td>Fighting the battle against cancer</td><td>CANCER PATIENT IS PHYSICAL COMBATANT</td></tr>
            <tr><td>4</td><td>The burden of the virus is weighing heavily on the man's shoulders</td><td>DISEASES ARE BURDENS</td></tr>
            <tr><td>3</td><td>Digesting knowledge</td><td>ACQUIRING IDEAS IS EATING</td></tr>
            <tr><td>3</td><td>Battle of words</td><td>ARGUMENT IS WAR</td></tr>
            <tr><td>3</td><td>Walking down a road to recovery</td><td>CANCER IS A JOURNEY</td></tr>
            <tr><td>2</td><td>A financial heart attack</td><td>ADDRESSING ECONOMIC PROBLEMS IS TREATING AN ILLNESS</td></tr>
            <tr><td>2</td><td>Embracing the warmth of friendship</td><td>AFFECTION IS WARMTH</td></tr>
            <tr><td>2</td><td>Their love was as hot as the sun</td><td>PASSION IS HEAT</td></tr>
            <tr><td>1</td><td>Shaking hands over a book of contracts is like a marriage of business and legal agreements</td><td>AGREEMENT IS PHYSICAL PROXIMITY</td></tr>
            <tr><td>1</td><td>A family's journey through life, with the man as the guide and the woman and child as his companions</td><td>BEING IN A LOW SOCIAL CLASS IS BEING LOW ON A SCALE</td></tr>
            <tr><td>1</td><td>A political body is like a human body</td><td>GOVERNMENT IS A PERSON</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Now let us examine the remaining two criteria. In terms of familiarity, the average score is 2.95, and 67% of the expressions were considered as familiar by at least three workers. Only 22 expressions were considered as familiar by all five workers, for instance the expression A journey through life for PROGRESSING THROUGH LIFE IS MOVING ALONG A PATH. Additionally, 73 metaphors were familiar to four evaluators, while 106 expressions were familiar to three evaluators. On the other hand, 71 metaphors were familiar to only two workers, 24 were familiar to only one worker, and 4 were not familiar to any worker. In other words, out of 300 expressions, 99 can indeed be considered unfamiliar, as they were rated as familiar by two or fewer workers. These findings regarding familiarity indicate that the model generated not only familiar expressions but also novel or uncommon ones, suggesting that the model exhibits a certain degree of creativity in this task.
      </p>
      <p>
        Moving on to the appeal criterion, the average score is 3.32, and 78% of the generated expressions were liked by at least three workers. Among the expressions, 37 were liked by all five workers, e.g., Walking down a road to recovery for CANCER IS A JOURNEY. Furthermore, 98 expressions appealed to four workers, 99 to three workers, 57 to two workers and 9 to only one worker. These results indicate that the generated expressions were mostly appreciated.
      </p>
      <p>
        Let us now examine the distribution of the mean agreement scores for familiarity and appeal in relation to the agreement scores for metaphoricity. As illustrated in Fig. 2, the observed pattern seems to suggest that the mean familiarity and appeal scores exhibit contrasting trends across different metaphoricity scores. Interestingly, as the metaphoricity score increases, the familiarity score decreases while the appeal score increases. Metaphoricity scores 5 and 1 represent the extremes, with distinct differences in both familiarity and appeal. For the generated metaphorical expressions evaluated as such by all five workers, the mean familiarity score is 2.92 and the mean appeal score is 3.6; whereas for the expressions considered metaphorical by only one worker, the mean familiarity score is 3.67 and the mean appeal score is 3.0. With the exception of the expressions with metaphoricity score 2, which registered the lowest score (2.71) for both familiarity and appeal, the pattern seems to indicate that expressions with higher metaphoricity scores tend to have lower familiarity and higher appeal. This means that the evaluators found the literal generated expressions (metaphoricity scores 1 and 2) to be more familiar compared to the metaphorical ones. Hence, the results suggest that the model was able to create novel metaphorical expressions which may differ from the more conventional metaphors, with which the evaluators might have been more familiar. Despite being less familiar, the metaphorical expressions were preferred over the non-metaphorical ones. These findings show that the model exhibited a degree of creativity in metaphor generation, as it generated novel or unconventional metaphorical expressions which were appreciated by human evaluators.
      </p>
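      <p>
        The agreement-based analysis above can be reproduced with a few lines of pandas. This is a sketch; the vote data shown below is hypothetical, as the raw worker judgments are available in our repository:
      </p>
      <preformat>
# Hedged sketch: aggregating the five binary worker judgments per expression.
import pandas as pd

# Hypothetical raw votes (1 = yes), one list of five worker answers per question.
df = pd.DataFrame({
    "expression": ["Wounded environment", "A political body is like a human body"],
    "metaphor":  [[1, 1, 1, 1, 1], [1, 0, 0, 0, 0]],
    "familiar":  [[0, 1, 1, 0, 1], [1, 1, 1, 0, 1]],
    "appealing": [[1, 1, 1, 1, 0], [0, 1, 1, 0, 1]],
})
for col in ["metaphor", "familiar", "appealing"]:
    df[col + "_score"] = df[col].apply(sum)  # agreement score in 0-5

# An expression counts as a metaphor if at least three of five workers agreed.
df["is_metaphor"] = df["metaphor_score"] >= 3

# Mean familiarity/appeal per metaphoricity score (the trend shown in Fig. 2).
print(df.groupby("metaphor_score")[["familiar_score", "appealing_score"]].mean())
      </preformat>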
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This study aimed to explore an alternative approach for multimodal metaphor generation using the new LLaVA 1.5 model and Multimodal-CoT prompting. The results showed the model's ability to generate metaphorical expressions when provided with linguistic and visual inputs which lack inherent metaphorical qualities. Additionally, the evaluation revealed interesting patterns across the metaphoricity, familiarity and appeal scores of the generated expressions. The model exhibited creativity, as it generated novel or unconventional metaphorical expressions, which were also preferred over non-metaphorical ones. It is important to state again that this is an exploratory work with some limitations. One limitation to consider is the choice of the images used in the study: as they were manually selected from Google Image, their quality may influence the quality of the captions and metaphors generated by the model. Another limitation to consider is the subjectivity of the evaluation process: it is possible that Amazon MTurk workers may lack the necessary sensitivity and background knowledge to accurately recognize and evaluate metaphorical expressions, even though the instructions included background information about metaphor. Future works should aim to address these limitations by selecting more accurate images, as well as incorporating more diverse and expert annotators.</p>
      <p>Despite these limitations, the task shows promising results for future research in the field of metaphorical and visual reasoning.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <sec id="sec-5-1">
        <title>We acknowledge the support of the PNRR project FAIR Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.</title>
        <p>[10] E. Semino, Metaphor in Discourse, Metaphor in [24] E. Shutova, D. Kiela, J. Maillard, Black holes and
Discourse, Cambridge University Press, 2008. URL: white rabbits: Metaphor identification with visual
https://books.google.it/books?id=QT1uilVRDTYC. features, in: Proceedings of the 2016 conference of
[11] E. V. Shutova, Computational approaches to figu- the North American chapter of the association for
rative language, Technical Report, University of computational linguistics: Human language
techCambridge, Computer Laboratory, 2011. nologies, 2016, pp. 160–170.
[12] G. J. Steen, A. G. Dorst, J. B. Herrmann, A. A. Kaal, [25] C. Forceville, Pictorial metaphor in advertising,
T. Krennmayr, Metaphor in usage, Cognitive Lin- Routledge, 2002.</p>
        <p>guistics (2010). [26] D. Zhang, M. Zhang, H. Zhang, L. Yang, H. Lin,
[13] Y. Tsvetkov, L. Boytsov, A. Gershman, E. Nyberg, Multimet: A multimodal dataset for metaphor
unC. Dyer, Metaphor detection with cross-lingual derstanding, in: Proceedings of the 59th Annual
model transfer, in: Proceedings of the 52nd Annual Meeting of the Association for Computational
LinMeeting of the Association for Computational Lin- guistics and the 11th International Joint Conference
guistics (Volume 1: Long Papers), 2014, pp. 248–258. on Natural Language Processing (Volume 1: Long
[14] G. Gao, E. Choi, Y. Choi, L. Zettlemoyer, Neu- Papers), 2021, pp. 3214–3225.
ral metaphor detection in context, arXiv preprint [27] T. Chakrabarty, A. Saakyan, O. Winn,
arXiv:1808.09653 (2018). A. Panagopoulou, Y. Yang, M. Apidianaki,
[15] R. Mao, X. Li, M. Ge, E. Cambria, Metapro: A com- S. Muresan, I spy a metaphor: Large language
putational metaphor processing model for text pre- models and difusion models co-create visual
processing, Information Fusion 86 (2022) 30–43. metaphors, arXiv preprint arXiv:2305.14724 (2023).
[16] E. Shutova, Automatic metaphor interpretation as [28] G. Özbal, D. Pighin, C. Strapparava, et al., A proverb
a paraphrasing task, in: Human language tech- is worth a thousand words: learning to associate
nologies: the 2010 annual conference of the North images with proverbs, in: Proceedings of the 41st
American chapter of the association for computa- Annual Conference of the Cognitive Science Society
tional linguistics, 2010, pp. 1029–1037. (CogSci’19), Cognitive Science Society, 2019, pp.
[17] C. Su, S. Huang, Y. Chen, Automatic detection and 2515–2521.</p>
        <p>interpretation of nominal metaphor based on the [29] R. Yosef, Y. Bitton, D. Shahaf, Irfl: Image
recogtheory of meaning, Neurocomputing 219 (2017) nition of figurative language, arXiv preprint
300–311. arXiv:2303.15445 (2023).
[18] E. Liu, C. Cui, K. Zheng, G. Neubig, Testing the [30] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia,
ability of language models to interpret figurative E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought
language, arXiv preprint arXiv:2204.12632 (2022). prompting elicits reasoning in large language
mod[19] T. Veale, Round up the usual suspects: Knowledge- els, Advances in Neural Information Processing
based metaphor generation, in: Proceedings of the Systems 35 (2022) 24824–24837.</p>
        <p>Fourth Workshop on Metaphor in NLP, 2016, pp. [31] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G.
Neu34–41. big, Pre-train, prompt, and predict: A systematic
[20] Z. Yu, X. Wan, How to avoid sentences spelling survey of prompting methods in natural language
boring? towards a neural approach to unsupervised processing, ACM Computing Surveys 55 (2023)
metaphor generation, in: Proceedings of the 2019 1–35.</p>
        <p>Conference of the North American Chapter of the [32] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y.
IwaAssociation for Computational Linguistics: Human sawa, Large language models are zero-shot
reasonLanguage Technologies, Volume 1 (Long and Short ers, 2023. arXiv:2205.11916.</p>
        <p>Papers), 2019, pp. 861–871. [33] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction
[21] T. Chakrabarty, X. Zhang, S. Muresan, N. Peng, tuning, arXiv preprint arXiv:2304.08485 (2023).</p>
        <p>Mermaid: Metaphor generation with symbolism [34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh,
and discriminative decoding, arXiv preprint G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
arXiv:2103.06779 (2021). J. Clark, G. Krueger, I. Sutskever, Learning
transfer[22] M. M. Louwerse, Symbol interdependency in sym- able visual models from natural language
supervibolic and embodied cognition, Topics in Cognitive sion, 2021. arXiv:2103.00020.</p>
        <p>Science 3 (2011) 273–302. [35] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu,
[23] P. Turney, Y. Neuman, D. Assaf, Y. Cohen, Literal H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.
and metaphorical sense identification through con- Gonzalez, I. Stoica, E. P. Xing, Vicuna: An
opencrete and abstract context, in: Proceedings of the source chatbot impressing gpt-4 with 90%*
chat2011 Conference on Empirical Methods in Natural gpt quality, 2023. URL: https://lmsys.org/blog/
Language Processing, 2011, pp. 680–690. 2023-03-30-vicuna/.
[36] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal,</p>
        <p>S. Ma, T. Lv, L. Cui, O. K. Mohammed, Q. Liu,
et al., Language is not all you need: Aligning
perception with language models, arXiv preprint
arXiv:2302.14045 (2023).
[37] S. Banerjee, A. Lavie, Meteor: An automatic
metric for mt evaluation with improved correlation
with human judgments, in: Proceedings of the
acl workshop on intrinsic and extrinsic evaluation
measures for machine translation and/or
summarization, 2005, pp. 65–72.
[38] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a
method for automatic evaluation of machine
translation, in: Proceedings of the 40th annual meeting
of the Association for Computational Linguistics,
2002, pp. 311–318.
[39] C.-Y. Lin, Rouge: A package for automatic
evaluation of summaries, in: Text summarization
branches out, 2004, pp. 74–81.
[40] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,</p>
        <p>Bert: Pre-training of deep bidirectional
transformers for language understanding, arXiv preprint
arXiv:1810.04805 (2018).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lakoff</surname>
          </string-name>
          , M. Johnson, Metaphors We Live By, University of Chicago Press,
          <year>2008</year>
          . URL: https://books. google.it/books?id=r6nOYYtxzUoC.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L. W.</given-names>
            <surname>Barsalou</surname>
          </string-name>
          , Grounded cognition,
          <source>Annu. Rev. Psychol</source>
          .
          <volume>59</volume>
          (
          <year>2008</year>
          )
          <fpage>617</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Akula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Driscoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Narayana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Changpinyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Damle</surname>
          </string-name>
          , G. Pruthi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. T.</given-names>
            <surname>Freeman</surname>
          </string-name>
          , et al.,
          <article-title>Metaclue: Towards comprehensive visual metaphors research</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>23201</fpage>
          -
          <lpage>23211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shwartz</surname>
          </string-name>
          ,
          <article-title>Memecap: A dataset for captioning and interpreting memes</article-title>
          ,
          <source>arXiv preprint arXiv:2305.13703</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naseriparsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Met-meme: A multimodal meme dataset rich in metaphors</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2887</fpage>
          -
          <lpage>2899</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakrabarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shwartz</surname>
          </string-name>
          ,
          <article-title>It's not rocket science: Interpreting figurative language in narratives</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>589</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.03744.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>Multimodal chain-of-thought reasoning in language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.00923</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>Bertscore: Evaluating text generation with bert</article-title>
          ,
          <source>arXiv preprint arXiv:1904.09675</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] E. Semino, Metaphor in Discourse, Cambridge University Press, 2008. URL: https://books.google.it/books?id=QT1uilVRDTYC.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] E. V. Shutova, Computational approaches to figurative language, Technical Report, University of Cambridge, Computer Laboratory, 2011.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] G. J. Steen, A. G. Dorst, J. B. Herrmann, A. A. Kaal, T. Krennmayr, Metaphor in usage, Cognitive Linguistics (2010).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Tsvetkov, L. Boytsov, A. Gershman, E. Nyberg, C. Dyer, Metaphor detection with cross-lingual model transfer, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 248-258.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] G. Gao, E. Choi, Y. Choi, L. Zettlemoyer, Neural metaphor detection in context, arXiv preprint arXiv:1808.09653 (2018).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. Mao, X. Li, M. Ge, E. Cambria, Metapro: A computational metaphor processing model for text pre-processing, Information Fusion 86 (2022) 30-43.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] E. Shutova, Automatic metaphor interpretation as a paraphrasing task, in: Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics, 2010, pp. 1029-1037.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] C. Su, S. Huang, Y. Chen, Automatic detection and interpretation of nominal metaphor based on the theory of meaning, Neurocomputing 219 (2017) 300-311.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] E. Liu, C. Cui, K. Zheng, G. Neubig, Testing the ability of language models to interpret figurative language, arXiv preprint arXiv:2204.12632 (2022).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] T. Veale, Round up the usual suspects: Knowledge-based metaphor generation, in: Proceedings of the Fourth Workshop on Metaphor in NLP, 2016, pp. 34-41.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Z. Yu, X. Wan, How to avoid sentences spelling boring? towards a neural approach to unsupervised metaphor generation, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 861-871.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Chakrabarty, X. Zhang, S. Muresan, N. Peng, Mermaid: Metaphor generation with symbolism and discriminative decoding, arXiv preprint arXiv:2103.06779 (2021).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] M. M. Louwerse, Symbol interdependency in symbolic and embodied cognition, Topics in Cognitive Science 3 (2011) 273-302.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. Turney, Y. Neuman, D. Assaf, Y. Cohen, Literal and metaphorical sense identification through concrete and abstract context, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 680-690.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] E. Shutova, D. Kiela, J. Maillard, Black holes and white rabbits: Metaphor identification with visual features, in: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, 2016, pp. 160-170.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C. Forceville, Pictorial metaphor in advertising, Routledge, 2002.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] D. Zhang, M. Zhang, H. Zhang, L. Yang, H. Lin, Multimet: A multimodal dataset for metaphor understanding, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 3214-3225.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] T. Chakrabarty, A. Saakyan, O. Winn, A. Panagopoulou, Y. Yang, M. Apidianaki, S. Muresan, I spy a metaphor: Large language models and diffusion models co-create visual metaphors, arXiv preprint arXiv:2305.14724 (2023).</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] G. Özbal, D. Pighin, C. Strapparava, et al., A proverb is worth a thousand words: learning to associate images with proverbs, in: Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci'19), Cognitive Science Society, 2019, pp. 2515-2521.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] R. Yosef, Y. Bitton, D. Shahaf, Irfl: Image recognition of figurative language, arXiv preprint arXiv:2303.15445 (2023).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824-24837.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1-35.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, 2023. arXiv:2205.11916.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, arXiv preprint arXiv:2304.08485 (2023).</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, Q. Liu, et al., Language is not all you need: Aligning perception with language models, arXiv preprint arXiv:2302.14045 (2023).</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65-72.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311-318.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74-81.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>