<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multimodal Chain-of-Thought Prompting for Metaphor Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sofia</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lugli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Strapparava</string-name>
          <email>strappa@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces an exploratory approach in the field of metaphorical and visual reasoning by proposing the Multimodal Chain-of-Thought Prompting for Metaphor Generation task, aimed at generating metaphorical linguistic expressions from non-metaphorical images using the multimodal LLaVA 1.5 model and the two-step approach of multimodal chain-of-thought prompting. The generated metaphors were evaluated in two ways: automatically, using BERTscore, and by five human workers on Amazon Mechanical Turk. Concerning the automatic evaluation, each generated metaphorical expression was paired with a corresponding human metaphorical expression. The overall BERTscore was the following: precision = 0.41, recall = 0.43, and F1 = 0.42, suggesting that the generated and human metaphors might not have captured the same semantic meaning. The human evaluation showed the model's ability to generate metaphorical expressions, as 92% of them were classified as metaphors by the majority of the workers. Additionally, the evaluation revealed interesting patterns in the metaphoricity, familiarity and appeal scores across the generated metaphors: as the metaphoricity and appeal scores increased, the familiarity score decreased, suggesting that the model exhibited a certain degree of creativity, as it also generated novel or unconventional metaphorical expressions. It is important to acknowledge that this work is exploratory in nature and has certain limitations.</p>
      </abstract>
      <kwd-group>
        <kwd>metaphor generation</kwd>
        <kwd>large language models</kwd>
        <kwd>pragmatics</kwd>
        <kwd>creativity</kwd>
        <kwd>multimodality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The scope of this paper is to introduce an alternative approach for generating metaphorical linguistic expressions from non-metaphorical images. To this end, we employed the multimodal model LLaVA 1.5 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and adopted a two-step approach known as multimodal chain-of-thought prompting to facilitate metaphor generation. The metaphors generated by the model were evaluated through BERTscore [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and by human workers on Amazon Mechanical Turk. The results show the model's ability to generate metaphorical expressions, with 92% of the generated expressions being classified as metaphors. Additionally, the evaluation revealed interesting patterns in terms of the metaphoricity, familiarity and appeal scores of the generated expressions. Interestingly, as the metaphoricity score increases, the familiarity score decreases while the appeal score increases. This suggests that the model was able to create novel or uncommon metaphorical expressions which may differ from the more conventional metaphors, with which the evaluators might have been more familiar. Despite being less familiar, the metaphorical expressions were preferred over the non-metaphorical ones. It is important to acknowledge that this is an exploratory work, which aims to offer a different approach to multimodal metaphor generation. As such, it is essential to point out the presence of some limitations, in particular concerning the choice of the visual inputs and the constraints of the human evaluation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>For most people, metaphor is merely a rhetorical device</title>
        <p>
          restricted to poetic language; however, according to the
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Conceptual Metaphor Theory (CMT) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] metaphor is
perAttribution 4.0 International (CC BY 4.0).
2.2. Related Works
vasive in everyday language, playing a significant role in been less research in computational modelling of visual
communication, cognition and decision making. More and multimodal metaphors, in particular works
accountprecisely, we talk about conceptual metaphor and linguis- ing for metaphor localization, understanding and
generatic metaphor. Conceptual metaphors consist of systematic tion [
          <xref ref-type="bibr" rid="ref4 ref5">26, 27, 5, 4</xref>
          ]. In particular, [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] introduced MetaCLUE,
sets of mappings across conceptual domains, whereby a a collection of vision tasks on visual metaphor which
target domain, which is usually a more abstract and com- enables comprehensive evaluation and development of
plex concept, is partly structured in terms of a diferent visual metaphor research. Concerning metaphor
genersource domain, which usually defines a more concrete ation, [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] proposed a task that involves generating an
and common concept. Conceptual metaphors are then image that efectively conveys the metaphorical message
reflected in our everyday language by a wide variety provided as the text prompt; however, the generated
imof linguistic metaphors. For instance, ARGUMENT IS ages perform poorly compared to real images in
conveyWAR is a conceptual metaphor, where ARGUMENT is ing metaphorical messages. Additionally, [27] proposed
the target domain and WAR is the source domain; exam- an alternative task for generating visual metaphors from
ples of its linguistic metaphors are e.g. Your claims are linguistic metaphors using Chain-of-Thought prompting,
indefensible. He attacked every weak point in my argu- showing improvements in the quality of visual metaphors
ment. You disagree? Okay, shoot! [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Some of these generated by difusion-based text-to-image models.
Nevmetaphorical mappings can be defined as conventional ertheless, the common aspect across these studies is that
metaphors, as they are so deep-rooted in our everyday the metaphorical quality was already present either in
thought and language that they might have become the the textual or in the visual input employed.
Interestdominant way of framing a specific concept, and they ingly, [28] and [29] dealt with literal images and textual
represent the commonsense [10]; while other metaphor- metaphors; however their tasks focused on association
ical mappings, i.e. novel metaphors, are more creative, between the text and images, rather than on metaphor
and they are not (yet) used in everyday discourse, but generation. Therefore, this paper aims to propose an
may become conventionalized if frequently used. alternative approach involving generating
metaphorical linguistic expressions from non-metaphorical images,
which lack inherent metaphorical qualities.
        </p>
        <p>Over the past years, NLP research has been focusing on 2.3. Chain-of-Thought Prompting
literal and lower-level linguistic information, while
humans excels at high-level semantic task, involving also The advent of large language models has inevitably
the use of figurative language [ 11]. Moreover, statistical changed the NLP field [ 30], in particular they opened the
corpus analysis [12] indicates that in corpora, metaphors prospect to the new paradigm of ”prompt-based learning”
occur in approximately one-third of the sentence. There- [31]. [30] introduced the concept of chain-of-thought
fore, metaphor gradually became an important topic in (CoT) prompting, which improves the ability of large
computational linguistics and NLP. Numerous studies language models to perform complex reasoning tasks by
have been conducted to investigate metaphors, result- employing intermediate reasoning steps. They combined
ing in three main sub-tasks: metaphor identification this approach with few-shot prompting (Few-shot-CoT),
[11, 13, 14, 15], metaphor interpretation [16, 17, 18], and which enables the language model to generate chains
metaphor generation [19, 20, 21]. of thought when examples of those are provided.
An</p>
        <p>
          As human meaning representations rely not only on other approach, known as Zero-shot-CoT [32] consists
linguistic exposure, but also on perceptual system and in adding the simple prompt Let’s think step by step to
sensory-motor experience, [
          <xref ref-type="bibr" rid="ref2">2, 22</xref>
          ]; and as metaphors are the original prompt. The advantage of this method is
not merely a matter of language but also of thought that it eliminates the need for hand-crafted few-shot
and action [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], it became relevant to study metaphors examples, resulting in greater versatility. Recently, [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
through diferent modalities. In NLP, the shift towards introduced a multimodal chain-of-thought prompting
apmultimodality happened once computational approaches proach (Multimodal-CoT), which incorporates language
started adding sensory and contextual features which led (text) and vision (images) modalities into a two-stage
to a better performance in metaphor processing [23, 24]. framework. The rationale generation and answer
inferBecause of the grounded nature of metaphors, metaphors ence are separated in two diferent steps, allowing the
can occur in diferent modalities: visual and multimodal answer inference to benefit from well-generated
ratiometaphors are typically used in mass media communica- nales that are based on multimodal information.
tion (e.g., advertising, newspaper) [25]. Visual metaphors
are monomodal and expressed through vision, whereas
multimodal metaphors are expressed at least through two
modalities. Compared to textual metaphors, there has
        </p>
      </sec>
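        <p>
          To make the difference between these prompting styles concrete, the following is a minimal sketch (our illustration, not code from the cited works) of how a Zero-shot-CoT prompt, a Few-shot-CoT prompt, and a two-stage Multimodal-CoT-style call could be assembled; the model interface is hypothetical:
        </p>
        <preformat>
# Illustrative only: prompt construction for the CoT variants described above.
question = "If a shirt costs $20 and is discounted by 25%, what is the price?"

# Zero-shot-CoT: append the trigger phrase to elicit intermediate reasoning.
zero_shot_cot = question + "\nLet's think step by step."

# Few-shot-CoT: prepend worked examples that include their reasoning chains.
example = (
    "Q: A pen costs $2 and a notebook costs $3. What do both cost?\n"
    "A: The pen is $2 and the notebook is $3, so 2 + 3 = 5. The answer is $5.\n"
)
few_shot_cot = example + "Q: " + question + "\nA:"

# Multimodal-CoT separates rationale generation from answer inference:
# stage 1 produces a rationale from (text, image); stage 2 infers the
# answer from (text, rationale, image).
def multimodal_cot(model, text, image):
    rationale = model.generate(text + "\nRationale:", image)
    return model.generate(text + "\n" + rationale + "\nAnswer:", image)
        </preformat>
        <p>
          The two-step procedure adopted in Section 3 follows this rationale-then-answer pattern, with image captions playing the role of the rationale.
        </p>
      </sec>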
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <sec id="sec-3-1">
        <title>All the data used and the complete results obtained are publicly available at the following repository: https:// github.com/SofiaLugli/Multi_COT_meta_gen.git.</title>
        <p>3.1. Model</p>
      </sec>
      <sec id="sec-3-2">
        <title>For the purpose of this study, we employed the new mul</title>
        <p>
          timodal model LLaVA 1.5 (Large Language and Vision
Assistant) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] which is the next iteration of LLaVA [33],
considered as the first attempt to use language-only
GPT4 to generate multimodal language-image
instructionfollowing data. LLaVA 1.5 is a end-to-end trained large
language model combining a pre-trained
CLIP-ViT-L336px visual encoder with an MLP projection [34] and
large language model Vicuna [35] for general purpose
visual and language understanding. The model achieved
new SoTA performance across 11 benchmarks, thanks
to new academic-task-oriented VQA data with simple
response formatting prompts. One of the main reason
for choosing this model is its impressive multimodal chat
abilities; additionally, it is worth noting it is the first
opensource project to GPT-V alternative. More precisely, we
used the llava-v1.5 13B-4bit and the parameters were set
as follows: temperature=0.2, max_new_tokens=1024.1
3.2. Dataset Collection
        </p>
      </sec>
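        <p>
          As an illustration only, a roughly equivalent setup with the Hugging Face port of the model might look as follows; the llava-hf/llava-1.5-13b-hf checkpoint, the 4-bit quantization configuration, and the generate() helper are our assumptions, not details from the paper:
        </p>
        <preformat>
# Hedged sketch: loading LLaVA 1.5 (13B, 4-bit) with the stated decoding
# parameters. The model ID and quantization setup are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
    device_map="auto",
)

def generate(image: Image.Image, prompt: str) -> str:
    """Run one LLaVA 1.5 turn on (image, prompt) with the paper's settings."""
    text = "USER: &lt;image&gt;\n" + prompt + " ASSISTANT:"
    inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=0.2, max_new_tokens=1024)
    decoded = processor.decode(out[0], skip_special_tokens=True)
    return decoded.split("ASSISTANT:")[-1].strip()
        </preformat>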
      <sec id="sec-3-3">
        <title>1https://github.com/haotian-liu/LLaVA 2https://metaphor.icsi.berkeley.edu</title>
      </sec>
      <sec id="sec-3-4">
        <title>In order to select the metaphors for our research, we</title>
        <p>
          retrieved 300 conceptual metaphors from the MetaNet In this section, we will provide an explanation of the task
Metaphor Wiki, 2 a comprehensive repository of concep- at hand. We propose an alternative approach for
multitual metaphors based on years of research on the Con- modal metaphor generation by using both language and
ceptual Metaphor Theory. These metaphors follow the non-metaphorical visual inputs. Our approach is based
standard format, where a target domain is compared to a on the multimodal CoT prompting technique [
          <xref ref-type="bibr" rid="ref8">8, 36</xref>
          ].
source domain, e.g., ACHIEVING POWER IS MOVING Our approach follows a two-step process, as shown in
UPWARDS, CANCER IS A JOURNEY, ENVIRONMEN- Fig.1. Firstly, the model is fed with the non-metaphorical
TAL HARM IS PHYSICAL INJURY. To ensure an efective image containing both the images of the target and
visual representation for the metaphors, we collected two source domains. The model’s task is to generate captions
images for each metaphor: one representing the target describing each of these images. We provide the prompt:
domain and the other representing the source domain. The image contains 2 separated images: one
Given the fact that ”LLaVA-1.5 is not yet capable of pro- image at the top and one image at the bottom.
cessing multiple images” [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], for each metaphor, the two First, caption the image at the top, and then
images corresponding to the two domains have been caption the image at the bottom. Remember:
pasted together in one image with the target domain the images are unrelated to each other and so
image at the top and the source domain image at the bot- are the captions. Once the content of the picture has
tom. The images were sourced from Google Image and been generated, it is then used as input for the second
they vary in style, ranging from realistic to cartoon-like prompt, which involves generating metaphorical
exprespictures. sions based on the source and target domains. For this,
we employ the following prompt: Context: Metaphors
consist of mappings between the source domain
and the target domain.The source domain is
the conceptual domain from which we draw the
metaphorical expression, while the target
domain is the conceptual domain that we try
5
4
3
2
1
        </p>
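        <p>
          A minimal sketch of this pasting step is shown below; the resizing to a common width and the file names are our assumptions, as the paper does not specify these details:
        </p>
        <preformat>
# Hedged sketch: stacking the target-domain image on top of the
# source-domain image, as described above.
from PIL import Image

def stack_images(target_path: str, source_path: str, out_path: str, width: int = 512) -> None:
    """Paste the target-domain image above the source-domain image."""
    target = Image.open(target_path).convert("RGB")
    source = Image.open(source_path).convert("RGB")
    # Resize both images to the same width, preserving the aspect ratio.
    target = target.resize((width, int(target.height * width / target.width)))
    source = source.resize((width, int(source.height * width / source.width)))
    combined = Image.new("RGB", (width, target.height + source.height), "white")
    combined.paste(target, (0, 0))              # target domain at the top
    combined.paste(source, (0, target.height))  # source domain at the bottom
    combined.save(out_path)

# Hypothetical file names for ENVIRONMENTAL HARM IS PHYSICAL INJURY.
stack_images("environmental_harm.jpg", "physical_injury.jpg", "combined.jpg")
        </preformat>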
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task Description</title>
        <p>
          In this section, we provide an explanation of the task at hand. We propose an alternative approach for multimodal metaphor generation by using both language and non-metaphorical visual inputs. Our approach is based on the multimodal CoT prompting technique [
          <xref ref-type="bibr" rid="ref8">8, 36</xref>
          ] and follows a two-step process, as shown in Fig. 1. Firstly, the model is fed with the non-metaphorical image containing both the images of the target and source domains. The model's task is to generate captions describing each of these images. We provide the prompt: The image contains 2 separated images: one image at the top and one image at the bottom. First, caption the image at the top, and then caption the image at the bottom. Remember: the images are unrelated to each other and so are the captions. Once the content of the picture has been generated, it is then used as input for the second prompt, which involves generating metaphorical expressions based on the source and target domains. For this, we employ the following prompt: Context: Metaphors consist of mappings between the source domain and the target domain. The source domain is the conceptual domain from which we draw the metaphorical expression, while the target domain is the conceptual domain that we try to understand. Task: Create one metaphorical linguistic expression using the source domain and the target domain represented in the pictures.
        </p>
        <p>
          For instance, Fig. 1 provides a visual representation of the task in the case of the conceptual metaphor ENVIRONMENTAL HARM IS PHYSICAL INJURY. In this example, the model was able to successfully generate two distinct captions for the target domain image and the source domain image. Subsequently, given the second prompt, the model was able to generate a corresponding metaphorical expression, such as wounded environment. Additionally, the model provided a correct explanation of the newly generated metaphor. To prove the utility of the method, the task was also performed on a subset of the dataset without CoT prompting: only the second, metaphor-generating prompt was used, without the preceding image-captioning prompt. The results were less satisfactory. For instance, for the conceptual metaphor ENVIRONMENTAL HARM IS PHYSICAL INJURY, the model generated the expression The sun shines brightly over the barren landscape, illuminating the industrial complex like a beacon of hope. This output, compared to the metaphor generated through CoT prompting (wounded environment), does not involve a metaphor and fails to consider the images of both source and target domains.
        </p>
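        <p>
          Putting the two prompts together, the two-step procedure can be sketched as follows. This is a minimal illustration reusing the hypothetical generate() helper from Section 3.1; exactly how the captions are passed into the second step is our assumption, as the paper only states that they are used as input for the second prompt:
        </p>
        <preformat>
# Hedged sketch of the two-step multimodal CoT procedure described above.
CAPTION_PROMPT = (
    "The image contains 2 separated images: one image at the top and one image "
    "at the bottom. First, caption the image at the top, and then caption the "
    "image at the bottom. Remember: the images are unrelated to each other and "
    "so are the captions."
)
METAPHOR_PROMPT = (
    "Context: Metaphors consist of mappings between the source domain and the "
    "target domain. The source domain is the conceptual domain from which we "
    "draw the metaphorical expression, while the target domain is the "
    "conceptual domain that we try to understand. Task: Create one metaphorical "
    "linguistic expression using the source domain and the target domain "
    "represented in the pictures."
)

def two_step_metaphor(image):
    # Step 1: caption the stacked target-domain and source-domain images.
    captions = generate(image, CAPTION_PROMPT)
    # Step 2: generate the metaphor, conditioning on the captions (assumed
    # here to be prepended to the second prompt).
    return generate(image, captions + "\n\n" + METAPHOR_PROMPT)
        </preformat>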
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Evaluation setup</title>
        <p>
          The evaluation of the generated metaphorical expressions has been conducted in two ways: through BERTscore and by five human workers through Amazon Mechanical Turk.
        </p>
        <p>
          Concerning the automatic metaphor evaluation through BERTscore [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], each generated metaphorical expression (candidate) was paired with a corresponding human metaphorical expression (reference) retrieved from MetaNet, which provides real-world examples of linguistic metaphors sourced from various contexts (e.g., newspapers, books, etc.). However, MetaNet does not provide examples for all the metaphors in its repository, so 75 metaphors lacking example references were excluded from this evaluation. Compared to traditional, commonly used evaluation metrics [37, 38, 39], which rely on n-gram counts, BERTscore [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] computes token similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection [40]. It then calculates Recall and Precision, which are combined into an F1 score.
        </p>
        <p>
          Concerning the human evaluation, each generated expression was evaluated by five Amazon Mechanical Turk workers from English-speaking countries (Australia, Canada, Ireland, New Zealand, the United Kingdom, and the United States). The workers were required to have an approval rate greater than 95% on 1000 prior approved HITs; their reward was $0.12 per task. To ensure the quality of the evaluation, the workers were given background knowledge regarding the Conceptual Metaphor Theory, as well as positive and negative examples for the task. The workers had to choose whether the generated linguistic expression (e.g., Wounded environment) could be accepted as a linguistic metaphor for its corresponding conceptual metaphor (e.g., ENVIRONMENTAL HARM IS PHYSICAL INJURY) with the following Yes or No question: Can the linguistic expression be considered as a linguistic metaphor for the provided conceptual metaphor?. Additionally, they were asked two other yes/no questions regarding the familiarity and appeal of the expressions: Have you encountered this linguistic expression before? and Is this linguistic expression appealing to you?. To consider an expression as metaphorical, it had to be evaluated as such by at least three out of the five workers. It is worth noting that the workers were not told that the metaphors were not human-generated, in order to prevent any potential bias.
        </p>
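        <p>
          As an illustration, the BERTscore computation can be reproduced with the bert-score package. This is a sketch under our assumptions; in particular, the rescale_with_baseline flag is our guess, chosen because the reported score ranges are well below the raw BERTscore values typically observed:
        </p>
        <preformat>
# Hedged sketch: scoring generated metaphors (candidates) against MetaNet
# examples (references) with BERTscore (pip install bert-score).
from bert_score import score

candidates = ["feeling down in the dumps"]  # generated for SAD IS DOWN
references = ["I'm feeling down"]           # MetaNet example

P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"precision={P.mean():.2f}, recall={R.mean():.2f}, F1={F1.mean():.2f}")
        </preformat>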
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        In this section, we present the results derived from the automatic and the human evaluation. Regarding the automatic evaluation, it is important to note that, overall, the BERTscore between the generated and the human metaphors was low; the average scores were the following: precision = 0.41, recall = 0.43, and F1 = 0.42. The highest score was achieved for the metaphor SAD IS DOWN, where the generated metaphor feeling down in the dumps and the real-world example I'm feeling down achieved precision = 0.67, recall = 0.84, and F1 = 0.74. The low BERTscore suggests that there is a discrepancy between the model's generations and the human examples, which may indicate that the generated metaphors do not capture the same semantic meaning as the human-generated ones. Additionally, this might be due to the difference in contexts: human-generated metaphors often reference real-world examples, including real people and events, whereas the generated metaphors tend to be more generic and less nuanced compared to the human-generated ones. Moreover, another reason behind the low BERTscore is that, while robust, the metric might still have limitations in capturing the subtle and nuanced differences and similarities in metaphorical language, which are typically subjective and context-dependent.
      </p>
      <p>
        Concerning the human evaluation by five MTurk workers, it was conducted on three criteria: metaphoricity, familiarity and appeal of the generated linguistic expressions. First of all, the expressions obtained a metaphoricity mean score of 3.8, which means that, on average, the generated expressions were considered as metaphorical by the majority of the workers. A total of 92% of the linguistic expressions were evaluated as metaphors by at least three workers. Among these, 92 expressions were unanimously recognized as metaphors by all five evaluators, for instance Wounded environment, generated for the conceptual metaphor ENVIRONMENTAL HARM IS PHYSICAL INJURY. Additional examples of the generated expressions and their corresponding metaphoricity agreement scores can be found in Table 1, while the complete results are available in our repository. Furthermore, 108 expressions were considered as metaphors by four workers and 76 expressions by three workers. Out of the 300 metaphors, only 24 generated expressions were not evaluated as metaphors, as they were recognized as metaphors by either two (21 expressions) or only one worker (3 expressions). It is worth noting that none of the expressions was evaluated as a non-metaphor by all of the workers. These results can be considered as positive, suggesting that LLaVA 1.5 successfully generated metaphorical expressions from non-metaphorical visual inputs.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Examples of generated linguistic expressions, their corresponding conceptual metaphors, and the metaphoricity agreement scores (number of workers, out of five, who judged the expression to be a metaphor).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Agreement</th><th>Generated expression</th><th>Conceptual metaphor</th></tr>
          </thead>
          <tbody>
            <tr><td>5</td><td>Wounded environment</td><td>ENVIRONMENTAL HARM IS PHYSICAL INJURY</td></tr>
            <tr><td>5</td><td>House of thoughts</td><td>MIND IS A BUILDING</td></tr>
            <tr><td>5</td><td>She is wearing a bandage on her heart</td><td>PSYCHOLOGICAL HARM IS PHYSICAL INJURY</td></tr>
            <tr><td>4</td><td>Climbing the stairs of success</td><td>ACHIEVING POWER IS MOVING UPWARDS</td></tr>
            <tr><td>4</td><td>Fighting the battle against cancer</td><td>CANCER PATIENT IS PHYSICAL COMBATANT</td></tr>
            <tr><td>4</td><td>The burden of the virus is weighing heavily on the man's shoulders</td><td>DISEASES ARE BURDENS</td></tr>
            <tr><td>3</td><td>Digesting knowledge</td><td>ACQUIRING IDEAS IS EATING</td></tr>
            <tr><td>3</td><td>Battle of words</td><td>ARGUMENT IS WAR</td></tr>
            <tr><td>3</td><td>Walking down a road to recovery</td><td>CANCER IS A JOURNEY</td></tr>
            <tr><td>2</td><td>A financial heart attack</td><td>ADDRESSING ECONOMIC PROBLEMS IS TREATING AN ILLNESS</td></tr>
            <tr><td>2</td><td>Embracing the warmth of friendship</td><td>AFFECTION IS WARMTH</td></tr>
            <tr><td>2</td><td>Their love was as hot as the sun</td><td>PASSION IS HEAT</td></tr>
            <tr><td>1</td><td>Shaking hands over a book of contracts is like a marriage of business and legal agreements</td><td>AGREEMENT IS PHYSICAL PROXIMITY</td></tr>
            <tr><td>1</td><td>A family's journey through life, with the man as the guide and the woman and child as his companions</td><td>BEING IN A LOW SOCIAL CLASS IS BEING LOW ON A SCALE</td></tr>
            <tr><td>1</td><td>A political body is like a human body</td><td>GOVERNMENT IS A PERSON</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Now let us examine the remaining two criteria. In terms of familiarity, the average score is 2.95, and 67% of the expressions were considered as familiar by at least three workers. Only 22 expressions were considered as familiar by all five workers, for instance the expression A journey through life for PROGRESSING THROUGH LIFE IS MOVING ALONG A PATH. Additionally, 73 metaphors were familiar to four evaluators, while 106 expressions were familiar to three evaluators. On the other hand, 71 metaphors were familiar to only two workers, 24 were familiar to only one worker, and 4 were not familiar to any worker. In other words, out of 300 expressions, 99 can indeed be considered unfamiliar, as they were rated as familiar by two or fewer workers. These findings regarding familiarity indicate that the model generated not only familiar expressions but also novel or uncommon ones, suggesting that the model exhibits a certain degree of creativity in this task.
      </p>
      <p>
        Moving on to the appeal criterion, the average score is 3.32, and 78% of the generated expressions were liked by at least three workers. Among the expressions, 37 were liked by all five workers, e.g., Walking down a road to recovery for CANCER IS A JOURNEY. Furthermore, 98 expressions appealed to four workers, 99 to three workers, 57 to two workers and 9 to only one worker. These results indicate that the generated expressions were mostly appreciated.
      </p>
      <p>
        Let us now examine the distribution of the mean agreement scores for familiarity and appeal in relation to the agreement scores for metaphoricity. As illustrated in Fig. 2, the observed pattern seems to suggest that the mean familiarity and appeal scores exhibit contrasting trends across different metaphoricity scores. Interestingly, as the metaphoricity score increases, the familiarity score decreases while the appeal score increases. Metaphoricity scores 5 and 1 represent the extremes, with distinct differences in both familiarity and appeal. For the generated metaphorical expressions evaluated as such by all five workers, the mean familiarity score is 2.92 and the mean appeal score is 3.6; whereas for the expressions considered metaphorical by only one worker, the mean familiarity score is 3.67 and the mean appeal score is 3.0. With the exception of the expressions with metaphoricity score 2, which registered the lowest score (2.71) for both familiarity and appeal, the pattern seems to indicate that expressions with higher metaphoricity scores tend to have lower familiarity and higher appeal. This means that the evaluators found the literal generated expressions (metaphoricity scores 1 and 2) to be more familiar compared to the metaphorical ones. Hence, the results suggest that the model was able to create novel metaphorical expressions which may differ from the more conventional metaphors, with which the evaluators might have been more familiar. Despite being less familiar, the metaphorical expressions were preferred over the non-metaphorical ones. These findings show that the model exhibited a degree of creativity in metaphor generation, as it generated novel or unconventional metaphorical expressions which were appreciated by human evaluators.
      </p>
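      <p>
        The agreement-based analysis above can be reproduced with a few lines of pandas. This is a sketch; the vote data shown below is hypothetical, as the raw worker judgments are available in our repository:
      </p>
      <preformat>
# Hedged sketch: aggregating the five binary worker judgments per expression.
import pandas as pd

# Hypothetical raw votes (1 = yes), one list of five worker answers per question.
df = pd.DataFrame({
    "expression": ["Wounded environment", "A political body is like a human body"],
    "metaphor":  [[1, 1, 1, 1, 1], [1, 0, 0, 0, 0]],
    "familiar":  [[0, 1, 1, 0, 1], [1, 1, 1, 0, 1]],
    "appealing": [[1, 1, 1, 1, 0], [0, 1, 1, 0, 1]],
})
for col in ["metaphor", "familiar", "appealing"]:
    df[col + "_score"] = df[col].apply(sum)  # agreement score in 0-5

# An expression counts as a metaphor if at least three of five workers agreed.
df["is_metaphor"] = df["metaphor_score"] >= 3

# Mean familiarity/appeal per metaphoricity score (the trend shown in Fig. 2).
print(df.groupby("metaphor_score")[["familiar_score", "appealing_score"]].mean())
      </preformat>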
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This study aimed to explore an alternative approach for multimodal metaphor generation using the new LLaVA 1.5 model and Multimodal-CoT prompting. The results showed the model's ability to generate metaphorical expressions when provided with linguistic and visual inputs which lack inherent metaphorical qualities. Additionally, the evaluation revealed interesting patterns across the metaphoricity, familiarity and appeal scores of the generated expressions. The model exhibited creativity, as it generated novel or unconventional metaphorical expressions, which were also preferred over non-metaphorical ones. It is important to state again that this is an exploratory work with some limitations. One limitation to consider is the choice of the images used in the study: as they were manually selected from Google Image, their quality may influence the quality of the captions and metaphors generated by the model. Another limitation to consider is the subjectivity of the evaluation process: it is possible that Amazon MTurk workers may lack the necessary sensitivity and background knowledge to accurately recognize and evaluate metaphorical expressions, even though the instructions included background information about metaphor. Future works should aim to address these limitations by selecting more accurate images, as well as incorporating more diverse and expert annotators.</p>
      <p>Despite these limitations, the task shows promising results for future research in the field of metaphorical and visual reasoning.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <sec id="sec-5-1">
        <title>We acknowledge the support of the PNRR project FAIR Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.</title>
        <p>[10] E. Semino, Metaphor in Discourse, Metaphor in [24] E. Shutova, D. Kiela, J. Maillard, Black holes and
Discourse, Cambridge University Press, 2008. URL: white rabbits: Metaphor identification with visual
https://books.google.it/books?id=QT1uilVRDTYC. features, in: Proceedings of the 2016 conference of
[11] E. V. Shutova, Computational approaches to figu- the North American chapter of the association for
rative language, Technical Report, University of computational linguistics: Human language
techCambridge, Computer Laboratory, 2011. nologies, 2016, pp. 160–170.
[12] G. J. Steen, A. G. Dorst, J. B. Herrmann, A. A. Kaal, [25] C. Forceville, Pictorial metaphor in advertising,
T. Krennmayr, Metaphor in usage, Cognitive Lin- Routledge, 2002.</p>
        <p>guistics (2010). [26] D. Zhang, M. Zhang, H. Zhang, L. Yang, H. Lin,
[13] Y. Tsvetkov, L. Boytsov, A. Gershman, E. Nyberg, Multimet: A multimodal dataset for metaphor
unC. Dyer, Metaphor detection with cross-lingual derstanding, in: Proceedings of the 59th Annual
model transfer, in: Proceedings of the 52nd Annual Meeting of the Association for Computational
LinMeeting of the Association for Computational Lin- guistics and the 11th International Joint Conference
guistics (Volume 1: Long Papers), 2014, pp. 248–258. on Natural Language Processing (Volume 1: Long
[14] G. Gao, E. Choi, Y. Choi, L. Zettlemoyer, Neu- Papers), 2021, pp. 3214–3225.
ral metaphor detection in context, arXiv preprint [27] T. Chakrabarty, A. Saakyan, O. Winn,
arXiv:1808.09653 (2018). A. Panagopoulou, Y. Yang, M. Apidianaki,
[15] R. Mao, X. Li, M. Ge, E. Cambria, Metapro: A com- S. Muresan, I spy a metaphor: Large language
putational metaphor processing model for text pre- models and difusion models co-create visual
processing, Information Fusion 86 (2022) 30–43. metaphors, arXiv preprint arXiv:2305.14724 (2023).
[16] E. Shutova, Automatic metaphor interpretation as [28] G. Özbal, D. Pighin, C. Strapparava, et al., A proverb
a paraphrasing task, in: Human language tech- is worth a thousand words: learning to associate
nologies: the 2010 annual conference of the North images with proverbs, in: Proceedings of the 41st
American chapter of the association for computa- Annual Conference of the Cognitive Science Society
tional linguistics, 2010, pp. 1029–1037. (CogSci’19), Cognitive Science Society, 2019, pp.
[17] C. Su, S. Huang, Y. Chen, Automatic detection and 2515–2521.</p>
        <p>interpretation of nominal metaphor based on the [29] R. Yosef, Y. Bitton, D. Shahaf, Irfl: Image
recogtheory of meaning, Neurocomputing 219 (2017) nition of figurative language, arXiv preprint
300–311. arXiv:2303.15445 (2023).
[18] E. Liu, C. Cui, K. Zheng, G. Neubig, Testing the [30] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia,
ability of language models to interpret figurative E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought
language, arXiv preprint arXiv:2204.12632 (2022). prompting elicits reasoning in large language
mod[19] T. Veale, Round up the usual suspects: Knowledge- els, Advances in Neural Information Processing
based metaphor generation, in: Proceedings of the Systems 35 (2022) 24824–24837.</p>
        <p>Fourth Workshop on Metaphor in NLP, 2016, pp. [31] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G.
Neu34–41. big, Pre-train, prompt, and predict: A systematic
[20] Z. Yu, X. Wan, How to avoid sentences spelling survey of prompting methods in natural language
boring? towards a neural approach to unsupervised processing, ACM Computing Surveys 55 (2023)
metaphor generation, in: Proceedings of the 2019 1–35.</p>
        <p>Conference of the North American Chapter of the [32] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y.
IwaAssociation for Computational Linguistics: Human sawa, Large language models are zero-shot
reasonLanguage Technologies, Volume 1 (Long and Short ers, 2023. arXiv:2205.11916.</p>
        <p>Papers), 2019, pp. 861–871. [33] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction
[21] T. Chakrabarty, X. Zhang, S. Muresan, N. Peng, tuning, arXiv preprint arXiv:2304.08485 (2023).</p>
        <p>Mermaid: Metaphor generation with symbolism [34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh,
and discriminative decoding, arXiv preprint G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
arXiv:2103.06779 (2021). J. Clark, G. Krueger, I. Sutskever, Learning
transfer[22] M. M. Louwerse, Symbol interdependency in sym- able visual models from natural language
supervibolic and embodied cognition, Topics in Cognitive sion, 2021. arXiv:2103.00020.</p>
        <p>Science 3 (2011) 273–302. [35] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu,
[23] P. Turney, Y. Neuman, D. Assaf, Y. Cohen, Literal H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.
and metaphorical sense identification through con- Gonzalez, I. Stoica, E. P. Xing, Vicuna: An
opencrete and abstract context, in: Proceedings of the source chatbot impressing gpt-4 with 90%*
chat2011 Conference on Empirical Methods in Natural gpt quality, 2023. URL: https://lmsys.org/blog/
Language Processing, 2011, pp. 680–690. 2023-03-30-vicuna/.
[36] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal,</p>
        <p>S. Ma, T. Lv, L. Cui, O. K. Mohammed, Q. Liu,
et al., Language is not all you need: Aligning
perception with language models, arXiv preprint
arXiv:2302.14045 (2023).
[37] S. Banerjee, A. Lavie, Meteor: An automatic
metric for mt evaluation with improved correlation
with human judgments, in: Proceedings of the
acl workshop on intrinsic and extrinsic evaluation
measures for machine translation and/or
summarization, 2005, pp. 65–72.
[38] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a
method for automatic evaluation of machine
translation, in: Proceedings of the 40th annual meeting
of the Association for Computational Linguistics,
2002, pp. 311–318.
[39] C.-Y. Lin, Rouge: A package for automatic
evaluation of summaries, in: Text summarization
branches out, 2004, pp. 74–81.
[40] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,</p>
        <p>Bert: Pre-training of deep bidirectional
transformers for language understanding, arXiv preprint
arXiv:1810.04805 (2018).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lakoff</surname>
          </string-name>
          , M. Johnson, Metaphors We Live By, University of Chicago Press,
          <year>2008</year>
          . URL: https://books. google.it/books?id=r6nOYYtxzUoC.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L. W.</given-names>
            <surname>Barsalou</surname>
          </string-name>
          , Grounded cognition,
          <source>Annu. Rev. Psychol</source>
          .
          <volume>59</volume>
          (
          <year>2008</year>
          )
          <fpage>617</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Akula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Driscoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Narayana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Changpinyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Damle</surname>
          </string-name>
          , G. Pruthi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. T.</given-names>
            <surname>Freeman</surname>
          </string-name>
          , et al.,
          <article-title>Metaclue: Towards comprehensive visual metaphors research</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>23201</fpage>
          -
          <lpage>23211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shwartz</surname>
          </string-name>
          ,
          <article-title>Memecap: A dataset for captioning and interpreting memes</article-title>
          ,
          <source>arXiv preprint arXiv:2305.13703</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naseriparsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Met-meme: A multimodal meme dataset rich in metaphors</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2887</fpage>
          -
          <lpage>2899</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakrabarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shwartz</surname>
          </string-name>
          ,
          <article-title>It's not rocket science: Interpreting figurative language in narratives</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>589</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.03744.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>Multimodal chain-of-thought reasoning in language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.00923</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>Bertscore: Evaluating text generation with bert</article-title>
          ,
          <source>arXiv preprint arXiv:1904.09675</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] E. Semino, Metaphor in Discourse, Cambridge University Press, 2008. URL: https://books.google.it/books?id=QT1uilVRDTYC.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] E. V. Shutova, Computational approaches to figurative language, Technical Report, University of Cambridge, Computer Laboratory, 2011.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] G. J. Steen, A. G. Dorst, J. B. Herrmann, A. A. Kaal, T. Krennmayr, Metaphor in usage, Cognitive Linguistics (2010).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Tsvetkov, L. Boytsov, A. Gershman, E. Nyberg, C. Dyer, Metaphor detection with cross-lingual model transfer, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 248-258.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] G. Gao, E. Choi, Y. Choi, L. Zettlemoyer, Neural metaphor detection in context, arXiv preprint arXiv:1808.09653 (2018).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. Mao, X. Li, M. Ge, E. Cambria, Metapro: A computational metaphor processing model for text pre-processing, Information Fusion 86 (2022) 30-43.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] E. Shutova, Automatic metaphor interpretation as a paraphrasing task, in: Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics, 2010, pp. 1029-1037.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] C. Su, S. Huang, Y. Chen, Automatic detection and interpretation of nominal metaphor based on the theory of meaning, Neurocomputing 219 (2017) 300-311.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] E. Liu, C. Cui, K. Zheng, G. Neubig, Testing the ability of language models to interpret figurative language, arXiv preprint arXiv:2204.12632 (2022).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] T. Veale, Round up the usual suspects: Knowledge-based metaphor generation, in: Proceedings of the Fourth Workshop on Metaphor in NLP, 2016, pp. 34-41.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Z. Yu, X. Wan, How to avoid sentences spelling boring? towards a neural approach to unsupervised metaphor generation, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 861-871.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Chakrabarty, X. Zhang, S. Muresan, N. Peng, Mermaid: Metaphor generation with symbolism and discriminative decoding, arXiv preprint arXiv:2103.06779 (2021).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] M. M. Louwerse, Symbol interdependency in symbolic and embodied cognition, Topics in Cognitive Science 3 (2011) 273-302.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. Turney, Y. Neuman, D. Assaf, Y. Cohen, Literal and metaphorical sense identification through concrete and abstract context, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011, pp. 680-690.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] E. Shutova, D. Kiela, J. Maillard, Black holes and white rabbits: Metaphor identification with visual features, in: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, 2016, pp. 160-170.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C. Forceville, Pictorial metaphor in advertising, Routledge, 2002.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] D. Zhang, M. Zhang, H. Zhang, L. Yang, H. Lin, Multimet: A multimodal dataset for metaphor understanding, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 3214-3225.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] T. Chakrabarty, A. Saakyan, O. Winn, A. Panagopoulou, Y. Yang, M. Apidianaki, S. Muresan, I spy a metaphor: Large language models and diffusion models co-create visual metaphors, arXiv preprint arXiv:2305.14724 (2023).</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] G. Özbal, D. Pighin, C. Strapparava, et al., A proverb is worth a thousand words: learning to associate images with proverbs, in: Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci'19), Cognitive Science Society, 2019, pp. 2515-2521.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] R. Yosef, Y. Bitton, D. Shahaf, Irfl: Image recognition of figurative language, arXiv preprint arXiv:2303.15445 (2023).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824-24837.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1-35.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, 2023. arXiv:2205.11916.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, arXiv preprint arXiv:2304.08485 (2023).</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, Q. Liu, et al., Language is not all you need: Aligning perception with language models, arXiv preprint arXiv:2302.14045 (2023).</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65-72.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311-318.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74-81.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>