<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>When Figures Speak with Irony: Investigating the Role of Rhetorical Figures in Irony Generation with LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pier Felice Balestrucci</string-name>
          <email>pierfelice.balestrucci@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Oliverio</string-name>
          <email>michael.oliverio@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soda Marem Lo</string-name>
          <email>sodamarem.lo@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Anselma</string-name>
          <email>luca.anselma@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <email>valerio.basile@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Mazzei</string-name>
          <email>alessandro.mazzei@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <email>viviana.patti@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Turin</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Irony poses a persistent challenge for computational models because it depends on context, implicit meaning, and pragmatic cues. This study investigates the ability of Large Language Models (LLMs) to generate ironic content by focusing on rhetorical figures, pragmatic devices that may shape and signal ironic intent. Using two datasets, TWITTIRÒ-UD and the Italian subset of MultiPICo, we fine-tune multilingual LLMs for rhetorical figure classification and evaluate their capacity to generate ironic Italian texts. Our work addresses two main questions: (1) how accurately LLMs can classify rhetorical figures in ironic Italian texts, and (2) whether such training supports the generation of irony that reflects human-like rhetorical usage. Human evaluation shows that LLMs achieve fair agreement with annotators in rhetorical figure classification, indicating a partial but promising alignment with human judgment. By leveraging rhetorical figures as a bridge between irony detection and generation, our results suggest that such training improves the stylistic control and interpretability of LLM-generated ironic language.</p>
      </abstract>
      <kwd-group>
        <kwd>Rhetorical Figures</kwd>
        <kwd>Irony Generation</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>• RQ1: To what extent can LLMs accurately classify rhetorical figures in ironic Italian texts?</p>
      <p>• RQ2: Does fine-tuning LLMs on rhetorical figure classification lead to the generation of more human-like ironic replies, in terms of rhetorical devices?</p>
      <sec id="sec-1-1">
        <title>To address these questions, we fine-tune a set of mul</title>
        <p>tilingual open-weight LLMs on rhetorical figure
classification and assess their performance. We then enrich the
Italian subset of MultiPICo with automatic annotations
and conduct a human evaluation to validate a small
sample extracted from that corpus. Finally, we use the best- 3. Datasets
performing fine-tuned model to generate new replies to
ironic posts in MultiPICo and carry out a linguistic analy- TWITTIRÒ-UD A collection of ironic Italian tweets
sis of the model-generated replies, comparing them with annotated according to the Universal Dependencies
human-written ones. framework. TWITTIRÒ-UD was created by enriching</p>
        <p>
          This work contributes to (i) advancing the research a resource originally developed for the fine-grained
aninto rhetorical figure classification using LLMs, by prov- notation of irony [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The original corpus consists of
ing the efectiveness of Chain-of-Thought fine-tuning 1, 424 tweets, with a total of 28, 387 tokens [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Each
strategy; (ii) improving the interpretability of LLMs in tweet in the corpus has been annotated with the
correpragmatic text generation, showing that rhetorical figure- sponding rhetorical figure used to convey irony, such as
aware models tend to create sentences stylistically more OXYMORON PARADOX, HYPERBOLE, or EUPHEMISM. The
similar to human-written texts.1 treebank includes both the fine-grained annotation for
ironic tweets introduced in Karoui et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and the
morphological and syntactic information encoded in the UD
2. Related Works format.2 Figure 1 shows the distribution of rhetorical
ifgures in the corpus.
        </p>
        <p>
          MultiPICo The dataset consists of disaggregated
multilingual posts and replies from social media, each
annotated to indicate whether the reply is ironic given
the post. The corpus includes 18, 778 post–reply pairs,
collected from Reddit (8, 956) and Twitter (9, 822), and
covers 9 diferent languages. A total of 506 annotators,
with diferent sociodemographic information, carried out
the annotations, producing 94, 342 individual labels (an
average of 5.02 per conversation). Each annotation is
accompanied by sociodemographic metadata about the
annotator, including gender, age, ethnicity, student
status, and employment status. For the Italian subset of the
Rhetorical Figure Classification There are mainly
two approaches to the automatic detection and
classification of rhetorical figures in natural language:
ontologybased methods and machine learning techniques [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ].
        </p>
        <p>
          These approaches have shown efectiveness in
supporting tasks such as sentiment analysis and intent
classification [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]. Several studies focus on their relationship
with irony [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ], particularly in the context of irony
detection. In this vein, Karoui et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], drawing on
wellestablished linguistic theories that explore the interplay
between irony and rhetorical figures—such as oxymoron,
paradox, false assertion, and analogy—propose an
annotation schema for classifying these categories of irony in
social media texts. Their work focuses on French, English,
and Italian, highlighting the relevance of irony categories
and markers for a linguistically informed approach to
irony detection.
        </p>
        <p>
          Irony Generation Irony generation remains a
relatively underexplored area in Natural Language
Generation. especially when compared to the growing literature
on humor, puns, and sarcasm [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. Recent work has
begun to model sarcasm through linguistic features such
as valence reversal and contextual incongruity [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ],
yet irony is still rarely addressed directly.
        </p>
        <p>
          Among the more recent studies on irony generation,
Balestrucci et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] propose an approach that leverages
LLMs to generate ironic text. The authors demonstrate
1All code and experimental results are publicly available at: https: 2https://github.com/UniversalDependencies/UD_
//github.com/MichaelOliverio/IronyDetection. Italian-TWITTIRO
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Rhetorical Figure Classification</title>
      <p>
        corpus, 24 annotators provided 4, 790 annotations on
1, 000 post–reply pairs [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].3
      </p>
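      <p>For illustration, the corpus can be pulled from the Hugging Face Hub and restricted to the Italian subset; a minimal sketch in which the split and column names (e.g., language) are assumptions rather than the dataset's documented schema:</p>
      <preformat># Minimal sketch: load MultiPICo from the Hugging Face Hub and keep the
# annotations on Italian post-reply pairs. The split name and the
# "language" column are assumptions for illustration.
from datasets import load_dataset

multipico = load_dataset("Multilingual-Perspectivist-NLU/MultiPICo", split="train")

italian = multipico.filter(lambda row: row["language"] == "it")
print(len(italian))</preformat>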
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>In this section, we evaluate a set of LLMs for rhetorical
ifgure classification. We fine-tune several open-weight,
mid-sized LLMs using two diferent approaches on the
original TWITTIRÒ-UD split (see Table 2). To highlight
the impact of fine-tuning on rhetorical figure
classification, we compare the performance of the fine-tuned
models against two baselines: a random classifier and a
zeroshot prompting approach. Our experiments involve five
multilingual LLMs: Qwen2.5-7B-Instruct4 (referred to
as Qwen2.5-7B), Llama-3.1-8B-Instruct5 (Llama-3.1-8B),
Ministral-8B-Instruct-24106 (Ministral-8B),
LLaMAntino3-ANITA-8B-Inst-DPO-ITA7 (LLaMAntino-3-8B), and
Minerva-7B-instruct-v1.0 (Minerva-7B).8</p>
      <sec id="sec-3-1">
        <title>To assess the ability of LLMs to analyze ironic Italian texts</title>
        <p>
          and classify rhetorical figures, we adopted the annotation
scheme proposed by Karoui et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which defines a
set of rhetorical figures commonly used to convey irony
(summarized in Table 1).
        </p>
        <p>We selected several open-weight multilingual LLMs
trained on Italian data and fine-tuned them on the
TWITTIRÒ dataset for the task of rhetorical figure
classification. Models’ performances were evaluated against two
baselines: (i) a random classifier and (ii) a
promptingbased approach. The best-performing model was then
used to enrich the ironic Italian subset of the MultiPICo Table 2
dataset—aggregated by majority vote—with rhetorical Data split statistics for the TWITTIRÒ-UD dataset.
ifgure annotations. To validate the model’s predictions, Train Dev Test
we conducted a human evaluation on a small subset of #Tweets 1, 138 144 142
the annotated data. Avg. Tokens 20.77 20.80 20.96</p>
        <p>
          Finally, to address the second research question, we
focused on ironic post–reply pairs in Italian from
MultiPICo, again selected via majority vote, and compared Fine-tuning was performed using two diferent prompt
the distribution of rhetorical figures across three types of strategies, described below, both relying on Low-Rank
replies: (i) automatically generated by an LLM fine-tuned Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
to recognize rhetorical figures, (ii) replies generated by
the same model out-of-the-box, and (iii) written by hu- Instruction Fine-Tuning In this approach, which we
mans. In addition to comparing the distributions, we refer to as FT, we trained all the models (training
deconducted a linguistic analysis of these replies. A repre- tails are available in Appendix A), using the following
sentative sample of the generated content was manually instruction:
annotated to support this evaluation.
        </p>
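      <p>A minimal sketch of the majority-vote aggregation over disaggregated irony labels; the column names (pair_id, is_ironic) are hypothetical:</p>
      <preformat># Minimal sketch of the majority-vote aggregation over disaggregated
# irony labels. Column names ("pair_id", "is_ironic") are hypothetical.
import pandas as pd

annotations = pd.DataFrame({
    "pair_id":   [1, 1, 1, 2, 2, 2],
    "is_ironic": [1, 1, 0, 0, 0, 1],
})

# A pair is kept as ironic when more than half of its annotators labeled it so.
majority = annotations.groupby("pair_id")["is_ironic"].mean()
ironic_pairs = majority[majority > 0.5].index.tolist()
print(ironic_pairs)  # [1]</preformat>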
      </sec>
      <sec id="sec-3-2">
        <title>3https://huggingface.co/datasets/Multilingual-Perspectivist-NLU/</title>
        <p>MultiPICo</p>
      </sec>
      <sec id="sec-3-3">
        <title>Given the ironic sentence (INPUT),</title>
        <p>identify and return the rhetorical figure</p>
      </sec>
      <sec id="sec-3-4">
        <title>4https://huggingface.co/Qwen/Qwen2.5-7B-Instruct</title>
        <p>5https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
6https://huggingface.co/mistralai/Ministral-8B-Instruct-2410
7https://huggingface.co/swap-uniba/
LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
8https://huggingface.co/sapienzanlp/Minerva-7B-instruct-v1.0</p>
      <p>Table 3 reports the evaluation results. The baselines used are: (i) a random classifier (Random), which assigns one of the eight possible labels uniformly at random to each input, and (ii) a zero-shot prompting approach, for which we selected the best-performing model overall. Even the strongest model struggles with under-represented categories, especially EUPHEMISM, for which it made no correct predictions (0 out of 8). These results highlight a substantial margin for improvement in this task and suggest the need for further investigation into the model's behavior and the characteristics of under-represented or more challenging rhetorical categories.</p>
    </sec>
    <sec id="sec-3-2">
      <title>6. MultiPICo Enrichment</title>
      <p>This section focuses on enriching the Italian MultiPICo with annotations of rhetorical figures. To this end, we employ the best-performing rhetorical figure classification model (see Table 3), Ministral-8B with CoT-FT, to classify rhetorical figures in the Italian post-reply pairs. As mentioned in Section 3, MultiPICo consists of both ironic and non-ironic post-reply pairs. Therefore, we extract only the ironic pairs from the dataset, using a majority vote approach to determine whether a post-reply pair is ironic, given the disaggregated nature of MultiPICo, resulting in a subset of 278 ironic post-reply pairs.</p>
      <p>We then use our model to classify the rhetorical figures in this subset. As shown in Figure 3, the most frequently extracted rhetorical figures in the post–reply pairs are CONTEXT SHIFT (25.9%) and OXYMORON PARADOX (21.9%), while the least frequent are EUPHEMISM and HYPERBOLE (1.8% each). This distribution closely resembles that of TWITTIRÒ, and the high frequency of CONTEXT SHIFT may be attributed to the nature of post–reply interactions, where replies often reframe or shift the meaning of the corresponding posts. Given the difficulty in classifying some rhetorical figures, as highlighted in Table 3, we carry out a human evaluation in Section 6.1 to assess the quality of the model predictions.</p>
      <p>Figure 3: Distribution of rhetorical figures extracted from the Italian MultiPICo corpus.</p>
      <sec id="sec-3-2-1">
        <title>6.1. Human Evaluation</title>
        <p>Following the annotation guidelines in Karoui et al. [<xref ref-type="bibr" rid="ref4">4</xref>], two authors of this paper, both experts in computational linguistics, manually annotated a subset of 20 out of the 278 ironic post-reply pairs. The annotators were tasked to specify the rhetorical figures used to express irony in the reply given the corresponding post, selecting one or more labels from those reported in Table 1.</p>
        <p>The annotators achieved an average Cohen's κ score [<xref ref-type="bibr" rid="ref21">21</xref>] of 0.63 on this subset of 20 post–reply pairs, a value comparable to that reported by Karoui et al. [<xref ref-type="bibr" rid="ref4">4</xref>] for the same task (0.60), indicating substantial agreement. Krippendorff's α [<xref ref-type="bibr" rid="ref22">22</xref>] was also computed, yielding a score of 0.60, which confirms a similarly substantial level of inter-annotator reliability. We then compared the human annotations with the predictions produced by our automatic model. The resulting Krippendorff's α was 0.21, corresponding to a fair level of agreement (a computation sketch is given at the end of this section).</p>
        <p>To better understand this result, we examined the 14 out of 20 pairs where both annotators assigned the same label. In 3 of these cases, the model's prediction matched the human annotation exactly.</p>
        <p>For example, the post "Due si candidano in quanto 'ci vuole una donna' nel #Pd: #Schlein e #DeMicheli. Una sola domanda: perché?" ("Two women are running for office in the Democratic Party because 'we need a woman': Schlein and DeMicheli. One question: why?") received the reply "@USER Perché per un canguro è ancora presto." ("Because for a kangaroo it's still too early."), which was labeled as CONTEXT SHIFT by both annotators and the model. The label was assigned due to the sudden change in topic, introducing an unexpected element (the kangaroo) that breaks coherence and signals irony.</p>
        <p>In the remaining 11 cases, where the model's prediction did not match the human annotations, the model frequently labeled replies as OXYMORON PARADOX when annotators had chosen OTHER; this occurred in 6 out of the 11 pairs.</p>
        <p>Consider the following example: "Salvini ripropone il ponte sullo stretto di Messina, opera imprescindibile per lo sviluppo economico. Condivido e rilancio: contestualmente realizzerei anche il tunnel sottomarino Civitavecchia-Cagliari. Dai non facciamo come al solito la figura dei barboni, pensiamo in grande" ("Salvini reintroduces the Strait of Messina bridge proposal, a crucial infrastructure for economic development. I agree and raise: let's also build the Civitavecchia-Cagliari submarine tunnel. Let's not be our usual broke selves, let's think big!") with the reply: "Si può proporre il ponte Palermo-Cagliari già che ci siamo... una spesa unica... compri uno, paghi tre... no com'è la storia?" ("We might as well propose a Palermo-Cagliari bridge while we're at it... a single expense... buy one, pay three... or how does it go again?"). Here, the model likely interpreted the absurdity of the reply as a rhetorical figure of type OXYMORON PARADOX, whereas human annotators labeled it as a case of sarcasm, and thus as OTHER.</p>
        <p>An illustrative example of the remaining cases is the following: "Lo scrivo per tanti idioti che rispondono ai Twitter come le pecore. Sono un Sovranista, non sono vaccinato, non pagherò la multa e la mia Libertà non è in svendita." ("I write this for all the idiots who respond to tweets like sheep. I'm a sovereignist, I'm unvaccinated, I won't pay the fine, and my freedom is not for sale.") with the reply: "Lo scrivo per te... non bere più" ("I write this for you... stop drinking.").</p>
        <p>In this case, the model assigned the label ANALOGY, possibly misled by the introductory phrase in the post, failing to capture the sarcastic tone of the reply. This example suggests that prompt design could be improved to better guide the model's focus toward the reply and its pragmatic intent.</p>
        <p>This evaluation highlights the LLM's ability to produce overall reasonable outputs. Although its performance is not particularly high, it can still serve as a useful tool for silver annotation, thanks to the reasoning and explanations it provides.</p>
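        <p>The agreement coefficients reported above can be computed with standard implementations; a minimal sketch, assuming the sklearn and krippendorff packages, with placeholder label vectors:</p>
        <preformat># Minimal sketch: inter-annotator agreement on nominal rhetorical-figure
# labels. The label vectors are placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score
import krippendorff  # pip install krippendorff

ann1 = ["CONTEXT SHIFT", "OTHER", "HYPERBOLE", "ANALOGY"]
ann2 = ["CONTEXT SHIFT", "OXYMORON PARADOX", "HYPERBOLE", "ANALOGY"]

print(cohen_kappa_score(ann1, ann2))

# Krippendorff's alpha expects one row of codes per annotator.
labels = sorted(set(ann1) | set(ann2))
coded = [[labels.index(x) for x in ann1],
         [labels.index(x) for x in ann2]]
print(krippendorff.alpha(reliability_data=coded, level_of_measurement="nominal"))</preformat>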
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Irony Generation</title>
      <p>
        Inspired by previous work on irony generation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we
investigated whether a model trained to classify
rhetorical figures would also be capable of employing them
during generation—producing ironic outputs comparable
to those written by humans in terms of rhetorical
figures. To explore this hypothesis, we considered the 278
post–reply pairs selected in Section 6, using the posts as
input to the best-performing model for rhetorical figure
classification. The model was prompted to generate an
ironic reply for each post, which was then compared to
the original human-written reply. As a baseline, we used
the same model in its non–fine-tuned version, applying
the same prompting strategy. To illustrate this process,
we provide the following example:
      </p>
      <p>Instruction: Ti viene fornito in input
(INPUT) un post estratto da
conversazioni sui social media. Fornisci in
output (OUTPUT) una risposta ironica
in italiano. (You are given as input
(INPUT) a post extracted from social
media conversations. Provide as output
(OUTPUT) an ironic reply in Italian.)</p>
      <sec id="sec-4-1">
        <title>Input: Consigli su workout in casa in</title>
        <p>questo periodo di palestre chiuse? (Any
tips for home workouts during this period
of gym closures?)</p>
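      <p>A minimal sketch of this generation setup, assuming a transformers chat pipeline over the Ministral-8B checkpoint; the decoding parameters are illustrative, not the experimental configuration:</p>
      <preformat># Minimal sketch: prompt the model for an ironic Italian reply using the
# instruction and input above. Decoding settings are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="mistralai/Ministral-8B-Instruct-2410")

messages = [{
    "role": "user",
    "content": (
        "Ti viene fornito in input (INPUT) un post estratto da conversazioni "
        "sui social media. Fornisci in output (OUTPUT) una risposta ironica "
        "in italiano.\n"
        "INPUT: Consigli su workout in casa in questo periodo di palestre chiuse?"
    ),
}]
reply = generator(messages, max_new_tokens=80, do_sample=True, temperature=0.8)
print(reply[0]["generated_text"][-1]["content"])</preformat>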
      <p>Table 4 presents the distribution of rhetorical figures in the ironic replies generated by humans, the fine-tuned model, and the baseline model, all classified by Ministral-8B with CoT-FT. Overall, the differences across distributions are not substantial, but some trends are worth noting.</p>
      <preformat>Table 4
Distribution of rhetorical figures in human and model-generated ironic
replies (rep.) from MultiPICo. CoT-FT refers to the fine-tuned model;
Baseline to the non-fine-tuned version.
                      Human rep.      Model rep.
                                   CoT-FT   Baseline
ANALOGY                   45         53        42
HYPERBOLE                  0          8         1
EUPHEMISM                  5          9         6
RHETORICAL QUESTION       45         34        64
OXYMORON PARADOX          61         67        51
CONTEXT SHIFT             72         62        52
FALSE ASSERTION           32         35        34
OTHER                     18         10        28</preformat>
      <p>The fine-tuned model produces slightly more ANALOGY and EUPHEMISM compared to humans, which may reflect the influence of the TWITTIRÒ training data, where these categories are relatively well represented. Conversely, CONTEXT SHIFT appears underrepresented in the model outputs compared to human replies, which could be due to the complexity of capturing discourse-level phenomena.</p>
        <p>
          Interestingly, the baseline model shows a notable
increase in the use of RHETORICAL QUESTION and OTHER,
suggesting a more generic or less targeted use of
rhetorical strategies when the model is not fine-tuned. This may
indicate that zero-shot generation leads to a reliance on
broadly applicable or ambiguous rhetorical patterns, as
already seen in Balestrucci et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>To better understand these patterns and assess the
reliability of the automatic classification, we conducted
a human evaluation on a subset of 20 model-generated
replies from both systems.</p>
      <p>Specifically, the same two annotators from Section 6.1 independently labeled the rhetorical figures predicted by the models. Inter-annotator agreement was substantial, with a Cohen's κ of 0.68 and a Krippendorff's α of 0.65. In contrast, the Krippendorff's α between the annotators and the classifier was 0.26, confirming the previous results.</p>
      <sec id="sec-4-1">
        <title>7.1. Linguistic Analysis</title>
        <p>Following the approach proposed by Balestrucci et al. [<xref ref-type="bibr" rid="ref16">16</xref>], we also conducted a linguistic analysis focusing on specific stylistic markers, namely average token length, type-token ratio (TTR), and the use of interjections and negations, across human-written replies and model-generated outputs. Table 5 reports the results; a sketch of how these markers can be computed follows the table.</p>
        <preformat>Table 5
Linguistic analysis for human-written posts, human-written replies,
fine-tuned model generations (CoT-FT), and baseline generations (Baseline):
average number of tokens (Tokens), type/token ratio (TTR), and average
occurrences of interjections (Interjections) and negations (Negations).
                 Human              Model Replies
                 Post      Reply    CoT-FT    Baseline
Tokens           30.586    12.471   20.173    22.399
TTR              0.924     0.956    0.938     0.935
Interjections    0.594     0.273    0.381     0.507
Negations        0.050     0.072    0.410     0.982</preformat>
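        <p>A minimal sketch of how such markers can be computed, assuming regex word tokenization and small illustrative Italian word lists (the study's exact tokenization and lexicons are not specified here):</p>
        <preformat># Minimal sketch of the Table 5 markers: average tokens, per-text
# type/token ratio (TTR), and average interjection/negation counts.
# The tokenization and the two word lists are simplifying assumptions.
import re

NEGATIONS = {"non", "no", "mai", "niente", "nessuno", "nulla"}
INTERJECTIONS = {"ah", "eh", "oh", "boh", "mah", "dai", "beh"}

def markers(texts):
    tokenized = [re.findall(r"\w+", t.lower()) for t in texts]
    n = len(texts)
    return {
        "avg_tokens": sum(len(toks) for toks in tokenized) / n,
        "ttr": sum(len(set(toks)) / len(toks) for toks in tokenized) / n,
        "avg_interjections": sum(sum(w in INTERJECTIONS for w in toks)
                                 for toks in tokenized) / n,
        "avg_negations": sum(sum(w in NEGATIONS for w in toks)
                             for toks in tokenized) / n,
    }

print(markers(["Non ci credo, che sorpresa...", "Ma dai, davvero?"]))</preformat>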
      </sec>
    </sec>
    <sec id="sec-5">
      <title>8. Conclusions</title>
      <p>Our study explored the extent to which rhetorical figures can serve as a bridge between the detection and generation of ironic content in Italian. We showed that fine-tuning LLMs on rhetorical figure classification enables models to identify key linguistic devices involved in irony with reasonable accuracy. The best results were obtained using a CoT strategy, which guided models to provide explanations before predicting the rhetorical category. While the models performed well on frequently represented figures such as ANALOGY and RHETORICAL QUESTION, they struggled with more subtle or under-represented categories like EUPHEMISM, suggesting that further refinement and data augmentation may be needed.</p>
      <p>For the irony generation task, we observed that models fine-tuned on rhetorical figure classification produced ironic replies that more closely resembled human outputs in terms of rhetorical devices and stylistic markers. Although the overall distribution of rhetorical figures remained similar across models, the fine-tuned version demonstrated a more balanced use of devices, reducing the over-reliance on rhetorical questions and interjections observed in the baseline. This suggests that rhetorical figure awareness acquired through classification can positively influence generation, even in the absence of explicit training on ironic text generation.</p>
      <p>Manual evaluation confirmed the model’s ability to
generate plausible annotations and replies, albeit with
fair agreement compared to human annotators.
Nonetheless, the consistency and interpretability of its
outputs highlight its potential as a tool for silver
annotation—particularly valuable in low-resource settings.
Finally, our linguistic analysis showed that the fine-tuned
model better preserved lexical diversity and pragmatic
subtlety than its non-fine-tuned counterpart, indicating
that rhetorical figure classification fine-tuning may also
serve as a form of stylistic control. Taken together, these
findings point to the value of leveraging rhetorical figures
to enhance both the interpretability and expressiveness
of LLMs in pragmatic language generation.</p>
      <p>As future work, we plan to extend this study to other
languages, such as French and English, with the goal
of comparing the capacity of LLMs to classify
rhetorical figures and generate ironic content across different
linguistic contexts.</p>
      <p>Moreover, a key research direction we intend to
pursue concerns the perspectivist nature of the MultiPICo
dataset. In particular, we aim to explore whether
rhetorical figures function as shared cues in the perception of
irony across different sociodemographic groups, thereby
pointing to the existence of rhetorical devices that act as
universal markers of ironic intent.</p>
    </sec>
    <sec id="sec-6">
      <title>9. Limitations</title>
      <sec id="sec-6-1">
        <title>Despite the promising results, this work presents several</title>
        <p>limitations that call for further investigation.</p>
        <p>First, the rhetorical figure classification task was
trained and evaluated on a relatively small dataset
(TWITTIRÒ-UD), which may hinder the generalizability
of the models—particularly for under-represented
categories such as EUPHEMISM and HYPERBOLE. While
fine-tuning contributes to improved performance, the models
still struggle with these categories, likely due to data
sparsity and the intrinsic ambiguity of certain rhetorical
devices.</p>
        <p>Second, the human evaluation was conducted on a
relatively limited subset, which reduces the statistical
robustness of the agreement scores. Although the results align
with previous studies and provide qualitative insights
into model behavior, a larger annotation effort would
be needed to draw more conclusive findings—especially
when distinguishing between closely related rhetorical
categories. However, large-scale human annotation
remains time-consuming and costly.</p>
        <p>Finally, this study did not include a direct comparison
with models explicitly fine-tuned for irony generation.
Such a comparison would be necessary to better assess
the specific contribution of rhetorical figure classification
to the generation of ironic content, and to determine
whether the observed improvements are attributable to
rhetorical awareness or other factors.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Acknowledgments Michael Oliverio was partially</title>
        <p>funded by the ‘Multilingual Perspective-Aware NLU’
project in partnership with Amazon Alexa.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Grammar
and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content
as needed and take(s) full responsibility for the publication’s content.</p>
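      <p>For illustration, the configuration in Table 6 maps directly onto peft and transformers objects; a minimal sketch in which the base model identifier and output directory are assumptions:</p>
      <preformat># Minimal sketch: the Table 6 hyperparameters expressed as peft/transformers
# configuration. The base model id and output_dir are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Ministral-8B-Instruct-2410")
model = get_peft_model(model, LoraConfig(
    task_type="CAUSAL_LM", r=64, lora_alpha=16, lora_dropout=0.1))

args = TrainingArguments(
    output_dir="ft-rhetorical-figures",
    num_train_epochs=5,
    fp16=False,
    bf16=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    max_grad_norm=0.3,
    learning_rate=2e-4,
    weight_decay=0.001,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
)</preformat>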
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Muecke</surname>
          </string-name>
          , Irony and the Ironic, Methuen, London,
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sravanthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Doshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tankala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Murthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <article-title>Pub: A pragmatics understanding benchmark for assessing llms' pragmatics capabilities</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics ACL</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>12075</fpage>
          -
          <lpage>12097</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Karoui, F. Benamara, V. Moriceau, V. Patti, C. Bosco, N. Aussenac-Gilles, Exploring the impact of pragmatic phenomena on irony detection in tweets: A multilingual corpus study, in: M. Lapata, P. Blunsom, A. Koller (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 262-272. URL: https://aclanthology.org/E17-1025/.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Athanasiadou, H. L. Colston, The Diversity of Irony, volume 65, Walter de Gruyter GmbH &amp; Co KG, 2020.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mladenovic</surname>
          </string-name>
          ,
          <article-title>Ontology-based recognition of rhetorical figures, Infotheca</article-title>
          ,
          <source>Journal for Digital Humanities</source>
          <volume>16</volume>
          (
          <year>2016</year>
          )
          <fpage>24</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z. L.</given-names>
            <surname>Chia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ptaszynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Masui</surname>
          </string-name>
          , G. Leliwa,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wroczynski</surname>
          </string-name>
          ,
          <article-title>Machine learning and feature engineering-based study into sarcasm and irony classification with application to cyberbullying detection</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <fpage>102600</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] C. W. Strommer, Using rhetorical figures and shallow attributes as a metric of intent in text, 2011.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. Dubremetz, J. Nivre, Rhetorical figure detection: Chiasmus, epanaphora, epiphora, Frontiers in Digital Humanities 5 (2018). URL: https://www.frontiersin.org/journals/digital-humanities/articles/10.3389/fdigh.2018.00010. doi:10.3389/fdigh.2018.00010.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Neuhaus</surname>
          </string-name>
          ,
          <article-title>On the relation of irony, understatement, and litotes</article-title>
          ,
          <source>Pragmatics &amp; Cognition</source>
          <volume>23</volume>
          (
          <year>2016</year>
          )
          <fpage>117</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burgers</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Mulken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Schellens</surname>
          </string-name>
          ,
          <article-title>Type of evaluation and marking of irony: The role of perceived complexity and comprehension</article-title>
          ,
          <source>Journal of Pragmatics</source>
          <volume>44</volume>
          (
          <year>2012</year>
          )
          <fpage>231</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Zhu, Z. Yu, X. Wan, A neural approach to irony generation, ArXiv abs/1909.06200 (2019). URL: https://api.semanticscholar.org/CorpusID:202572954.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Tian, D. Sheth, N. Peng, A unified framework for pun generation with humor principles, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 3253-3261. URL: https://aclanthology.org/2022.findings-emnlp.237. doi:10.18653/v1/2022.findings-emnlp.237.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Q. Zeng, A.-R. Li, A survey in automatic irony processing: Linguistic, cognitive, and multi-X perspectives, in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 824-836. URL: https://aclanthology.org/2022.coling-1.69.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Mishra, T. Tater, K. Sankaranarayanan, A modular architecture for unsupervised sarcasm generation, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6144-6154. URL: https://aclanthology.org/D19-1636. doi:10.18653/v1/D19-1636.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. F. Balestrucci, S. Casola, S. M. Lo, V. Basile, A. Mazzei, I'm sure you're a real scholar yourself: Exploring ironic content generation by large language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 14480-14494. URL: https://aclanthology.org/2024.findings-emnlp.847/. doi:10.18653/v1/2024.findings-emnlp.847.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. T. Cignarella, C. Bosco, V. Patti, et al., TWITTIRÒ: a social media corpus with a multi-layered annotation for irony, in: CEUR Workshop Proceedings, volume 2006, CEUR, 2017, pp. 1-6.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Cignarella, C. Bosco, V. Patti, TWITTIRÒ: a Social Media Corpus with a Multi-layered Annotation for Irony, 2017, pp. 101-106. doi:10.4000/books.aaccademia.2382.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Casola, S. Frenda, S. M. Lo, E. Sezerer, A. Uva, V. Basile, C. Bosco, A. Pedrani, C. Rubagotti, V. Patti, D. Bernardi, MultiPICo: Multilingual perspectivist irony corpus, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 16008-16021. URL: https://aclanthology.org/2024.acl-long.849/. doi:10.18653/v1/2024.acl-long.849.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, 2021. URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37-46. URL: https://api.semanticscholar.org/CorpusID:15926286.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] K. Krippendorff, Computing Krippendorff's alpha-reliability, 2011.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>