Humour Classification by Fine-tuning LLMs: CYUT at CLEF 2024 JOKER Lab Subtask Humour Classification According to Genre and Technique
Notebook for the CYUT Lab at CLEF 2024
Shih-Hung Wu1,*,†, Yu-Feng Huang2,† and Tsz-Yeung Lau3,†
Chaoyang University of Technology, Taichung, Taiwan (R.O.C.)

Abstract
This paper reports on our participation in the CLEF 2024 JOKER lab subtask Humour Classification According to Genre and Technique. The system classifies short humorous texts into one of six classes: irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating, and wit-surprise. This year, the CYUT team submitted three runs based on three deep learning models. Run 1 is based on a fine-tuned Llama 3 model, run 2 is based on a fine-tuned RoBERTa model, and run 3 uses the GPT-4 API provided by OpenAI with zero-shot and CoT prompting. During the system development phase, our Llama 3 model achieved an accuracy of 89.68%; however, the official result is 69.78%.

Keywords
Deep Learning, Humour Classification, Large Language Models (LLMs), Llama 3, GPT-4

1. Introduction
The subtask Humour Classification According to Genre and Technique of the JOKER Track @ CLEF 2024 is a multiclass classification task [1]. The system automatically classifies each given sentence into one of the following classes: irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating, and wit-surprise. The organizers provide manually annotated training and test data from existing corpora, including the positive examples of the JOKER-2023 pun detection corpus as well as new data.
Humor is a complex and ambiguous emotional concept unique to natural language [2]. Humorous language cannot exist independently, as language gains meaning only when accompanied by context, situation, and cultural background [3]. Discourse analysis has the capability to interpret humor; language itself becomes the subject of humor [4]. Humor recognition is a challenging issue in natural language processing (NLP) for several reasons. Firstly, humor often stems from the use of figurative language, such as irony and sarcasm. Additionally, the sense of humor varies across different cultural and geographical groups. For instance, someone disinterested in political issues may find it difficult to understand political jokes. People with different background knowledge will react differently to the same joke. This variability makes it challenging for NLP researchers to detect humorous content [5].
Humor emotion analysis is an intriguing area of study as it reveals alternative ways of expressing human emotions. When people convey various emotions through their words and actions, it is often not straightforward but filled with humorous elements. This is where humor emotion analysis becomes valuable. Previous research has primarily focused on categorizing emotions as positive, negative, or neutral [6]. However, we now aim to delve deeper into the meanings behind humorous emotions in text. Such research not only helps us better understand the diversity of human emotional expression but also provides useful insights for the development of natural language processing and emotional intelligence.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
shwu@cyut.edu.tw (S. Wu); s11227615@gm.cyut.edu.tw (Y. Huang); s10927116@gm.cyut.edu.tw (T. Lau)
ORCID: 0000-0002-1769-0613 (S. Wu); 0009-0002-7904-2758 (T. Lau)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Thus, humor emotion analysis is not merely a study of textual emotions but an adventurous journey into the nature of human humor. This exploration will help us comprehensively understand the psychological mechanisms behind human speech and behavior, while also bringing more enjoyment and challenges to our technological advancements.
In this study, we employ RoBERTa, GPT-4, and Llama 3-8B for humor classification. Llama 3-8B performed the best, achieving an accuracy of 89.68%.

2. Related Work
2.1. Large language models
Large language models (LLMs), such as GPT-4 [7] and Llama 3 [8], have garnered attention due to their outstanding performance on various tasks. These models possess a vast number of parameters and can adapt to new tasks without additional training, a capability known as "in-context learning" [9]. Recently, the emergence of ChatGPT, built on GPT-3.5 [9] and further refined through reinforcement learning from human feedback, has drawn significant attention [10, 11].

2.2. Prompt Engineering
Prompt engineering plays a significant role in the fields of artificial intelligence and machine learning [12]. It acts as a communication bridge, especially when using large language models like GPT-3 or GPT-4. We perform fine-tuning [13] to achieve better results, aiming for more accurate and targeted outputs [14, 15]. This concept is crucial in natural language processing (NLP) as it directly impacts the model's performance and output quality. The basic idea of prompt engineering is to guide the model to provide the desired information or execute complex specific tasks through carefully designed prompts [16, 17]. Without clear instructions, the model might generate inaccurate or completely irrelevant responses. We can enhance the accuracy of prompts through several known practices, such as precise instructions, role assignment, giving examples (one-shot, few-shot) [9], iterative refinement, and Chain of Thought (CoT) [18].

3. Dataset
The training dataset for this study was provided by the JOKER organizer and consists of a total of 1,742 entries. The humorous content is categorized into six types: IR (irony) with 210 entries, SC (sarcasm) with 356 entries, EX (exaggeration) with 125 entries, AID (incongruity-absurdity) with 231 entries, SD (self-deprecating) with 169 entries, and WS (wit-surprise) with 651 entries. The distribution of the training data is shown in Figure 1. The test set comprises a total of 722 entries, as illustrated in Figure 2. The results were evaluated by the JOKER organizer.

Figure 1: Training set distribution
Figure 2: Test set distribution
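As a point of reference for later sections, the snippet below shows one way the class distribution in Figure 1 and the 80/20 train/test split used in Sections 5.2 and 5.4 could be reproduced. This is a minimal sketch rather than our actual preprocessing code; the file name and column names (joker_train.csv, text, class) are assumptions, and whether the split was stratified is not stated in this paper.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file/column names; the official JOKER release may differ.
df = pd.read_csv("joker_train.csv")          # columns assumed: "text", "class"

# Class distribution of the 1,742 training entries (cf. Figure 1).
print(df["class"].value_counts())            # labels: IR, SC, EX, AID, SD, WS

# 80/20 split used for fine-tuning RoBERTa and Llama 3-8B (Sections 5.2 and 5.4).
# Stratification is used here for illustration only.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["class"], random_state=42
)
print(len(train_df), len(test_df))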
4. Method
4.1. Deep Learning Models
4.1.1. RoBERTa
We utilize the enhanced BERT [19] model, RoBERTa [20], as our baseline. BERT, which stands for Bidirectional Encoder Representations from Transformers [19], was originally introduced by Google as an encoder-only Transformer [21]-based model for natural language processing (NLP) tasks. BERT is pre-trained using the Masked Language Model (MLM) and Next Sentence Prediction (NSP) techniques. Unlike word2vec [22] and GloVe [23], which do not consider context, BERT leverages contextual information during inference, leading to superior performance [19]. The RoBERTa paper noted that the BERT model was significantly undertrained [20]. To address this, its authors implemented several modifications: using larger batches, training the model for a longer duration, dropping the NSP objective, training on longer sequences, and dynamically changing the masking pattern applied to the training data [20]. With the RoBERTa baseline model, we achieve an accuracy of 72.49%.

4.1.2. GPT-4
In this study, we utilized the GPT-4 model with zero-shot prompting and Chain-of-Thought (CoT) prompting to assist with the task. GPT-4, developed by OpenAI, is an advanced natural language processing model built upon its predecessor, GPT-3, with a significantly increased parameter count. This enhancement facilitates a deeper understanding and generation of complex sentence structures, enabling more nuanced responses and better handling of contextual language features such as irony and humor. GPT-4 is trained using autoregressive language modeling on a diverse dataset, allowing it to perform exceptionally well across various NLP tasks like translation, summarization, and question answering [24]. The model operates by first converting input text into tokens, which are processed by transformer layers using attention mechanisms to evaluate relevance and context. These mechanisms generate intermediate representations of the data, which are then decoded into human-readable text. GPT-4 incorporates a randomness function influenced by temperature settings and top-k sampling, which dictate the randomness and determinism of the output, thus enhancing the model's ability to produce contextually appropriate content. This process represents a significant evolution in language model capabilities, setting new benchmarks in language understanding and generation [24].

4.1.3. Llama 3
Large Language Models (LLMs) are highly capable AI assistants that excel in complex reasoning tasks. They enable interaction with humans through intuitive chat interfaces, leading to rapid and widespread adoption among the general public [25]. Many different LLMs are publicly available, such as GPT-4 [7], Mistral 7B [26], Gemma 7B [27], and the LLM we utilize in this study, Llama 3. Llama 3 [8] is an open-source LLM based on the Transformer [21] architecture, developed by Meta. The Llama 3 model is available in configurations with 8 billion and 70 billion parameters. Llama 3 models have achieved state-of-the-art (SOTA) performance across a broad range of tasks due to extensive pre-training on over 15 trillion data tokens, making Llama 3 the best-performing open-source model at the time of writing. In this study, we fine-tuned the Llama 3-8B model on a single GPU, utilizing 4-bit quantization with QLoRA [28] to reduce GPU RAM usage during training with Unsloth [29]. As a result, the model achieved 89.68% accuracy.

5. System Development
5.1. Environment
In our experiment, we utilized an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The versions of all packages employed in the experiment are listed in Table 1.

Table 1
Package versions

Package        Version
Python         3.10.14
PyTorch        2.2.2
CUDA Toolkit   12.1
CUDA           8.6
Unsloth        2024.4

5.2. RoBERTa
To fine-tune the RoBERTa model, we use 80% of the dataset as the training set and 20% as the test set. The hyperparameters we used for fine-tuning are shown in Table 2.

Table 2
Hyperparameters for RoBERTa fine-tuning

Hyperparameter   Value
BERT Model       base
Epochs           5
Batch Size       4
Optimizer        Adam
Learning Rate    1e-5
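To make the setup concrete, the following is a minimal sketch of how a RoBERTa classifier could be fine-tuned with the hyperparameters in Table 2 using the Hugging Face transformers library. It is not our exact training script: it reuses the hypothetical train_df/test_df split sketched after Section 3, the maximum sequence length is an assumption, and Trainer's default AdamW optimizer stands in for the Adam setting in Table 2.

import torch
from torch.utils.data import Dataset
from transformers import (RobertaTokenizerFast, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

LABELS = ["IR", "SC", "EX", "AID", "SD", "WS"]
label2id = {l: i for i, l in enumerate(LABELS)}

class JokerDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):  # max_len is an assumption
        self.enc = tokenizer(list(texts), truncation=True,
                             padding="max_length", max_length=max_len)
        self.labels = [label2id[l] for l in labels]
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                         num_labels=len(LABELS))

train_ds = JokerDataset(train_df["text"], train_df["class"], tokenizer)
test_ds = JokerDataset(test_df["text"], test_df["class"], tokenizer)

args = TrainingArguments(output_dir="roberta-joker",
                         num_train_epochs=5,              # Table 2
                         per_device_train_batch_size=4,   # Table 2
                         learning_rate=1e-5)              # Table 2

Trainer(model=model, args=args,
        train_dataset=train_ds, eval_dataset=test_ds).train()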
5.3. GPT-4
To evaluate the GPT-4 model, we use the entire dataset for self-testing. We found that direct classification did not yield satisfactory results, so we grouped similar types into broader categories before conducting finer classifications. First, we grouped IR and SC into Category C. Then, we divided the remaining five types (C, SD, EX, AID, WS) into two categories: Category A (AID, WS) and Category B (C, SD, EX). We then performed a binary classification within Category A to distinguish between AID and WS. For Category B, we conducted a three-way classification to separate C, SD, and EX. Finally, we performed a binary classification within Category C to differentiate between IR and SC. This approach allowed us to consolidate the results for all six types. The flowchart in Figure 3 illustrates this classification process. The prompts we applied for each step are listed in Tables 13 and 14 in the appendix.

Figure 3: Process of GPT-4 classification with clustering

5.3.1. Prompt Design
First, we assign the model a specialized role to enhance its performance in handling complex tasks within a specific domain. Next, we utilize chain-of-thought (CoT) prompting to reduce model hallucinations and increase the probability of generating reasonable responses. We specify the task clearly, provide category names and definitions, and set output constraints. For example, we limit the output to no more than three tokens and restrict the model from producing responses outside the given requirements or repeating the question.

5.4. Llama 3
To fine-tune the Llama 3-8B model, we use 80% of the dataset as the training set and 20% as the test set. The hyperparameters we used for fine-tuning are shown in Table 4.

5.4.1. Prompt Design
To fine-tune Llama 3, we utilize the Stanford Alpaca format [30], shown in Table 3. For the instruction, we first tell the model what to do: "Classify the following text into one of the classes." Then, we provide the six classes for classification with explanations: irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating humor, and wit-surprise. We simply utilize the explanations provided in the official JOKER guideline document. Based on the results from RoBERTa and GPT-4, we discovered that the model struggled to accurately classify irony and sarcasm. Therefore, we added the sentence: "You ought to focus more on classifying irony and sarcasm." Finally, we applied Chain of Thought (CoT) prompting [18] by adding the sentence: "Let's think step by step." The text following "### Input:" denotes the text in need of classification, while "### Response:" is followed by one of the six classes: irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating humor, and wit-surprise. During evaluation, we employ the same prompting technique. The only difference is that we refrain from adding any text after "### Response:" to allow the model to generate the response. The prompt elements are shown in Table 5.

Table 3
Stanford Alpaca format (Llama 3-8B)

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}

Table 4
Hyperparameters for Llama 3 fine-tuning

Hyperparameter          Value
Parameters              8 billion
Epochs                  6
Gradient Accumulation   4
Optimizer               AdamW 8bit
Learning Rate           2e-4
QLoRA                   4-bit quantization
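For illustration, below is a minimal sketch of the kind of QLoRA fine-tuning pipeline described above, written with the Hugging Face transformers/peft/trl stack rather than the Unsloth wrappers we actually used (exact SFTTrainer arguments vary across trl versions). The checkpoint name, LoRA rank, per-device batch size, and sequence length are assumptions; the epochs, gradient accumulation, optimizer, and learning rate follow Table 4, and the prompt follows the Alpaca format in Table 3.

import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

ALPACA = ("Below is an instruction that describes a task, paired with an input that "
          "provides further context. Write a response that appropriately completes "
          "the request.\n\n### Instruction:\n{instruction}\n\n"
          "### Input:\n{input}\n\n### Response:\n{output}")

# Abridged; the full instruction text (class definitions etc.) is given in Table 5.
INSTRUCTION_TEXT = ("Classify the following text into one of the classes. ... "
                    "You ought to focus more on classifying irony and sarcasm. "
                    "Let's think step by step.")

def to_prompt(row):
    # Fill the Alpaca template with the humorous text and its gold class.
    return {"text": ALPACA.format(instruction=INSTRUCTION_TEXT,
                                  input=row["text"], output=row["class"])}

model_name = "meta-llama/Meta-Llama-3-8B"            # assumed checkpoint name
bnb = BitsAndBytesConfig(load_in_4bit=True,           # 4-bit quantization (Table 4)
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.0,   # rank is an assumption
                  task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

train_data = Dataset.from_pandas(train_df).map(to_prompt)

args = TrainingArguments(output_dir="llama3-joker",
                         num_train_epochs=6,                 # Table 4
                         gradient_accumulation_steps=4,      # Table 4
                         per_device_train_batch_size=2,      # assumption
                         learning_rate=2e-4,                 # Table 4
                         optim="adamw_bnb_8bit")             # 8-bit AdamW (Table 4)

SFTTrainer(model=model, args=args, train_dataset=train_data, peft_config=lora,
           dataset_text_field="text", tokenizer=tokenizer,
           max_seq_length=512).train()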
6. Experiment Result
6.1. Self-Test Result
Table 6 summarizes the self-test performance of each model. The RoBERTa model achieved an accuracy of 71.63%, 0.64 Macro Average Precision (MAP), 0.65 Macro Average Recall (MAR), and 0.64 Macro Average F1-Score (MA-F1), serving as the baseline. The GPT-4 model achieved an accuracy of 36.24%, 0.37 MAP, 0.34 MAR, and 0.34 MA-F1, a significant drop compared to the baseline model. The GPT-4 model with clustering achieved an accuracy of 38.23%, 0.39 MAP, 0.40 MAR, and 0.34 MA-F1, also a significant drop compared to the baseline model; with clustering, however, the model performs slightly better than vanilla GPT-4. The Llama 3-8B model achieved an accuracy of 89.68%, 0.89 MAP, 0.87 MAR, and 0.88 MA-F1, an increase of 18.05 percentage points in accuracy and of 0.25, 0.23, and 0.24 in MAP, MAR, and MA-F1, respectively, compared to the baseline model.

Table 5
Llama 3-8B prompt

Instruction:
Classify the following text into one of the classes. Here are the six types of classes:
Irony - Irony relies on a gap between the literal meaning and the intended meaning, creating a humorous twist or reversal.
Sarcasm - Sarcasm involves using irony to mock, criticize, or convey contempt.
Exaggeration - Exaggeration involves magnifying or overstating something beyond its normal or realistic proportions.
Incongruity-Absurdity - Incongruity refers to unexpected or contradictory elements that are combined in a humorous way, and Absurdity involves presenting situations, events, or ideas that are inherently illogical, irrational, or nonsensical.
Self-deprecating - Self-deprecating humor involves making fun of oneself or highlighting one's own flaws, weaknesses, or embarrassing situations in a lighthearted manner.
Wit-Surprise - Wit refers to clever, quick, and intelligent humor, and Surprise in humor involves introducing unexpected elements, twists, or punchlines that catch the audience off guard.
You ought to focus more on classifying irony and sarcasm.
Let's think step by step.
Input:
{ text in need of classification from the dataset }
Response:
{ one of the six classes from the dataset: irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating humor, and wit-surprise }
*During evaluation, the "Response" element is left empty.

Table 6
Model performance in self-testing

Run   Model                   Accuracy (%)   MAP    MAR    MA-F1
1     Llama 3-8B              89.68          0.89   0.87   0.88
2     GPT-4 with Clustering   38.23          0.39   0.40   0.34
3     RoBERTa                 71.63          0.64   0.65   0.64
-     GPT-4                   36.24          0.37   0.34   0.34

*MAP: Macro Average Precision; MAR: Macro Average Recall; MA-F1: Macro Average F1-Score
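The accuracy and macro-averaged scores reported here, as well as the per-class numbers in Table 7, follow the standard definitions. The snippet below sketches how they can be computed from gold and predicted labels with scikit-learn; it illustrates the metric definitions and is not the organizers' evaluation script.

from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

LABELS = ["IR", "SC", "EX", "AID", "SD", "WS"]

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    # Macro averaging gives every class equal weight, regardless of its frequency.
    map_, mar, maf1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=LABELS, average="macro", zero_division=0)
    print(f"Accuracy: {acc:.4f}  MAP: {map_:.2f}  MAR: {mar:.2f}  MA-F1: {maf1:.2f}")
    # Per-class precision/recall/F1, as reported in Table 7.
    print(classification_report(y_true, y_pred, labels=LABELS, zero_division=0))

# Example usage with dummy predictions:
evaluate(["IR", "SC", "WS"], ["IR", "IR", "WS"])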
6.2. Official Result
All of the models were evaluated by the JOKER organizer [31]. Table 10 presents the official results of each model. The RoBERTa model achieved an accuracy of 18.56%, 0.19 MAP, 0.24 MAR, and 0.21 MA-F1. The RoBERTa model showed a significant drop in accuracy compared to our self-test results due to a mistake in our code: the submitted model was fine-tuned on extra data containing IR and SC, leading to lower performance than expected. The GPT-4 model with clustering achieved an accuracy of 35.53%, 0.39 MAP, 0.40 MAR, and 0.34 MA-F1, similar to our self-testing results. The Llama 3-8B model used for evaluation is the same model fine-tuned on 80% of the dataset. It achieved an accuracy of 69.78%, 0.64 MAP, 0.65 MAR, and 0.64 MA-F1. The Llama 3-8B model exhibited a significant drop in accuracy compared to our self-test results, potentially due to differences in the data distribution between the training and test sets, as shown in Figure 4. However, as seen in Table 9, the model performed exceptionally well on the class AID, even with a small amount of training data; it appears that AID has distinctive features that the model can learn effectively. The model likely overfitted to the training set, impairing its performance on the test set. Balancing the data in the training set may help improve the model's robustness. From the official results, it is evident that the Mistral-7B model from team ORPAILLEUR performed the best overall in humor classification, achieving an accuracy of 76% [31].

Figure 4: Comparison of training set and test set distributions

Table 7
Precision, Recall, and F1-Score of each class and model (self-testing)

Model                   Class   Precision   Recall   F1-Score
Llama 3-8B              IR      0.86        0.84     0.85
                        SC      0.86        0.89     0.88
                        EX      0.85        0.85     0.85
                        AID     0.90        0.79     0.84
                        SD      0.96        0.92     0.94
                        WS      0.92        0.96     0.94
GPT-4 with Clustering   IR      0.27        0.13     0.18
                        SC      0.63        0.13     0.21
                        EX      0.18        0.64     0.28
                        AID     0.22        0.24     0.23
                        SD      0.50        0.74     0.60
                        WS      0.53        0.51     0.52
RoBERTa                 IR      0.47        0.33     0.38
                        SC      0.63        0.61     0.62
                        EX      0.52        0.52     0.52
                        AID     0.75        0.78     0.76
                        SD      0.60        0.76     0.68
                        WS      0.87        0.90     0.88

Table 8
Accuracy of GPT-4 with Clustering for each step

Step   Classes                               Accuracy (%)
2      A (AID, WS), B (C (IR, SC), SD, EX)   78.18
3      AID, WS                               50.34
3      C (IR, SC), SD, EX                    32.10
4      IR, SC                                41.95

Table 9
Precision, Recall, and F1-Score of each class for Llama 3-8B (official result)

Class   Precision      Recall         F1-Score
IR      0.63 (-0.23)   0.60 (-0.24)   0.62 (-0.23)
SC      0.67 (-0.19)   0.68 (-0.21)   0.67 (-0.21)
EX      0.52 (-0.33)   0.41 (-0.44)   0.46 (-0.39)
AID     0.86 (-0.04)   0.88 (+0.09)   0.87 (+0.03)
SD      0.70 (-0.26)   0.69 (-0.23)   0.70 (-0.24)
WS      0.44 (-0.48)   0.63 (-0.33)   0.52 (-0.42)

*Values in parentheses are the differences compared to self-testing.

Table 10
Model performance in the official result

Run   Model                   Accuracy (%)     MAP            MAR            MA-F1
1     Llama 3-8B              69.78 (-19.90)   0.64 (-0.25)   0.65 (-0.22)   0.64 (-0.24)
2     GPT-4 with Clustering   35.53 (-2.70)    0.39 (-0.00)   0.40 (-0.00)   0.34 (-0.00)
3     RoBERTa                 18.56 (-53.07)   0.19 (-0.45)   0.24 (-0.41)   0.21 (-0.43)

*MAP: Macro Average Precision; MAR: Macro Average Recall; MA-F1: Macro Average F1-Score
*Values in parentheses are the differences compared to self-testing.

7. Discussion & Error Analysis
7.1. Discussion
From the confusion matrices of RoBERTa and GPT-4 (Figure 7 and Figure 6), it is evident that these models struggled to distinguish between the categories AID and WS, as well as IR and SC. One reason for this difficulty is the existence of two distinct types of irony: verbal irony and situational irony. Verbal irony, often referred to as sarcasm, implies that IR includes SC [32]. Another reason is that identifying sarcasm in a sentence often requires contextual information [32]. Meanwhile, Llama 3-8B demonstrated significantly better performance in the areas where RoBERTa and GPT-4 exhibited weaknesses, as shown in Figure 8. GPT-4 with clustering shows a slight improvement compared to the vanilla GPT-4 model, as seen in Figures 5 and 6.
From Table 7, it is clear that fine-tuning LLMs is an effective method for humor classification. Llama 3 has significantly better performance compared to GPT-4, with substantial improvements in precision, recall, and F1-score for each class. Although non-tuned LLMs have great general performance, they might not excel in specialized tasks. Even when fine-tuning LLMs is not feasible, using smaller models like RoBERTa can still achieve acceptable performance.
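The confusion matrices in Figures 5-8 can be reproduced from the gold and predicted labels in a few lines; the sketch below uses scikit-learn and matplotlib and is an illustration rather than our plotting code.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

LABELS = ["IR", "SC", "EX", "AID", "SD", "WS"]

def plot_confusion(y_true, y_pred, title):
    # Rows are gold classes, columns are predicted classes.
    disp = ConfusionMatrixDisplay.from_predictions(y_true, y_pred, labels=LABELS)
    disp.ax_.set_title(title)
    plt.show()

# Example usage with dummy predictions:
plot_confusion(["IR", "SC", "WS"], ["SC", "SC", "WS"],
               "Confusion matrix (self-testing)")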
Figure 5: Confusion matrix of GPT-4 with Clustering, self-testing
Figure 6: Confusion matrix of GPT-4, self-testing
Figure 7: Confusion matrix of RoBERTa, self-testing
Figure 8: Confusion matrix of Llama 3-8B, self-testing

7.2. Error Analysis
7.2.1. Llama 3
The model may occasionally produce unexpected responses, which can be attributed to the pre-training data. For instance, if the input text is "When negotiating whether to share your french fries, you have quite a few bargaining chips.", the model might respond with "lunch". In self-testing, 12 samples produced unexpected outputs. Additional examples are provided in Table 11. This is a limitation of fine-tuning generative models such as LLMs. When employing the BERT model for classification, the [CLS] token is fed into a Multilayer Perceptron (MLP) [33]; such a model ensures the absence of unexpected outputs by maintaining a fixed output layer size and employing the softmax function [34] to determine the probability of each class.
We took an additional step and retried each of these error cases ten more times. Some of these errors can then be classified into one of the six classes. For instance, consider input text 1: "No longer a female as I refuse to wear heels ever again." Llama 3-8B gave an unexpected response, "twitter", but in 1 out of 10 trials it gave the response "sarcasm". The same phenomenon occurred with input text 2, "The leopard tried creeping up on the tigers using its camouflage but it was seen.", which received a "wit-surprise" response in 1 out of 10 trials. Additionally, input text 8, "Doppelherz. The power of the two hearts.", elicited a "wit-surprise" response in 8 out of 10 trials. These examples are shown in Table 12. Meanwhile, the responses for the other input texts remained unchanged.

Table 11
Samples with unexpected responses

Error Sample   Input text                                                                                      Response
1              No longer a female as I refuse to wear heels ever again.                                       twitter
2              The leopard tried creeping up on the tigers using its camouflage but it was seen.              leopard
3              Time is important to fullbacks. They are always rushing.                                       football
4              Your road to driving success.                                                                  infographic
5              When negotiating whether to share your french fries, you have quite a few bargaining chips.    lunch
6              Nothing can compare to picnicking on a French hillside and savoring the bries.                 food
7              Waiters are good at multiplication because they know their tables.                             table
8              Doppelherz. The power of the two hearts.                                                       wikipedia
9              Let yourself be transported with every bite. Introducing our new Destination Series.           campaign
10             Looking for a delicious way to stay cool now that it's heating up?                             ice
11             Most Manchester United fans will only drink tea because they have all the cups.                manchester
12             Where PR = Public Reactions.                                                                   hashtag

Table 12
Samples that received one of the six classes within ten trials

Error Sample   Input text                                                                                      Response
1              No longer a female as I refuse to wear heels ever again.                                       sarcasm
2              The leopard tried creeping up on the tigers using its camouflage but it was seen.              wit-surprise
8              Doppelherz. The power of the two hearts.                                                       wit-surprise
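A simple way to handle such out-of-label generations, in the spirit of the ten-trial retry described above, is to validate the generated string against the six class names and re-sample when it does not match. The sketch below is purely illustrative; generate_label is a hypothetical stand-in for the actual inference call (e.g., the fine-tuned Llama 3-8B with the Alpaca prompt) and is not part of our released code.

import random

CLASSES = ["irony", "sarcasm", "exaggeration", "incongruity-absurdity",
           "self-deprecating humor", "wit-surprise"]

def generate_label(text: str) -> str:
    """Placeholder for the model call that returns a free-form string."""
    return random.choice(CLASSES + ["twitter", "lunch"])  # dummy behaviour for illustration

def classify_with_retries(text: str, max_retries: int = 10) -> str | None:
    for _ in range(max_retries):
        answer = generate_label(text).strip().lower()
        if answer in CLASSES:
            return answer          # a valid class was produced
    return None                    # still unexpected after all retries

print(classify_with_retries("Doppelherz. The power of the two hearts."))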
8. Conclusion & Future Work
8.1. Conclusion
In this study, we conducted humor classification using deep learning models (RoBERTa), including LLMs such as Llama 3-8B and GPT-4. The best-performing model was Llama 3-8B, achieving an accuracy of 89.68% in self-testing and 69.78% in the official result through fine-tuning and prompt engineering. We also analyzed some unexpected responses from the LLMs to understand why they occurred. In summary, we found that fine-tuning LLMs can be very effective for humor classification. Additionally, we discovered that clustering similar classes allows LLMs to achieve better performance.

8.2. Future Work
For future work, we observe from the confusion matrices that EX and SD could be grouped into a single category, which may improve overall accuracy. Additionally, the LLM could first score the humor types present in a sentence and then classify based on a set threshold. Furthermore, the clustering method could be applied to Llama 3-8B, which might also result in better performance.

Acknowledgments
This study was supported by the National Science and Technology Council under grant number NSTC 113-2221-E-324-009.

References
[1] L. Ermakova, A.-G. Bosser, T. Miller, T. Thomas, V. M. P. Preciado, G. Sidorov, A. Jatowt, CLEF 2024 JOKER lab: Automatic humour analysis, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 36-43.
[2] Z. Li, J. Liu, Y. Wang, Performance analysis on deep learning models in humor detection task, in: 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE), IEEE, 2022. doi:10.1109/MLKE55170.2022.00023.
[3] P. Liang, Discourse analysis on humor, in: 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), 2011, pp. 5002-5005. doi:10.1109/AIMSEC.2011.6011180.
[4] P. Liang, Discourse analysis on humor, in: 2011 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), IEEE, 2011. doi:10.1109/AIMSEC.2011.6011180.
[5] Y. Guo, L. Kong, Classification and regression combined model on accessing humor score with explanatory feature, in: 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE), IEEE, 2022. doi:10.1109/MLKE55170.2022.00050.
[6] H. A. Sayyed, S. Rushikesh Sugave, S. Paygude, B. N. Jazdale, Study and analysis of emotion classification on textual data, in: 2021 6th International Conference on Communication and Electronics Systems (ICCES), 2021, pp. 1128-1132. doi:10.1109/ICCES51350.2021.9489204.
[7] OpenAI, GPT-4 technical report, 2024. arXiv:2303.08774.
[8] Meta, Introducing Meta Llama 3: The most capable openly available LLM to date, https://ai.meta.com/blog/meta-llama-3/, 2024. [Accessed 29-05-2024].
[9] OpenAI, Language models are few-shot learners, 2020. arXiv:2005.14165.
[10] S. Pitis, M. R. Zhang, A. Wang, J. Ba, Boosted prompt ensembles for large language models, 2023. arXiv:2304.05970.
[11] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, Y. Yang, Connecting large language models with evolutionary algorithms yields powerful prompt optimizers, 2024. arXiv:2309.08532.
[12] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt engineering in large language models: Techniques and applications, 2024. arXiv:2402.07927.
[13] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021. arXiv:2107.13586.
[14] Q. Ye, M. Axmed, R. Pryzant, F. Khani, Prompt engineering a prompt engineer, 2024. arXiv:2311.05661.
[15] S. Ekin, Prompt engineering for ChatGPT: A quick guide to techniques, tips, and best practices, 2023. doi:10.36227/techrxiv.22683919.
[16] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, 2023. arXiv:2302.11382.
[17] X. Amatriain, Prompt design and engineering: Introduction and advanced methods, 2024. arXiv:2401.14423.
[18] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, 2023. arXiv:2201.11903.
[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.
[22] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. arXiv:1301.3781.
[23] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: A. Moschitti, B. Pang, W. Daelemans (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532-1543. URL: https://aclanthology.org/D14-1162. doi:10.3115/v1/D14-1162.
[24] B. Chen, Z. Zhang, N. Langrené, S. Zhu, Unleashing the potential of prompt engineering: a comprehensive review, 2024. arXiv:2310.14735.
[25] H. Touvron et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[26] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825.
[27] Gemma Team, Gemma: Open models based on Gemini research and technology, 2024. arXiv:2403.08295.
[28] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, 2023. arXiv:2305.14314.
[29] D. Han, M. Han, H. H. Nguyen, Qubitium, Y. Belkada, Z, unslothai/unsloth, 2024. URL: https://github.com/unslothai/unsloth.
[30] R. Taori*, I. Gulrajani*, T. Zhang*, Y. Dubois*, X. Li*, C. Guestrin, P. Liang, T. B. Hashimoto, Alpaca: A strong, replicable instruction-following model, https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023. [Accessed 29-05-2024].
[31] L. Ermakova, A.-G. Bosser, T. Miller, V. M. P. Preciado, G. Sidorov, A. Jatowt, Overview of the CLEF 2024 JOKER track: Automatic humour analysis, 2024.
[32] E. Filatova, Irony and sarcasm: Corpus generation and analysis using crowdsourcing, in: N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 392-398. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/661_Paper.pdf.
[33] M.-C. Popescu, V. Balas, L. Perescu-Popescu, N. Mastorakis, Multilayer perceptron and neural networks, WSEAS Transactions on Circuits and Systems 8 (2009).
[34] J. Bridle, Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters, in: D. Touretzky (Ed.), Advances in Neural Information Processing Systems, volume 2, Morgan-Kaufmann, 1989. URL: https://proceedings.neurips.cc/paper_files/paper/1989/file/0336dcbab05b9d5ad24f4333c7658a0e-Paper.pdf.

A. Appendix
A.1. Our fine-tuned Llama 3-8B model for Humor Classification
The fine-tuned Llama 3-8B model is available on Hugging Face.
• Hugging Face

A.2. Prompts of GPT-4 with clustering for each step

Table 13
Prompts of GPT-4 with clustering for each step

Step 2 (Classes A, B):
As a Humor Master, your task is to identify the type of humor from the following two categories. Take it step by step. This is a multi-category classification task. The aim is to automatically classify text according to the following classes: A, B. There are two humor types. Here are the two types of humour:
A: These genres are primarily based on unexpected elements or clever twists for humorous effect.
B: Usually involves exaggerating or distorting reality, or achieving humorous effects by teasing oneself or others.
###Limit number of words: no more than 3 tokens###
###Please answer directly without restating the question###
###Instructions: For each question, respond using only one of the following abbreviations: A, B. Do not reply with answers other than A, B.###

Step 3 (Classes AID, WS):
As a Humor Master, your task is to identify the type of humor from the following two categories. Take it step by step. This is a multi-classification task. The aim is to automatically classify text according to the following classes: WS, AID. There are two humor types. Here are the two types of humour:
WS: Includes humor that uses intelligence and wit to elicit laughter through clever language or thought patterns. This type of humor may involve puns, quips, or logical deductions, allowing people to appreciate the author's intelligence and creativity.
AID: Includes humor that utilizes elements that defy common sense or logic, or combines unrelated things to create a sense of absurdity or incongruity. This type of humor often surprises and confuses people because it goes against our expectations.
###Limit number of words: no more than 3 tokens###
###Please answer directly without restating the question###
###Instructions: For each question, respond using only one of the following abbreviations: WS, AID. Do not reply with answers other than WS, AID.###

Table 14
Prompts of GPT-4 with clustering for each step (continued)

Step 3 (Classes C, SD, EX):
As a Humor Master, your task is to identify the type of humor from the following three categories. Take it step by step. This is a multi-classification task. The aim is to automatically classify text according to the following classes: IR, EX, SD. There are three humor types. Here are the three types of humour:
IR: Includes irony, which relies on the gap between literal meaning and actual intent to create humor, and sarcasm, which is used specifically to mock, criticize, or express contempt.
SD: Covers self-deprecating humor that amuses audiences by highlighting personal flaws, weaknesses, or embarrassing situations in a light-hearted way.
EX: Involves exaggerating something, exaggerating certain features beyond normal or realistic proportions to create a humorous effect.
###Limit number of words: no more than 3 tokens###
###Please answer directly without restating the question###
###Instructions: For each question, respond using only one of the following abbreviations: IR, SD, EX. Do not reply with answers other than IR, SD, EX.###

Step 4 (Classes IR, SC):
As a Humor Master, your task is to identify the type of humor from the following two categories. Take it step by step. This is a multi-classification task. The aim is to automatically classify text according to the following classes: IR, SC. There are two humor types. Here are the two types of humour:
IR: Focuses on exploiting the discrepancy between literal meaning and actual intent to create humor, often by reversing or twisting expectations.
SC: Focuses on the use of irony to ridicule, criticize, or express contempt, often with a certain sharpness or criticalness.
###Limit number of words: no more than 3 tokens###
###Please answer directly without restating the question###
###Instructions: For each question, respond using only one of the following abbreviations: IR, SC. Do not reply with answers other than IR, SC.###
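For completeness, the step-wise prompts above can be chained programmatically. The following is a minimal sketch of such a pipeline using the OpenAI Python client; the model name, the routing logic, and the helper and dictionary names are illustrative assumptions rather than our submitted implementation.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(system_prompt: str, text: str) -> str:
    # One clustering step: the system prompt comes from Table 13 or 14,
    # and the humorous text is sent as the user message.
    resp = client.chat.completions.create(
        model="gpt-4",                      # model name is an assumption
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": text}],
        max_tokens=3, temperature=0)
    return resp.choices[0].message.content.strip().upper()

def classify(text: str, prompts: dict) -> str:
    # prompts maps step names ("AB", "AID_WS", "C_SD_EX", "IR_SC")
    # to the corresponding prompt texts from Tables 13 and 14.
    if ask(prompts["AB"], text) == "A":
        return ask(prompts["AID_WS"], text)      # step 3: AID vs WS
    label = ask(prompts["C_SD_EX"], text)        # step 3: C (IR, SC) vs SD vs EX
    if label in {"C", "IR"}:
        return ask(prompts["IR_SC"], text)       # step 4: IR vs SC
    return label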