<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">ORPAILLEUR &amp; SyNaLP at CLEF 2024 Task 2: Good Old Cross Validation for Large Language Models Yields the Best Humorous Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Pierre</forename><surname>Epron</surname></persName>
							<email>pierre.epron@loria.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">LORIA</orgName>
								<orgName type="institution" key="instit1">Université de Lorraine</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gaël</forename><surname>Guibon</surname></persName>
							<email>gael.guibon@loria.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">LORIA</orgName>
								<orgName type="institution" key="instit1">Université de Lorraine</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="laboratory">LIPN</orgName>
								<orgName type="institution">Université Sorbonne Paris Nord</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Miguel</forename><surname>Couceiro</surname></persName>
							<email>miguel.couceiro@loria.fr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">LORIA</orgName>
								<orgName type="institution" key="instit1">Université de Lorraine</orgName>
								<orgName type="institution" key="instit2">CNRS</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">INESC-ID</orgName>
								<orgName type="institution" key="instit1">IST</orgName>
								<orgName type="institution" key="instit2">Universidade de Lisboa</orgName>
								<address>
									<country key="PT">Portugal</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<address>
									<settlement>Grenoble</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">ORPAILLEUR &amp; SyNaLP at CLEF 2024 Task 2: Good Old Cross Validation for Large Language Models Yields the Best Humorous Detection</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">13BDE1E997417244AC9C392FDAD12BBA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Humor genre classification</term>
					<term>Large language models</term>
					<term>Text classification</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the context of the JOKER 2024 Task 2 Challenge, this paper presents an approach that leverages the latent representations derived from different Large Language Models (LLMs) to drive a classification mechanism. Our methodology involves exploiting the "knowledge" encoded in LLMs to effectively discriminate humor genres. Experimental results are promising and demonstrate the effectiveness of our approach. However, inherent complexities remain, such as the proximity between certain classes and biases arising from the dataset distributions. These complexities warrant further investigation to refine the classification process and improve overall performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In this paper, we propose to address the JOKER 2024 task 2, which concerns the classification of humor categories, with an approach that utilizes the hidden representations generated by different Large Language Models (LLMs) as input to a classification head. Our hypothesis is that the sophisticated representation capabilities of LLMs can facilitate more efficient and accurate classification of complex semantic categories, such as those associated with humor. Humor, with its various genres and subtle nuances, presents a unique challenge for Natural Language Processing (NLP) systems. By leveraging the deep, contextual embeddings produced by LLMs, we aim to improve the classification performance for humor genres. To investigate this, we address several detailed research questions:</p><p>• RQ1: Does the "knowledge of the world" embedded in LLMs facilitate more accurate classification of humor genres? We hypothesize that the extensive pretraining on a diverse range of texts enables LLMs to grasp subtle nuances of humor, potentially leading to more precise classification outcomes. • RQ2: Are the classification results consistent across different LLMs or do we actually see an improvement with the recency of the model? To this end, we compare the performance of Llama2, Llama3, and Mistral to ascertain whether newer models, such as Llama3, outperform their predecessors. • RQ3: How does the use of a QLoRA adapter on hidden layers affect classification results?</p><p>QLoRA adapters are capable of modifying the representations within the hidden layers, and this study aims to verify whether this adaptation enhances the classification accuracy.</p><p>• RQ4: Which hidden layer depth of LLMs yields the best classification results? 
We investigate whether certain layers of the network provide more relevant features for classification and if this optimal depth is consistent across different LLMs and humor categories.</p><p>By addressing these questions, our study aims to deepen the understanding of how LLMs can be effectively utilized for complex classification tasks and to identify the factors that contribute to their performance. This research addresses the applicability of LLMs to the complex challenge of humor classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Related Work</head><p>The necessity for automatic humor detection arises from the increasing influence of conversational agents and the omnipresence of social media platforms. In the digital realm, where interactions are increasingly mediated by algorithms, discerning humor has become crucial. This imperative extends to various applications, including chatbots, recommender systems, social media reputation management, and the critical task of identifying and combating fake news and hate speech <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. Early efforts in humor detection primarily focused on the intricate dynamics of wordplay. The seminal evaluation campaign explored tasks including pun detection, pun location, and pun interpretation <ref type="bibr" target="#b2">[3]</ref>. However, a significant challenge in this domain has been the scarcity of appropriate training data, particularly evident for languages beyond English. Recent advances in automatic humor detection have been driven by the development of contextualized embeddings, which have facilitated a broader recognition of humor across diverse contexts <ref type="bibr" target="#b3">[4]</ref>. Moreover, the development of multilingual models, which leverage the pre-trained BERT architecture <ref type="bibr" target="#b4">[5]</ref>, has expanded the scope of humor recognition to languages such as Chinese, Russian, and Spanish <ref type="bibr" target="#b5">[6]</ref>. Additionally, there has been a notable shift towards addressing domain-specific tasks, exemplified by endeavors to identify humorous queries within Q&amp;A systems <ref type="bibr" target="#b6">[7]</ref>. 
While the field of irony and sarcasm detection has received considerable attention <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref>, the area of automatic humor detection remains a vital and evolving area of research with implications that extend to various facets of human-computer interaction and online discourse <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>.</p><p>The landscape of LLM development is undergoing a period of rapid evolution, with the introduction of new models such as LLaMA2, Mistral, and the GPT family models, including GPT-3 and GPT-4. Touvron et al. <ref type="bibr" target="#b12">[13]</ref> introduced LLaMA2, which builds upon its predecessor by enhancing the model architecture and training methodologies, thereby achieving improved performance across a range of natural language processing tasks. Jiang et al. <ref type="bibr" target="#b13">[14]</ref> presented Mistral, a model known for its efficiency and effectiveness, particularly in low-resource settings, demonstrating impressive capabilities in several benchmarks. Concurrently, the GPT family of models, developed by OpenAI, has made significant contributions to the field. GPT-3, introduced by Brown et al. <ref type="bibr" target="#b14">[15]</ref>, set new standards with its 175 billion parameters, enabling unprecedented performance in generating human-like text and performing complex language tasks with minimal prompt engineering. Building on this, GPT-4 <ref type="bibr" target="#b15">[16]</ref> and Llama3 <ref type="bibr" target="#b16">[17]</ref>, the latter detailed in its model card, further enhanced these capabilities by incorporating more sophisticated training techniques and a larger training corpus, resulting in superior performance across a wider range of applications. 
These models have played a pivotal role in advancing the state of the art in natural language understanding and generation, solidifying their position as indispensable tools in the Natural Language Processing community <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20]</ref>.</p><p>Zero-shot and few-shot learning methods enable LLMs to perform tasks with minimal task-specific training data. Some studies <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22]</ref> introduce the concept of using LLMs for zero-shot learning, demonstrating that models can generalize from pre-trained knowledge to new tasks without explicit training examples. This further enhances their versatility and application scope. Probing and feature-based fine-tuning involve using LLMs as classifiers by extracting and utilizing internal representations for specific tasks. A study <ref type="bibr" target="#b22">[23]</ref> presented a method where prompts are augmented to probe LLMs for specific linguistic features, effectively turning them into classifiers for various natural language processing tasks. This technique demonstrates the adaptability of LLMs in understanding and categorizing complex linguistic patterns. LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) are techniques designed to enhance the efficiency of fine-tuning LLMs by reducing the number of trainable parameters. Hu et al. <ref type="bibr" target="#b23">[24]</ref> introduced LoRA, which injects trainable rank-decomposition matrices into the layers of the transformer, significantly reducing the computational cost of fine-tuning. Dettmers et al. <ref type="bibr" target="#b24">[25]</ref> subsequently optimized this approach with QLoRA, which incorporates quantization to further enhance efficiency while maintaining model performance.</p></div>
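The rank-decomposition idea behind LoRA and QLoRA can be illustrated with a small numerical sketch; the matrix sizes below are illustrative stand-ins, not the actual dimensions or trained weights of the models used in this paper:

```python
import numpy as np

d, r = 512, 16             # hidden size and LoRA rank (illustrative values)
alpha = 64                 # scaling factor, as in the rank-16 / alpha-64 setup

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))         # frozen pretrained weight matrix
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection (zero-initialized)

# Effective weight during fine-tuning: only A and B receive gradients.
W_eff = W + (alpha / r) * B @ A

# Trainable parameters drop from d*d to 2*d*r.
full_params, lora_params = d * d, 2 * d * r
print(full_params, lora_params)  # here LoRA trains only 6.25% of the matrix
```

Because B starts at zero, the adapted weight equals the pretrained weight before training, so fine-tuning begins from the original model's behavior.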
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Dataset</head><p>The dataset comes from JOKER 2024 shared task 2 <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b26">27]</ref>. It consists of 1,742 humorous texts labelled with 6 different categories.</p><p>• IR -Irony relies on a gap between the literal meaning and the intended meaning, creating a humorous twist or reversal. • SC -Sarcasm involves using irony to mock, criticize, or convey contempt.</p><p>• EX -Exaggeration involves magnifying or overstating something beyond its normal or realistic proportions. • AID -Incongruity refers to the unexpected or contradictory elements that are combined in a humorous way, and Absurdity involves presenting situations, events, or ideas that are inherently illogical, irrational, or nonsensical. • SD -Self-deprecating humour involves making fun of oneself or highlighting one's own flaws, weaknesses, or embarrassing situations in a lighthearted manner. • WS -Wit refers to clever, quick, and intelligent humour, and Surprise in humour involves introducing unexpected elements, twists, or punchlines that catch the audience off guard.</p><p>The primary challenge of the dataset is the imbalance in the number of examples for specific classes. For instance, the WS (Wit and Surprise) class contains 650 examples, while the EX (Exaggeration) class contains only 122 examples. The complete class distribution is presented in Table <ref type="table">1</ref>. Another significant challenge is the proximity of certain classes, such as irony and sarcasm. Some definitions of irony include sarcasm as a form of irony <ref type="bibr" target="#b27">[28]</ref>. For this corpus, the definition of irony aligns with the prevailing understanding of situational irony.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>The distribution of classes after removing duplicates and texts that are too long. "Count / 5" refers to the support of each class for 1 split since we use 5 splits for cross-validation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Class Count</head><p>The majority of texts have a length between 20 and 40 tokens (Figure <ref type="figure" target="#fig_0">1</ref>). However, some texts are very long, exceeding 500 tokens. Due to the limitations of GPU memory, we excluded examples with more than 170 tokens from the training corpus. Additionally, there are instances of duplication in the examples of the train set. Some of these instances are quasi-duplicates, such as the same example appearing with and without quotation marks. We removed the exact duplicates, but we could not reliably identify or remove the quasi-duplicates. At the conclusion of the cleaning process, we retained 1,704 examples of the original 1,742.</p></div>
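The cleaning step described above can be sketched as follows; the 170-token cutoff comes from the paper, while the whitespace tokenization and the `clean_corpus` helper are illustrative simplifications (the actual token counts come from the LLM tokenizers):

```python
def clean_corpus(examples, max_tokens=170):
    """Drop exact duplicates and over-long texts.

    `examples` is a list of (text, label) pairs; tokens are approximated
    by whitespace splitting for illustration only.
    """
    seen, kept = set(), []
    for text, label in examples:
        if text in seen:
            continue                       # exact duplicate: drop
        if len(text.split()) > max_tokens:
            continue                       # too long for GPU memory: drop
        seen.add(text)
        kept.append((text, label))
    return kept

corpus = [
    ("What do you call a fish wearing a crown? King Cod!", "WS"),
    ("What do you call a fish wearing a crown? King Cod!", "WS"),  # duplicate
    ("word " * 200, "EX"),                                         # over-long
]
print(clean_corpus(corpus))  # keeps only the first example
```

Note that quasi-duplicates (e.g. the same text with added quotation marks) pass the exact-match test, consistent with the paper's observation that they could not be removed.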
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>Our objective is to explore the potential of advanced LLMs within a consistent methodological framework. We employed a 4-bit quantized version of three distinct LLMs: Llama2-7b<ref type="foot" target="#foot_0">1</ref> , Mistral-7b<ref type="foot" target="#foot_1">2</ref> , and Llama3-8b<ref type="foot" target="#foot_2">3</ref> . These models were selected for their varying characteristics, allowing for a comparative analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">LLMs</head><p>The distinctions between the Mistral and Llama2 models are primarily due to several advanced optimization techniques employed by Mistral.</p><p>• Sliding Window Attention improves the efficiency of attention mechanisms by focusing on a moving window of tokens, rather than considering all tokens at once. This reduces the computational complexity and enhances the model's ability to handle longer sequences effectively. • Rolling Buffer Cache maintains a cache of recently processed data, enabling faster retrieval and processing of these data chunks, thus improving overall model performance. • Pre-fill and Chunking involve pre-processing and breaking down input data into manageable chunks, which can be processed more efficiently by the model, leading to better performance in terms of both speed and accuracy.</p><p>A significant difference between Llama2 and Llama3 lies in their tokenization strategies and vocabulary sizes.</p><p>• Llama2 and Mistral: Both models utilize the same tokenizer, which is based on Byte-Pair Encoding (BPE) and implemented using the sentencepiece approach. They share a vocabulary size of 32,000 tokens. Sentencepiece is a data-driven method that segments text into subword units, ensuring a balance between word-level and character-level tokenization. • Llama3: This model employs a new tokenizer with a vastly increased vocabulary size of 128,256 tokens. While it also employs BPE, it uses the tiktoken approach developed by OpenAI for their GPT models. The key distinction between tiktoken and other tokenizers is its capacity to bypass the BPE algorithm when a token already exists in the vocabulary, potentially enhancing efficiency and tokenization speed.</p><p>The difference in vocabulary size cannot be explained by the tokenizer implementation alone. 
The larger vocabulary in Llama3 likely reflects differences in the scope and variety of training data used, although specific details about these datasets are not publicly available.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Classification</head><p>The primary approach involved the addition of a Feed Forward (FF) layer on the final token of the LLM representation, situated atop the last hidden state. This method sought to leverage the LLMs' capacity to generate rich contextual embeddings for downstream tasks. The Causal Language Model (CLM) is extended in the following manner.</p><p>Consider a sequence of tokens 𝑋 = (𝑥 1 , 𝑥 2 , . . . , 𝑥 𝑇 ), where 𝑥 𝑡 represents the token at position 𝑡 in the sequence. The CLM models the probability of the next token given the previous tokens:</p><formula xml:id="formula_0">𝑃 (𝑋) = 𝑇 ∏︁ 𝑡=1 𝑃 (𝑥 𝑡 |𝑥 1 , 𝑥 2 , . . . , 𝑥 𝑡−1 ).</formula><p>The CLM architecture we use in this paper can be streamlined as follows:</p><p>• Input Embedding Layer: Converts tokens into dense vector representations:</p><formula xml:id="formula_1">𝐸(𝑥 𝑡 ) = 𝐸𝑚𝑏𝑒𝑑𝑑𝑖𝑛𝑔(𝑥 𝑡 ).</formula><p>• Positional Encoding: Adds positional information to the token embedding to retain the order of tokens: 𝑃 𝐸(𝑡) = 𝑃 𝑜𝑠𝑖𝑡𝑖𝑜𝑛𝑎𝑙𝐸𝑛𝑐𝑜𝑑𝑖𝑛𝑔(𝑡).</p><p>The input to the model at position 𝑡 is:</p><formula xml:id="formula_2">𝐻 0 𝑡 = 𝐸(𝑥 𝑡 ) + 𝑃 𝐸(𝑡).</formula><p>• Attention Mechanism: Utilizes masked self-attention to ensure that the prediction for position 𝑡 only depends on positions 1 to 𝑡 − 1. The masked attention weights 𝛼 𝑖𝑗 are computed by:</p><formula xml:id="formula_3">𝛼 𝑖𝑗 = exp 𝑒 𝑖𝑗 / ∑︁ 𝑖 𝑘=1 exp 𝑒 𝑖𝑘 ,</formula><p>where 𝑒 𝑖𝑗 is the compatibility function (e.g., dot product of queries and keys) and the mask ensures that 𝑗 ≤ 𝑖. • Feed-Forward Layer: Applied after the attention mechanism to introduce non-linearity:</p><formula xml:id="formula_4">𝐻 𝑙 = 𝐹 𝐹 (𝐴𝑡𝑡𝑒𝑛𝑡𝑖𝑜𝑛(𝐻 𝑙−1 )).</formula><p>• Final Hidden Layer: The last hidden state for each position 𝑡 is denoted as 𝐻 𝐿 𝑡 . 
To adapt the CLM for classification, we use the final hidden state of the last token 𝐻 𝐿 𝑇 as the input to a classification Feed-Forward Layer.</p><p>• Classification Feed-Forward Layer: Projects the last hidden state to the class probabilities:</p><formula xml:id="formula_5">𝑧 = 𝐹 𝐹 𝑐 (𝐻 𝐿 𝑇 ),</formula><p>where 𝑧 is the logit vector representing unnormalized scores for each class 𝑐. • Output Layer: Applies a softmax function to obtain the class probabilities:</p><formula xml:id="formula_6">𝑃 (𝑦|𝑋) = 𝑆𝑜𝑓 𝑡𝑚𝑎𝑥(𝑧).</formula><p>• Training Objective for Classification: The training objective is to minimize the cross-entropy loss between the predicted class probabilities and the true class labels:</p><formula xml:id="formula_7">𝐿 = − 𝑁 ∑︁ 𝑖=1 𝐶 ∑︁ 𝑐=1 𝑦 𝑖𝑐 log 𝑃 (𝑦 = 𝑐|𝑋 𝑖 ),</formula><p>where 𝑁 is the number of training samples, 𝐶 is the number of classes, and 𝑦 𝑖𝑐 is a binary indicator (0 or 1) of whether class label 𝑐 is the correct classification for sample 𝑖.</p><p>• Inference for Classification: During inference, the model predicts the class label by selecting the class with the highest probability:</p><formula xml:id="formula_8">𝑦ˆ = argmax 𝑐 𝑃 (𝑦 = 𝑐|𝑋).</formula><p>In addition, we experimented with integrating a QLoRA adapter into the query and value components of the attention heads within the LLMs. QLoRA adapters are designed to enhance model performance by allowing fine-grained tuning during training.</p></div>
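A minimal numerical sketch of this classification head, with a toy hidden size and random stand-in weights (the real hidden size is 4096 for the 7B models, and the weights are learned rather than random):

```python
import numpy as np

rng = np.random.default_rng(0)
d, C = 8, 6                         # toy hidden size; 6 humor classes

H_last = rng.standard_normal(d)     # final hidden state of the last token, H^L_T

# Classification feed-forward layer: z = FF_c(H^L_T)
W_c = rng.standard_normal((C, d)) * 0.1
b_c = np.zeros(C)
z = W_c @ H_last + b_c              # logits, one unnormalized score per class

# Softmax over logits: P(y|X)
p = np.exp(z - z.max())
p /= p.sum()

# Cross-entropy loss for a single sample whose true class index is 3
true_class = 3
loss = -np.log(p[true_class])

# Inference: argmax over class probabilities
y_hat = int(np.argmax(p))
```

The same computation applies per sample during training, where the loss is summed over the batch and backpropagated into the head (and, when present, the QLoRA adapter).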
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Cross-validation</head><p>Given the limited number of examples in the training set, we decided to use the same set for testing and validation. We implemented a stratified cross-validation approach to ensure a robust evaluation of our model. Specifically, we used 5 stratified splits, running each experiment 5 times with 4 splits for training and 1 held-out split for evaluation. This method ensures that each class is equally represented in both the training and validation sets across all splits, thus providing a comprehensive assessment of model performance.</p><p>During the validation phase, our primary monitoring metric was the Matthews Correlation Coefficient (MCC) <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30]</ref>. It is a performance metric for classification that considers all four components of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It is defined by:</p><formula xml:id="formula_9">𝑀 𝐶𝐶 = 𝑇 𝑃 • 𝑇 𝑁 − 𝐹 𝑃 • 𝐹 𝑁 √︀ (𝑇 𝑃 + 𝐹 𝑃 )(𝑇 𝑃 + 𝐹 𝑁 )(𝑇 𝑁 + 𝐹 𝑃 )(𝑇 𝑁 + 𝐹 𝑁 ) .</formula><p>The MCC ranges from -1 to 1, with 1 indicating a perfect prediction, 0 indicating no better than a random prediction, and -1 indicating total disagreement between the prediction and the actual values. The MCC is particularly valuable for evaluating models on unbalanced datasets, as it is less susceptible to the limitations of traditional metrics like accuracy or F-Score, which may give high scores to models that are biased towards the majority class. The MCC, however, accounts for the balance ratios between the classes and the quality of the predictions for all classes. 
By considering all aspects of the confusion matrix, it provides a more comprehensive evaluation of the classifier's performance, making it robust against class imbalance and more reflective of the true predictive capability of the model.</p></div>
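For the binary case, the MCC defined above reduces to a few lines; multi-class implementations such as `sklearn.metrics.matthews_corrcoef` generalize this. A minimal sketch:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # convention: 0 when a margin is empty

print(mcc(50, 40, 0, 0))    # perfect prediction   -> 1.0
print(mcc(25, 20, 20, 25))  # chance-level         -> 0.0
print(mcc(0, 0, 20, 25))    # total disagreement   -> -1.0
```

The three calls illustrate the endpoints of the metric's range described above.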
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Training Parameters</head><p>We used a batch size of 16 and a maximum of 10 epochs for training. These parameters were chosen to balance training efficiency and model performance. Two different learning rates were used: 1𝑒 − 3 for the Feed Forward layer and 1.5𝑒 − 4 for the QLoRA adapter when it was included. The learning rates were selected based on preliminary experiments and common practices in fine-tuning LLMs. The FF layer typically benefits from a higher learning rate due to its role in direct task-specific adaptation, while the QLoRA adapter requires a more conservative rate to ensure stability and effective integration. To further refine the training process, we experimented with a linear learning rate scheduler and gradient clipping. These techniques are known to enhance training stability and performance, especially in the context of large models. We also conducted an experiment to determine whether class weighting in the cross-entropy loss could improve performance, especially in scenarios where the data is imbalanced.</p><p>Two configurations of the QLoRA adapter were tested: rank 64 with alpha 16, and rank 16 with alpha 64. These configurations were chosen because the original LoRA paper demonstrated that a minimal rank with a high alpha performs better <ref type="bibr" target="#b23">[24]</ref>, whereas the QLoRA paper demonstrated that a high rank with a low alpha performs better <ref type="bibr" target="#b24">[25]</ref>. Both configurations incorporated a dropout rate of 0.1 to prevent overfitting.</p><p>Finally, we trained the model on each hidden layer of each LLM to see if there was a relationship between class performance and depth.</p></div>
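The linear learning-rate schedule can be sketched as a pure function of the training step; the warmup-then-decay shape follows the 1-epoch warmup over a 10-epoch budget used in the best runs, while the concrete step counts below are illustrative assumptions:

```python
def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)

base_lr = 1e-3              # FF-layer learning rate from the paper
warmup, total = 100, 1000   # e.g. 1 warmup epoch out of 10, in steps

print(linear_lr(0, base_lr, warmup, total))     # start of warmup: 0.0
print(linear_lr(100, base_lr, warmup, total))   # peak: 1e-3
print(linear_lr(1000, base_lr, warmup, total))  # end of training: 0.0
```

In practice the same shape would be applied to both parameter groups (FF layer and QLoRA adapter), each scaled by its own base learning rate.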
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Additional Details</head><p>We submitted 9 test results using three distinct strategies to assess model performance: ensemble, high, and low. In the ensemble strategy, we employed a majority vote among the five models derived through cross-validation. For the high and low strategies, we selected the best- and worst-performing splits obtained from cross-validation, respectively.</p><p>The 36 experiments conducted as part of this study were executed on the Grid5000 platform, with an average runtime between 2h00 and 2h30 on an NVIDIA A100 GPU (40 GiB). While these values should be interpreted with caution, they provide a general indication of the cost and resources required to perform this type of research.</p></div>
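The ensemble strategy (a majority vote among the five cross-validation models) can be sketched as follows; the tie-breaking rule (first label seen wins) is our assumption, as the paper does not specify one:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of class labels per model, aligned by example."""
    ensembled = []
    for votes in zip(*predictions):
        # most_common(1) breaks ties by first occurrence (insertion order)
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled

# Five models' predictions for three hypothetical test examples
preds = [
    ["IR", "WS", "SD"],
    ["SC", "WS", "SD"],
    ["IR", "AID", "SD"],
    ["IR", "WS", "EX"],
    ["SC", "WS", "SD"],
]
print(majority_vote(preds))  # ['IR', 'WS', 'SD']
```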
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>In this section, we first look at the results obtained from our various experiments. All these results should be interpreted with caution. The variance of the results across the 5 splits does not always allow us to conclude that one model performs better than another, as is often the case with Llama2 and Llama3. Furthermore, the small size of the dataset and the absence of a test set also limit the interpretation of the results. Secondly, we look at the results obtained on the test set provided by the JOKER shared task's organizers and evaluated independently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Parameters Results</head><p>The results of our parameter-focused experiments are reported in Table <ref type="table" target="#tab_1">2</ref>. A subset of the Llama2 experiments is reported here, with the full set of results being accessible within the GitHub repository <ref type="foot" target="#foot_3">4</ref> . The observations presented below are equally applicable to the other LLMs we experimented with.</p><p>We can see that balancing the cross-entropy loss did not lead to any significant improvements in the model's performance. Despite the theoretical advantages of mitigating class imbalance, our findings showed that its impact on the metrics was worse than expected. Conversely, implementing a linear scheduling strategy for the learning rate yielded a notable enhancement in the model's performance. This approach permitted a more gradual adjustment of the learning rate, which in turn facilitated better convergence and reduced overfitting, as evidenced by lower validation loss and higher overall accuracy.</p><p>The application of QLoRA demonstrated substantial improvements across all tested setups. QLoRA consistently enhanced performance metrics, indicating its robustness. Notably, configurations utilizing a rank of 16 and an alpha of 64 demonstrated superior results in every experimental setup. This indicates that the combination of reduced parameter dimensionality with a strong scaling factor can effectively capture essential features and nuances in the data, thereby enhancing the model's performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">LLMs Results</head><p>In our comparative analysis of LLMs (see Table <ref type="table" target="#tab_2">3</ref>), we observed that Llama2 consistently outperformed both Llama3 and Mistral in their best setups. This finding underscores that the performance of LLMs is influenced by factors beyond the mere chronological advancement of the model. Notwithstanding the more recent architectural developments and potential enhancements in Llama3 and Mistral, Llama2's superior performance highlights the importance of specific optimizations and configurations, which can play a critical role in achieving better results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Classification Results</head><p>In the context of our classification task, we observed varying levels of difficulty among different classes. The results are reported in Table <ref type="table">4</ref> and Figure <ref type="figure" target="#fig_1">2</ref>. In general, it can be observed that the different LLMs exhibit comparable ease and difficulty across the various classes. This conclusion is supported by the observation that the confusion matrices are also highly similar.</p><p>It is notable that IR (Irony) and SC (Sarcasm) were challenging to differentiate, given their subtle distinctions and overlapping characteristics in textual expressions. Furthermore, EX (Exaggeration) was frequently misclassified as either irony or sarcasm due to its nuanced nature. Although instances of ambiguity between AID (Incongruity) and WS (Wit and Surprise) were less common, they still presented some classification challenges. Interestingly, the overall performance for WS (Wit and Surprise), SD (Self-Deprecating), and AID (Incongruity) was relatively robust. The high accuracy in classifying WS (Wit and Surprise) can be attributed to its status as the majority class, which naturally leads to a more substantial training set and better model performance. In contrast, the strong results for SD (Self-Deprecating) and AID (Incongruity) were somewhat unexpected, particularly for SD (Self-Deprecating), which is significantly underrepresented in the dataset. These findings suggest that while certain classes exhibit inherent complexities leading to misclassification, others (even with fewer examples) can achieve reliable identification when given appropriate model training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>The optimal results for each LLM by class. They were obtained using QLoRA with rank 16 and alpha 64. A linear scheduler was employed with a warmup of 1 epoch and a full training of 10 epochs. The support values are presented as intervals, as they may vary slightly depending on the split.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Qualitative Results</head><p>We have identified two common types of errors that occur regardless of the LLM. The first consists of confusing WS (Wit and Surprise) with AID (Incongruity) when the text consists of a question followed by an answer. Each such text presents a question followed by a clever or unexpected answer, often relying on homophones, similar-sounding words, or humorous reinterpretations of common phrases. The humor is derived from the audience's recognition of the pun or wordplay, resulting in a light-hearted and amusing effect. The format is simple and straightforward, making it easy to deliver and understand, typical of classic joke telling. Here are some examples:</p><p>• What do you call a fish wearing a crown? King Cod! • What do you call a doctor who treats retired soldiers? A sawbone.</p><p>• Where do the pancakes live? In an apartment.</p><p>• What did the janitor say when he jumped out of the closet? Supplies! • How does a penguin catch a fish? It just waddles down to the grocery store!</p><p>The second common type of mistake is to wrongly predict texts that use the first person as SD (Self-Deprecating). This is perfectly understandable: since most SD examples use the first person, the model is biased by this pattern. Some examples below:</p><p>• My poo is green, how festive. • I'm mad at myself for not taking karate sooner. • My name is Bet. I am a cutter.</p><p>• I always pronounce one word wrong. Wrong.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Hidden Layer Analysis</head><p>Although there is no clear association between the hidden-layer index and the score, certain classes, such as IR (Irony), are clearly sensitive to the depth of the model: the performance of IR on low layers is worse for each model. The same can be observed for SD (Self-Deprecating) and EX (Exaggeration), although to a lesser extent. Conversely, some classes, such as WS (Wit and Surprise) and AID (Incongruity), appear relatively stable regardless of the depth of the model. As might be expected, these results align with the overall performance of each class.</p></div>
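One simple way to quantify the layer sensitivity discussed here is the spread (max minus min) of each class's F1-score across hidden-layer indices: a large spread marks a depth-sensitive class such as IR, a small one a stable class such as WS. The helper and the per-layer scores below are made up purely for illustration:

```python
def layer_spread(scores_by_layer):
    """Max minus min F1 per class across hidden layers; a large spread
    means the class is sensitive to the depth of the probed layer."""
    classes = scores_by_layer[0].keys()
    return {
        c: max(s[c] for s in scores_by_layer) - min(s[c] for s in scores_by_layer)
        for c in classes
    }

# Illustrative per-layer F1 scores for two classes (shallow -> deep).
scores = [
    {"IR": 0.35, "WS": 0.52},
    {"IR": 0.55, "WS": 0.54},
    {"IR": 0.63, "WS": 0.53},
]
spread = layer_spread(scores)  # IR varies far more with depth than WS
```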
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Submission Results</head><p>All results submitted for the CLEF 2024 JOKER shared task 2 are presented in Table <ref type="table">8</ref>. The four submissions with the highest scores are ours, and all of our submissions are among the top 12. This indicates that the methodology employed is both effective and consistent. In terms of macro F1-score, our highest-scoring submission achieved 0.70, while our lowest-scoring submission achieved 0.604. The second-best approach, apart from ours, achieved 0.638. It is also noteworthy that other approaches using LLMs were submitted. While we lack the information to properly compare them, it appears that the relatively simple approach we employed is the most effective.</p><p>The detailed results of our submissions are presented in Tables <ref type="table" target="#tab_6">5 and 7</ref>. The first observation is that the relative performance of the strategies is consistent across LLMs. Specifically, the ensemble strategy consistently outperforms the high strategy, which in turn outperforms the low strategy most of the time. This observation is particularly noteworthy, as it suggests that the optimal training set identified through cross-validation is also the most effective when evaluated on the final test set. If the low-performing model had outperformed the high-performing model on the real test set, it would indicate that the optimal model identified through cross-validation may not be the optimal model overall.</p><p>As only three runs were submitted per model, it is not possible to definitively conclude that this holds for all five splits. Nevertheless, the observed trends offer valuable insights into the performance of the models and the reliability of cross-validation on this dataset.</p><p>The distribution of classes in the final test set differs from that of the train set, as illustrated in Table <ref type="table" target="#tab_5">6</ref>. For example, 20% of the test set examples are IR (Irony), against only 12% of the train set. Conversely, only 7% are WS (Wit and Surprise), compared with 37% in the train set. Table <ref type="table" target="#tab_6">7</ref> shows that for the AID (Incongruity) class, the F1-score increases by 1.2 points compared with the train set. Furthermore, the poor scores obtained on the WS (Wit and Surprise) class are confirmed: 0.538 for the train set versus 0.521 for the test set. On the other hand, the IR (Irony) class results are particularly noteworthy, as they indicate a significant improvement over the train set results (+0.3 F1-score).</p></div>
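As an illustration of how the five fold models could be combined (a simple majority vote, which is one plausible reading of the ensemble strategy, not necessarily the exact rule used in our runs):

```python
from collections import Counter

def ensemble_vote(per_model_predictions):
    """Majority vote across fold models for each test example.
    Ties are broken in favor of the label encountered first."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*per_model_predictions)
    ]

# Five fold models each predicting three test examples:
preds = [
    ["IR", "WS", "SD"],
    ["IR", "AID", "SD"],
    ["SC", "WS", "SD"],
    ["IR", "WS", "EX"],
    ["IR", "AID", "SD"],
]
final = ensemble_vote(preds)  # -> ["IR", "WS", "SD"]
```

Such a vote explains why the ensemble run can beat even the single best fold model: individual fold errors are outvoted when the other folds agree.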
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In conclusion, the combination of LLM embeddings, a simple classification layer, and a cross-validation strategy yielded the best performance on this task. This outcome suggests the potential utility of these embeddings in complex text classification tasks such as irony and humor categorization.</p><p>Our findings demonstrate the importance of considering more nuanced factors beyond the mere recency of the model. Llama2 emerges as the superior model, outperforming Llama3 and Mistral across the majority of configurations. This indicates that factors beyond chronological advancement influence the efficacy of the models, thus answering RQ2.</p><p>Furthermore, the integration of QLoRA consistently enhanced performance, regardless of the base model. The combination of lower ranks and higher alpha values yielded unexpected yet consistent improvements, raising further questions regarding RQ3. Notably, confusions persisted between IR (Irony) and SC (Sarcasm). Qualitative analyses yielded intriguing insights, particularly regarding the impact of coronavirus-related text on correctness: IR, SC, and EX demonstrated sensitivity to such mentions. Fluctuations in correctness were observed, suggesting that different classes exhibited varying degrees of performance.</p><p>Further investigations into the hidden layers revealed varying sensitivity among classes to the depth of the representation. In particular, WS (Wit and Surprise) and AID (Incongruity) demonstrated resilience, whereas other classes displayed sensitivity, partially answering RQ4.</p><p>In essence, our study highlights the multifaceted nature of language model performance on complex classification, underscoring the necessity of comprehensive evaluations encompassing both quantitative metrics and qualitative considerations to elucidate the underlying mechanisms and optimize efficacy in text classification tasks. It also showed that the general knowledge embedded in LLMs facilitates more accurate classification of irony and humor genres, even though it is still far from sufficient (RQ1).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Length distribution of training set for each LLM tokenizer</figDesc><graphic coords="4,72.00,65.61,451.27,163.22" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Confusion matrix aggregated over the five splits of the best run for each LLM.</figDesc><graphic coords="9,72.00,65.61,451.27,145.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The value of different scores over the hidden layer index for llama2-7b best setup</figDesc><graphic coords="10,72.00,65.60,451.28,229.93" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: The value of different scores over the hidden layer index for mistral-7b best setup</figDesc><graphic coords="10,72.00,334.95,451.28,229.93" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The value of different scores over the hidden layer index for llama3-8b best setup</figDesc><graphic coords="11,72.00,65.60,451.28,229.93" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Parameter-oriented results for the Llama2-7b model. ↑ indicates that higher is better.</figDesc><table><row><cell></cell><cell cols="2">Balanced Scheduler</cell><cell cols="2">Macro F1↑ Weighted F1↑</cell><cell>Accuracy↑</cell><cell>MCC↑</cell></row><row><cell></cell><cell>no</cell><cell>no</cell><cell>0.603 (0.057)</cell><cell>0.672 (0.043)</cell><cell>0.684 (0.034)</cell><cell>0.592 (0.042)</cell></row><row><cell>No QLoRA</cell><cell>yes</cell><cell>no</cell><cell>0.608 (0.041)</cell><cell>0.675 (0.030)</cell><cell>0.679 (0.036)</cell><cell>0.588 (0.046)</cell></row><row><cell></cell><cell>no</cell><cell>yes</cell><cell cols="4">0.626 (0.046) 0.690 (0.040) 0.695 (0.043) 0.605 (0.054)</cell></row><row><cell></cell><cell>no</cell><cell>no</cell><cell>0.630 (0.039)</cell><cell>0.696 (0.028)</cell><cell>0.700 (0.027)</cell><cell>0.612 (0.033)</cell></row><row><cell>64r, 16a</cell><cell>yes</cell><cell>no</cell><cell>0.633 (0.036)</cell><cell>0.700 (0.030)</cell><cell>0.705 (0.034)</cell><cell>0.616 (0.044)</cell></row><row><cell></cell><cell>no</cell><cell>yes</cell><cell cols="4">0.632 (0.025) 0.700 (0.024) 0.706 (0.026) 0.618 (0.033)</cell></row><row><cell></cell><cell>no</cell><cell>no</cell><cell cols="2">0.661 (0.031) 0.720 (0.030)</cell><cell>0.722 (0.031)</cell><cell>0.641 (0.040)</cell></row><row><cell>16r, 64a</cell><cell>yes</cell><cell>no</cell><cell>0.644 (0.024)</cell><cell>0.706 (0.022)</cell><cell>0.709 (0.021)</cell><cell>0.623 (0.028)</cell></row><row><cell></cell><cell>no</cell><cell>yes</cell><cell>0.654 (0.032)</cell><cell cols="3">0.720 (0.024) 0.731 (0.020) 0.653 (0.023)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>The optimal results for each LLM. They were obtained using QLoRA with rank 16 and alpha 64. A linear scheduler was employed with a warmup of 1 epoch and a full training of 10 epochs. ↑ indicates that higher is better.</figDesc><table><row><cell>Model</cell><cell>Macro F1↑</cell><cell>Weighted F1↑</cell><cell>Accuracy↑</cell><cell>MCC↑</cell></row><row><cell>llama2-7b</cell><cell>0.654 (0.032)</cell><cell cols="3">0.720 (0.024) 0.731 (0.020) 0.653 (0.023)</cell></row><row><cell cols="3">llama3-8b 0.663 (0.022) 0.721 (0.016)</cell><cell>0.724 (0.014)</cell><cell>0.643 (0.018)</cell></row><row><cell>mistral-7b</cell><cell>0.652 (0.030)</cell><cell>0.717 (0.024)</cell><cell>0.718 (0.023)</cell><cell>0.636 (0.027)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Our submission results for each LLM. The best overall results are in bold. The best results for each LLM are underlined. ↑ indicates that higher is better.</figDesc><table><row><cell cols="2">Model Strategy</cell><cell></cell><cell>Macro↑</cell><cell></cell><cell></cell><cell>Weighted↑</cell><cell>Accuracy↑</cell></row><row><cell></cell><cell></cell><cell>P</cell><cell>R</cell><cell>F</cell><cell>P</cell><cell>R</cell><cell>F</cell></row><row><cell></cell><cell>ens</cell><cell cols="5">0.684 0.672 0.659 0.737 0.738 0.721</cell><cell>0.738</cell></row><row><cell>llama2</cell><cell>high</cell><cell cols="5">0.632 0.646 0.635 0.711 0.711 0.708</cell><cell>0.711</cell></row><row><cell></cell><cell>low</cell><cell cols="5">0.649 0.635 0.617 0.708 0.701 0.683</cell><cell>0.701</cell></row><row><cell></cell><cell>ens</cell><cell cols="5">0.714 0.697 0.700 0.753 0.756 0.749</cell><cell>0.756</cell></row><row><cell>mistral</cell><cell>high</cell><cell cols="5">0.669 0.657 0.660 0.719 0.723 0.719</cell><cell>0.723</cell></row><row><cell></cell><cell>low</cell><cell cols="5">0.650 0.606 0.604 0.694 0.673 0.661</cell><cell>0.673</cell></row><row><cell></cell><cell>ens</cell><cell cols="5">0.675 0.652 0.659 0.724 0.727 0.723</cell><cell>0.727</cell></row><row><cell>llama3</cell><cell>high</cell><cell cols="5">0.630 0.611 0.614 0.689 0.701 0.691</cell><cell>0.701</cell></row><row><cell></cell><cell>low</cell><cell cols="5">0.638 0.626 0.629 0.696 0.699 0.695</cell><cell>0.699</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Comparison of the distribution of classes between the cleaned training set and the test set.</figDesc><table><row><cell></cell><cell cols="2">Train set</cell><cell>Test set</cell><cell></cell></row><row><cell></cell><cell cols="4">Count Ratio Count Ratio</cell></row><row><cell>IR</cell><cell>209</cell><cell>0.12</cell><cell>147</cell><cell>0.20</cell></row><row><cell>SC</cell><cell>350</cell><cell>0.21</cell><cell>59</cell><cell>0.08</cell></row><row><cell>AID</cell><cell>229</cell><cell>0.13</cell><cell>270</cell><cell>0.37</cell></row><row><cell>SD</cell><cell>162</cell><cell>0.10</cell><cell>91</cell><cell>0.13</cell></row><row><cell>EX</cell><cell>122</cell><cell>0.07</cell><cell>106</cell><cell>0.15</cell></row><row><cell>WS</cell><cell>632</cell><cell>0.37</cell><cell>49</cell><cell>0.07</cell></row><row><cell>Total</cell><cell>1704</cell><cell>1</cell><cell>722</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc>Our submission results for each LLM by class. The best overall results are in bold. The best results for each LLM are underlined.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="2">IR (support=147)</cell><cell></cell><cell cols="2">SC (support=59)</cell><cell></cell><cell>AID (support=270)</cell></row><row><cell></cell><cell></cell><cell>P</cell><cell>R</cell><cell>F</cell><cell>P</cell><cell>R</cell><cell>F</cell><cell>P</cell><cell>R</cell><cell>F</cell></row><row><cell></cell><cell>ens</cell><cell cols="8">0.596 0.782 0.676 0.649 0.814 0.722 0.891 0.911 0.901</cell></row><row><cell>llama2</cell><cell cols="9">high 0.669 0.592 0.628 0.524 0.729 0.610 0.885 0.885 0.885</cell></row><row><cell></cell><cell>low</cell><cell cols="8">0.553 0.741 0.634 0.570 0.831 0.676 0.860 0.889 0.874</cell></row><row><cell></cell><cell>ens</cell><cell cols="8">0.670 0.830 0.742 0.733 0.746 0.739 0.869 0.881 0.875</cell></row><row><cell>mistral</cell><cell cols="9">high 0.649 0.769 0.704 0.695 0.695 0.695 0.855 0.852 0.853</cell></row><row><cell></cell><cell>low</cell><cell cols="8">0.492 0.816 0.614 0.714 0.508 0.594 0.888 0.793 0.838</cell></row><row><cell></cell><cell>ens</cell><cell cols="8">0.592 0.721 0.650 0.765 0.661 0.709 0.903 0.893 0.898</cell></row><row><cell>llama3</cell><cell cols="9">high 0.563 0.667 0.611 0.731 0.644 0.685 0.874 0.896 0.885</cell></row><row><cell></cell><cell>low</cell><cell cols="8">0.539 0.653 0.591 0.678 0.678 0.678 0.881 0.881 0.881</cell></row><row><cell></cell><cell></cell><cell></cell><cell cols="2">SD (support=91)</cell><cell></cell><cell cols="2">EX (support=106)</cell><cell></cell><cell>WS (support=49)</cell></row><row><cell></cell><cell></cell><cell>P</cell><cell>R</cell><cell>F</cell><cell>P</cell><cell>R</cell><cell>F</cell><cell>P</cell><cell>R</cell><cell>F</cell></row><row><cell></cell><cell>ens</cell><cell cols="8">0.773 0.824 0.798 0.643 0.255 0.365 0.550 0.449 0.494</cell></row><row><cell>llama2</cell><cell cols="9">high 0.755 0.813 0.783 0.510 0.491 0.500 0.450 0.367 0.404</cell></row><row><cell></cell><cell>low</cell><cell cols="8">0.823 0.714 0.765 0.629 0.208 0.312 0.457 0.429 0.442</cell></row><row><cell></cell><cell>ens</cell><cell cols="8">0.763 0.780 0.772 0.639 0.434 0.517 0.610 0.510 0.556</cell></row><row><cell>mistral</cell><cell cols="9">high 0.708 0.747 0.727 0.581 0.472 0.521 0.526 0.408 0.460</cell></row><row><cell></cell><cell>low</cell><cell cols="8">0.667 0.835 0.741 0.535 0.217 0.309 0.605 0.469 0.529</cell></row><row><cell></cell><cell>ens</cell><cell cols="8">0.762 0.846 0.802 0.483 0.406 0.441 0.543 0.388 0.452</cell></row><row><cell>llama3</cell><cell cols="9">high 0.748 0.846 0.794 0.448 0.368 0.404 0.414 0.245 0.308</cell></row><row><cell></cell><cell>low</cell><cell cols="8">0.774 0.791 0.783 0.494 0.387 0.434 0.462 0.367 0.409</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://huggingface.co/meta-llama/Llama-2-7b-chat-hf</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://github.com/PierreEpron/joker2024-task2</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Acknowledgments</head><p>Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Shared Task results</head></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Error analysis in a hate speech detection task: The case of haspeede-tw at evalita</title>
		<author>
			<persName><forename type="first">C</forename><surname>Francesconi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Poletto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sanguinetti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CEUR WORKSHOP PROCEEDINGS</title>
				<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2018">2018. 2019</date>
			<biblScope unit="volume">2481</biblScope>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Multilingual fake news detection with satire</title>
		<author>
			<persName><forename type="first">G</forename><surname>Guibon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ermakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Seffih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Firsov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">L</forename><surname>Noé-Bienvenu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Computational Linguistics and Intelligent Text Processing</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="392" to="402" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SemEval-2017 task 7: Detection and interpretation of English puns</title>
		<author>
			<persName><forename type="first">T</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">F</forename><surname>Hempelmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Workshop on Semantic Evaluation</title>
				<meeting>the 11th International Workshop on Semantic Evaluation<address><addrLine>SemEval-</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="58" to="68" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Humor detection: A transformer gets the last laugh</title>
		<author>
			<persName><forename type="first">O</forename><surname>Weller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Seppi</surname></persName>
		</author>
		<idno>ArXiv abs/1909.00252</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno>CoRR abs/1810.04805</idno>
		<ptr target="http://arxiv.org/abs/1810.04805" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Unified humor detection based on sentence-pair augmentation and transfer learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Deng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd Annual Conference of the European Association for Machine Translation</title>
				<meeting>the 22nd Annual Conference of the European Association for Machine Translation</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="53" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Humor detection in product question answering systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ziser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kravi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carmel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">From humor recognition to irony detection: The figurative language of social media</title>
		<author>
			<persName><forename type="first">A</forename><surname>Reyes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Buscaldi</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.datak.2012.02.005</idno>
		<ptr target="https://doi.org/10.1016/j.datak.2012.02.005" />
	</analytic>
	<monogr>
		<title level="j">Data &amp; Knowledge Engineering</title>
		<imprint>
			<biblScope unit="volume">74</biblScope>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">SemEval-2018 task 3: Irony detection in English tweets</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">V</forename><surname>Hee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lefever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th international workshop on semantic evaluation</title>
				<meeting>the 12th international workshop on semantic evaluation</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="39" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Epic: Multi-perspective annotation of a corpus of irony</title>
		<author>
			<persName><forename type="first">S</forename><surname>Frenda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pedrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Cignarella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Panizzon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Marco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Scarlini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bernardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="13844" to="13857" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The performance of humor in computer-mediated communication</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">K</forename><surname>Baym</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of computer-mediated communication</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">C123</biblScope>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Effects of humor in task-oriented human-computer interaction and computer-mediated communication: A direct test of srct theory</title>
		<author>
			<persName><forename type="first">J</forename><surname>Morkes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">K</forename><surname>Kernal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Nass</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Human-Computer Interaction</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="395" to="435" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Blecher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Ferrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cucurull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Esiobu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Hartshorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Inan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kerkez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khabsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">M</forename><surname>Kloumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Korenev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liskovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Molybog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poulton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reizenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rungta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saladi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schelten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Kuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Zarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kambadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stojnic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<idno>ArXiv abs/2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Mistral 7B</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>De Las Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">R</forename><surname>Lavaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Le Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>El Sayed</surname></persName>
		</author>
		<idno>ArXiv abs/2310.06825</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>McCandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno>ArXiv abs/2005.14165</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Achiam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Adler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Akkaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">L</forename><surname>Aleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Altenschmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Altman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Anadkat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Avila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Babuschkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Balaji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Balcom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Baltescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bavarian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Belgum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Berdine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bernadett-Shapiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bogdonoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Boiko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Boyd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-L</forename><surname>Brakman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Brockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brooks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brundage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Button</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Campbell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Carey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Carlson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Carmichael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chantzis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cummings</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Currier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Decareaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Degry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Deutsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Deville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dowling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dunning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ecoffet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Eleti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Eloundou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Farhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Felix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Fishman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Forte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Fulford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Georges</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gibson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gogineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gontijo-Lopes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grafstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Greene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Harris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heaton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heidecke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hickey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hickey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hoeschele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Houghton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huizinga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jomoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Jonn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kaftan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kamali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kanitscheider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">S</forename><surname>Keskar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kilpatrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kirchner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Knight</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kokotajlo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kondraciuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kondrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Konstantinidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kosic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lampe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leike</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lowe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Makanju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Malfacini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Markovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mayer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mayne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>McGrew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>McKinney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>McLeavey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>McMillan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>McNeil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Medina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Menick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Metz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mishchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Monaco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Morikawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Mossing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Murati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Murk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mély</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nakano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nayak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Noh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>O'Keefe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Pachocki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Palermo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pantuliano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Parascandolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Parish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Parparita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pavlov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Perelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>De Avila Belbute Peres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Petrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>De Oliveira Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pokorny</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pokrass</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">H</forename><surname>Pong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Powell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Power</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Power</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Proehl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Puri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raymond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Real</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Rimbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rotsted</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Roussez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Saltarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sanders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schnurr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Selsam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sheppard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sherbakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shieh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shoker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sidor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sitkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Slama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sohl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Sokolowsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Staudacher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">P</forename><surname>Such</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Summers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Tezak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Thompson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tillet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tootoonchian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tseng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tuggle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Turley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tworek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F C</forename><surname>Uribe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vallone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vijayvergiya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Wainwright</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Weinmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Welihinda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Welinder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiethoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Willner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wolrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Workman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zaremba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zellers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<title level="m">GPT-4 technical report</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<ptr target="https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md" />
		<author>AI@Meta</author>
		<title level="m">Llama 3 model card</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2009.03300</idno>
		<title level="m">Measuring massive multitask language understanding</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge</title>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Cowhey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sabharwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schoenick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tafjord</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.05457</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">HellaSwag: Can a machine really finish your sentence?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zellers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bisk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1472</idno>
		<ptr target="https://aclanthology.org/P19-1472" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4791" to="4800" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">OpenAI blog</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Prompt programming for large language models: Beyond the few-shot paradigm</title>
		<author>
			<persName><forename type="first">L</forename><surname>Reynolds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>McDonell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Prompt-augmented linear probing: Scaling beyond the limit of few-shot in-context learners</title>
		<author>
			<persName><forename type="first">H</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-W</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Yoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kim</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.10873</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wallis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Allen-Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2106.09685</idno>
		<title level="m">LoRA: Low-rank adaptation of large language models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">QLoRA: Efficient finetuning of quantized LLMs</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dettmers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pagnoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14314</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Overview of JOKER - CLEF-2024 track on automatic humor analysis</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ermakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-G</forename><surname>Bosser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">M P</forename><surname>Preciado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jatowt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Mulhem</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Quénot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Schwab</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><forename type="middle">M D</forename><surname>Nunzio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">CLEF 2024 JOKER lab: Automatic humour analysis</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ermakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-G</forename><surname>Bosser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Thomas-Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">M P</forename><surname>Preciado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jatowt</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-56072-9_5</idno>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">N</forename><surname>Goharian</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lipani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>McDonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Ounis</surname></persName>
		</editor>
		<meeting><address><addrLine>Glasgow, UK; Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">March 24-28, 2024</date>
			<biblScope unit="volume">14613</biblScope>
			<biblScope unit="page" from="36" to="43" />
		</imprint>
	</monogr>
	<note>Proceedings, Part VI</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">A pattern-based approach for sarcasm detection on twitter</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bouazizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">O</forename><surname>Ohtsuki</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="5477" to="5488" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Cramér</surname></persName>
		</author>
		<idno type="DOI">10.1515/9781400883868</idno>
		<ptr target="https://doi.org/10.1515/9781400883868" />
		<title level="m">Mathematical Methods of Statistics (PMS-9)</title>
				<meeting><address><addrLine>Princeton</addrLine></address></meeting>
		<imprint>
			<publisher>Princeton University Press</publisher>
			<date type="published" when="1946">1946</date>
			<biblScope unit="volume">9</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Comparison of the predicted and observed secondary structure of T4 phage lysozyme</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">W</forename><surname>Matthews</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biochimica et Biophysica Acta</title>
		<imprint>
			<biblScope unit="volume">405</biblScope>
			<biblScope unit="page" from="442" to="451" />
			<date type="published" when="1975">1975</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
