<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CLEF 2024 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ORPAILLEUR &amp; SyNaLP at CLEF 2024 Task 2: Good Old Cross Validation for Large Language Models Yields the Best Humorous Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre Epron</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaël Guibon</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Couceiro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INESC-ID, IST, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIPN, Université Sorbonne Paris Nord</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LORIA, Université de Lorraine</institution>
          ,
          <addr-line>CNRS</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>In the context of the JOKER 2024 Task 2 Challenge, this paper presents an approach that leverages the latent representations derived from different Large Language Models (LLMs) to drive a classification mechanism. Our methodology involves exploiting the "knowledge" encoded in LLMs to effectively discriminate humor genres. Experimental results are promising and demonstrate the effectiveness of our approach. However, inherent complexities remain, such as the proximity between certain classes and biases arising from the dataset distributions. These complexities warrant further investigation to refine the classification process and improve overall performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Humor genre classification</kwd>
        <kwd>Large language models</kwd>
        <kwd>Text classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>• RQ4: Which hidden layer depth of LLMs yields the best classification results? We
investigate whether certain layers of the network provide more relevant features for classification and if
this optimal depth is consistent across different LLMs and humor categories.</p>
<p>By addressing these questions, our study aims to deepen the understanding of how LLMs can be
effectively utilized for complex classification tasks and to identify the factors that contribute to their
performance. This research addresses the applicability of LLMs to the complex challenge of humor
classification.</p>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
        <p>
          The necessity for automatic humor detection arises from the increasing influence of conversational
agents and the omnipresence of social media platforms. In the digital realm, where interactions are
increasingly mediated by algorithms, discerning humor has become crucial. This imperative extends to
various applications, including chatbots, recommender systems, social media reputation management,
and the crucial task of identifying and combating fake news and hate speech [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
]. Early efforts in
humor detection primarily focused on the intricate dynamics of wordplay. The seminal evaluation
campaign explored tasks including pun detection, pun location, and pun interpretation [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However,
a significant challenge in this domain has been the scarcity of appropriate training data, particularly
evident for languages beyond English. Recent advances in automatic humor detection have been driven
by the development of contextualized embeddings, which have facilitated a broader recognition of
humor across diverse contexts [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Moreover, the development of multilingual models, which leverage
the pre-trained BERT architecture [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], has expanded the scope of humor recognition to languages such
as Chinese, Russian, and Spanish [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Additionally, there has been a notable shift towards addressing
domain-specific tasks, exemplified by endeavors to identify humorous queries within Q&amp;A systems [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
While the field of irony and sarcasm detection has received considerable attention [
          <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
          ], the area of
automatic humor detection remains a vital and evolving area of research with implications that extend
to various facets of human-computer interaction and online discourse [11, 12].
        </p>
        <p>The landscape of LLM development is undergoing a period of rapid evolution, with the introduction
of new models such as LLaMA2, Mistral, and the GPT family models, including GPT-3 and GPT-4.
Touvron et al. [13] introduced LLaMA2, which builds upon its predecessor by enhancing the model
architecture and training methodologies, thereby achieving improved performance across a range of
natural language processing tasks. Jiang et al. [14] presented Mistral, a model known for its efficiency
and effectiveness, particularly in low-resource settings, demonstrating impressive capabilities in several
benchmarks. Concurrently, the GPT family of models, developed by OpenAI, has made significant
contributions to the field. GPT-3, introduced by Brown et al. [15], set new standards with its 175
billion parameters, enabling unprecedented performance in generating human-like text and performing
complex language tasks with minimal prompt engineering. Building on this, GPT-4 [16] and Llama3 [17],
the latter detailed in its model card, further enhanced these capabilities by incorporating more sophisticated
training techniques and a larger training corpus, resulting in superior performance across a wider range
of applications. These models have played a pivotal role in advancing the state of the art in natural
language understanding and generation, solidifying their position as indispensable tools in the Natural
Language Processing community [18, 19, 20].</p>
<p>Zero-shot and few-shot learning methods enable LLMs to perform tasks with minimal task-specific
training data. Some studies [21, 22] introduced the concept of using LLMs for zero-shot learning,
demonstrating that models can generalize from pre-trained knowledge to new tasks without explicit
training examples, which further enhances their versatility and application scope. Probing and
feature-based fine-tuning involve using LLMs as classifiers by extracting and utilizing internal representations for
specific tasks. A study [23] presented a method where prompts are augmented to probe LLMs for specific
linguistic features, effectively turning them into classifiers for various natural language processing
tasks. This technique demonstrates the adaptability of LLMs in understanding and categorizing complex
linguistic patterns. LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) are
techniques designed to enhance the efficiency of fine-tuning LLMs by reducing the number of
trainable parameters. Hu et al. [24] introduced LoRA, which inserts trainable rank-decomposition
matrices into the layers of the transformer, significantly reducing the computational cost of fine-tuning.
Dettmers et al. [25] subsequently optimized this approach with QLoRA, which incorporates quantization
to further enhance efficiency while maintaining model performance.</p>
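The low-rank update introduced in [24] can be made concrete with a short sketch. The dimensions below are illustrative toys (a real transformer layer is much larger), and the code only demonstrates the idea of a frozen weight plus a trainable rank-r correction, not the authors' implementation.

```python
import numpy as np

# Sketch of the LoRA idea: instead of updating a full weight matrix W
# (d_out x d_in), train a low-rank update B @ A of rank r, scaled by
# alpha / r. B starts at zero, so the adapted layer initially equals
# the frozen pretrained layer.
d_in, d_out, r, alpha = 64, 64, 16, 64

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                # trainable, zero-initialized

def forward(x):
    # Base path plus scaled low-rank path: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# At initialization (B = 0) the adapted layer matches the frozen one,
# and the trainable parameter count (A + B) is far below that of W.
assert np.allclose(forward(x), W @ x)
```

Only A and B are updated during fine-tuning, which is what makes the approach cheap: here 2 · r · 64 = 2,048 trainable parameters against 4,096 in W, and the gap widens rapidly at realistic layer sizes.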
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
<p>The dataset comes from the JOKER 2024 shared task 2 [26, 27]. It consists of 1,742 humorous texts labelled
with 6 different categories.</p>
      <p>• IR - Irony relies on a gap between the literal meaning and the intended meaning, creating a
humorous twist or reversal.
• SC - Sarcasm involves using irony to mock, criticize, or convey contempt.
• EX - Exaggeration involves magnifying or overstating something beyond its normal or realistic
proportions.
• AID - Incongruity refers to the unexpected or contradictory elements that are combined in a
humorous way and Absurdity involves presenting situations, events, or ideas that are inherently
illogical, irrational, or nonsensical.
• SD - Self-deprecating humour involves making fun of oneself or highlighting one’s own flaws,
weaknesses, or embarrassing situations in a lighthearted manner.
• WS - Wit refers to clever, quick, and intelligent humour and Surprise in humour involves
introducing unexpected elements, twists, or punchlines that catch the audience off guard.</p>
      <p>The primary challenge from the dataset is the imbalance in the number of examples for specific classes.
For instance, the WS (Wit and Surprise) class contains 650 examples, while the EX (Exaggeration) class
contains only 122 examples. The complete class distribution is presented in Table 1. Another
significant challenge is the proximity of certain classes, such as irony and sarcasm. Some definitions of
irony include sarcasm as a form of irony [28]. For this corpus, the definition of irony aligns with the
prevailing understanding of situational irony.</p>
<p>The majority of texts have a length between 20 and 40 tokens (Figure 1). However, some texts are
very long, exceeding 500 tokens. Due to GPU memory limitations, we excluded examples with more
than 170 tokens from the training corpus. Additionally, the training set contains duplicated examples.
Some are quasi-duplicates, e.g., the same example appearing with and without quotation marks. We
removed the exact duplicates, but we could not reliably identify and remove the quasi-duplicates. After
cleaning, we retained 1,704 of the original 1,742 examples.</p>
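The cleaning step described above can be sketched as follows. The whitespace tokenizer is a stand-in (the paper counts tokens with the LLMs' own tokenizers), and the list-of-strings corpus is an illustrative simplification of the actual dataset format.

```python
# Sketch of the cleaning pipeline: drop training examples longer than
# 170 tokens, then remove exact duplicates while preserving order.
# Quasi-duplicates (e.g. quotation-mark variants) are deliberately kept,
# as in the paper.
def clean(examples, max_tokens=170):
    seen, kept = set(), []
    for text in examples:
        if len(text.split()) > max_tokens:  # stand-in for a real token count
            continue
        if text in seen:                    # exact duplicates only
            continue
        seen.add(text)
        kept.append(text)
    return kept

corpus = ["Why did the chicken cross the road?",
          "Why did the chicken cross the road?",  # exact duplicate
          "word " * 200]                          # 200 tokens, too long
assert clean(corpus) == ["Why did the chicken cross the road?"]
```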
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-0">
        <title>3.1. LLMs</title>
        <p>Our objective is to explore the potential of advanced LLMs within a consistent methodological framework.
We employed a 4-bit quantized version of three distinct LLMs: Llama2-7b (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf),
Mistral-7b (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), and Llama3-8b (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
These models were selected for their varying characteristics, which allowed for a comparative analysis.
The distinctions between the Mistral and Llama2 models are primarily due to several advanced
optimization techniques employed by Mistral.</p>
        <p>• Sliding Window Attention improves the efficiency of the attention mechanism by focusing
on a moving window of tokens, rather than attending to all tokens at once. This reduces the
computational complexity and enhances the model’s ability to handle longer sequences effectively.
• Rolling Buffer Cache maintains a fixed-size cache of recently processed data, enabling faster retrieval
and processing of these data chunks, thus improving overall model performance.
• Pre-fill and Chunking involve pre-processing and breaking down input data into manageable
chunks, which can be processed more efficiently by the model, leading to better performance in
terms of both speed and accuracy.</p>
        <p>A significant difference between Llama2 and Llama3 lies in their tokenization strategies and
vocabulary sizes.</p>
        <p>• Llama2 and Mistral: Both models utilize the same tokenizer, which is based on Byte-Pair
Encoding (BPE) and implemented using the sentencepiece approach. They share a vocabulary
size of 32,000 tokens. Sentencepiece is a data-driven method that segments text into subword
units, ensuring a balance between word-level and character-level tokenization.
• Llama3: This model employs a new tokenizer with a vastly increased vocabulary size of 128,256
tokens. While it also employs BPE, it uses the tiktoken approach developed by OpenAI for their
GPT models. The key distinction between tiktoken and other tokenizers is its capacity to bypass
the BPE algorithm when a token already exists in the vocabulary, potentially enhancing efficiency
and tokenization speed.</p>
        <p>The difference in vocabulary size cannot be explained by the tokenizer implementation alone.
The larger vocabulary in Llama3 likely reflects differences in the scope and variety of its training
data, although specific details about these datasets are not publicly available.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Classification</title>
        <p>The primary approach involved the addition of a Feed Forward (FF) layer on the final token of the LLM
representation, situated atop the last hidden state. This method sought to leverage the LLMs’ capacity
to generate rich contextual embeddings for downstream tasks. The Causal Language Model (CLM) is
extended in the following manner.</p>
        <p>Consider a sequence of tokens x = (x_1, x_2, …, x_T), where x_t represents the token at position t in
the sequence. The CLM models the probability of the next token given the previous tokens:</p>
        <p>P(x) = ∏_{t=1}^{T} P(x_t | x_1, x_2, …, x_{t−1}).</p>
        <p>The CLM architecture we use in this paper can be streamlined as follows:
• Input Embedding Layer: converts tokens into dense vector representations: e_t = Embed(x_t).
• Positional Encoding: adds positional information to the token embedding to retain the order
of tokens: p_t = PosEnc(t). The input to the model at position t is h_t^0 = e_t + p_t.
• Attention Mechanism: utilizes masked self-attention to ensure that the prediction for position
t only depends on positions 1 to t − 1. The masked attention weights α_{tj} are computed by
α_{tj} = exp(e_{tj}) / ∑_{k=1}^{t−1} exp(e_{tk}),
where e_{tj} is the compatibility function (e.g., dot product of queries and keys) and the mask ensures
that j ≤ t − 1.
• Feed-Forward Layer: applied after the attention mechanism to introduce non-linearity:
h_t^l = FFN(Attn(h^{l−1})_t).
• Final Hidden Layer: the last hidden state at position t is denoted h_t^L. To adapt
the CLM for classification, we use the final hidden state of the last token, h_T^L, as the input to a
classification feed-forward layer.
• Classification Feed-Forward Layer: projects the last hidden state to the class logits:
z = W h_T^L + b,
where z is the logit vector representing unnormalized scores for each class c.
• Output Layer: applies a softmax function to obtain the class probabilities:
P(y = c | x) = softmax(z)_c.
• Training Objective for Classification: the training objective is to minimize the cross-entropy
loss between the predicted class probabilities and the true class labels:
L = −∑_{n=1}^{N} ∑_{c=1}^{C} y_{n,c} log P(y = c | x_n),
where N is the number of training samples, C is the number of classes, and y_{n,c} is a binary indicator
(0 or 1) of whether class c is the correct label for sample n.
• Inference for Classification: during inference, the model predicts the class label by selecting
the class with the highest probability: ŷ = argmax_c P(y = c | x).</p>
        <p>In addition, we experimented with integrating a QLoRA adapter into the query and value components
of the attention heads within the LLMs. QLoRA adapters enable parameter-efficient fine-tuning by
training low-rank updates on top of the quantized base model.</p>
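The classification head on top of the last hidden state can be sketched numerically. The hidden size below is a toy (a 7B model's hidden size is 4096), and the random weights stand in for a trained feed-forward layer; the sketch only traces the logits → softmax → cross-entropy → argmax path described above.

```python
import numpy as np

# Sketch of the classification head: a feed-forward projection of the
# last token's final hidden state h_T^L to logits over the 6 humor
# classes, followed by softmax, cross-entropy, and argmax inference.
hidden, n_classes = 8, 6
rng = np.random.default_rng(0)

h_last = rng.normal(size=hidden)                # h_T^L from the (frozen) LLM
W = rng.normal(size=(n_classes, hidden)) * 0.1  # classification FF weights
b = np.zeros(n_classes)

z = W @ h_last + b                    # logits (unnormalized class scores)
p = np.exp(z - z.max())               # numerically stable softmax
p /= p.sum()

y_true = 2                            # e.g. the gold class index
loss = -np.log(p[y_true])             # cross-entropy for a single sample
pred = int(np.argmax(p))              # inference: highest-probability class
assert np.isclose(p.sum(), 1.0)
```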
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Cross-validation</title>
        <p>Given the limited number of examples in the training set, we used the same data for both training and
validation rather than holding out a fixed validation set. We implemented a stratified 5-fold
cross-validation: each experiment was run 5 times, using 4 folds for training and the remaining fold
for evaluation. This method ensures that each class is proportionally represented in both the training and
validation sets across all folds, thus providing a comprehensive assessment of model performance.</p>
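The stratified protocol can be sketched with a simple round-robin assignment per class; the experiments presumably use a library implementation, so this stdlib version only illustrates how stratification preserves each class's ratio in every fold.

```python
from collections import defaultdict

# Deal each class's examples round-robin into k folds so that every
# fold keeps (approximately) the class proportions of the full set.
# Each fold then serves once as the validation split while the other
# k - 1 folds train the model.
def stratified_folds(labels, k=5):
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    for indices in by_class.values():
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds

# Toy corpus with a 2:1 class imbalance, mimicking WS vs. EX.
labels = ["WS"] * 10 + ["EX"] * 5
folds = stratified_folds(labels, k=5)
for val_fold in folds:
    train = [i for f in folds if f is not val_fold for i in f]
    assert len(train) == 12
    # Every validation fold preserves the 2:1 WS/EX ratio.
    assert sum(labels[i] == "WS" for i in val_fold) == 2
    assert sum(labels[i] == "EX" for i in val_fold) == 1
```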
        <p>During the validation phase, our primary monitoring metric was the Matthews Correlation Coefficient
(MCC) [29, 30]. It is a performance metric for classification that considers all four components of the
confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
It is defined by:</p>
        <p>MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).</p>
        <p>The MCC ranges from -1 to 1, with 1 indicating a perfect prediction, 0 indicating no better than a
random prediction, and -1 indicating total disagreement between the prediction and the actual values.
The MCC is particularly valuable for evaluating models on unbalanced datasets, as it is less susceptible
to the limitations of traditional metrics like accuracy or F-Score, which may give high scores to models
that are biased towards the majority class. The MCC, however, accounts for the balance ratios between
the classes and the quality of the predictions for both classes. By considering all aspects of the confusion
matrix, it provides a more comprehensive evaluation of the classifier’s performance, making it robust
against class imbalance and more reflective of the true predictive capability of the model.</p>
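The binary form of the formula is easy to compute directly from the confusion-matrix counts; the experiments monitor the multiclass generalization from [29, 30], but the binary case below already exhibits the −1 / 0 / +1 behavior described above.

```python
import math

# Binary Matthews Correlation Coefficient computed from the four
# confusion-matrix counts. Returns 0.0 when any marginal is empty,
# a common convention for the degenerate denominator.
def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

assert mcc(10, 10, 0, 0) == 1.0    # perfect prediction
assert mcc(0, 0, 10, 10) == -1.0   # total disagreement
assert mcc(5, 5, 5, 5) == 0.0      # no better than random
```

Because all four counts enter both the numerator and the denominator, a classifier that simply predicts the majority class scores near 0 rather than near its (inflated) accuracy, which is exactly why MCC suits this imbalanced dataset.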
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Training Parameters</title>
        <p>We used a batch size of 16 and a maximum of 10 epochs for training. These parameters were chosen to
balance training efficiency and model performance. Two different learning rates were used: 1e-3 for
the Feed Forward layer and 1.5e-4 for the QLoRA adapter when it was included. The learning rates
were selected based on preliminary experiments and common practices in fine-tuning LLMs. The FF
layer typically benefits from a higher learning rate due to its role in direct task-specific adaptation, while
the QLoRA adapter requires a more conservative rate to ensure stability and effective integration. To
further refine the training process, we experimented with a linear learning rate scheduler and gradient
clipping, techniques known to enhance training stability and performance, especially in the
context of large models. We also conducted an experiment to determine whether class weighting could
improve performance, especially in scenarios where the data is imbalanced.</p>
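The class-weighting experiment mentioned above can be sketched with inverse-frequency weights, a common scheme for weighted cross-entropy; the paper does not state which weighting formula was used, so this is an illustrative assumption.

```python
from collections import Counter

# One common weighting scheme for an imbalanced loss: weight each class
# by total / (n_classes * count), so each class contributes equally to
# the weighted cross-entropy regardless of its frequency.
def inverse_frequency_weights(labels):
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Toy imbalance mirroring the dataset: 650 WS examples vs. 122 EX.
labels = ["WS"] * 650 + ["EX"] * 122
w = inverse_frequency_weights(labels)
assert w["EX"] > w["WS"]   # the minority class is weighted up
```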
        <p>Two configurations of the QLoRA adapter were tested: rank 64 with alpha 16, and rank 16 with alpha 64.
These configurations were chosen because the original LoRA paper reported that a minimal
rank with a high alpha performs better [24], whereas the QLoRA paper reported that a high rank
with a low alpha performs better [25]. Both configurations incorporated a dropout rate of 0.1 to prevent
overfitting.</p>
        <p>Finally, we trained the classifier on each hidden layer of each LLM to see whether there was a
relationship between class performance and depth.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Additional Details</title>
        <p>We submitted 9 test results using three distinct strategies to assess model performance: ensemble,
high, and low. In the ensemble strategy, we employed a majority vote among the five models derived
through cross-validation. For the high and low strategies, we selected the optimal and sub-optimal
splits obtained from cross-validation.</p>
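The ensemble strategy reduces to a majority vote over the five cross-validation models' predictions. The sketch below uses `Counter`, which breaks ties in favor of the first-seen label; the paper does not specify its tie-breaking rule, so that detail is an assumption.

```python
from collections import Counter

# Majority vote across models: for each test example, take the label
# predicted by the most of the five cross-validation models.
def majority_vote(per_model_preds):
    # per_model_preds: one list of predicted labels per model,
    # all aligned on the same test examples.
    ensembled = []
    for votes in zip(*per_model_preds):
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled

preds = [["IR", "WS", "EX"],
         ["SC", "WS", "EX"],
         ["IR", "WS", "AID"],
         ["IR", "SD", "EX"],
         ["SC", "WS", "EX"]]
assert majority_vote(preds) == ["IR", "WS", "EX"]
```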
        <p>The 36 experiments conducted as part of this study were executed on the Grid’5000 platform, with an
average runtime between 2h00 and 2h30 on an Nvidia A100 GPU (40 GiB). While these values should
be interpreted with caution, they provide a general indication of the cost and resources required to
perform this type of research.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we first look at the results obtained from our various experiments. All these results
should be interpreted with caution. The variance of the results across the 5 splits does not always allow
us to conclude that one model performs better than another, as is often the case with Llama2 and Llama3.
Furthermore, the small size of the dataset and the absence of a test set also limit the interpretation of
the results. Secondly, we look at the results obtained on the test set provided by the Joker shared task’s
organizers and evaluated independently.</p>
      <sec id="sec-4-1">
        <title>4.1. Parameters Results</title>
        <p>The results of our parameter-focused experiments are reported in Table 2. A subset of the Llama2
experiments is reported here, with the full set of results being accessible within the GitHub repository4.
The observations presented below apply equally to the other LLMs we experimented with.</p>
        <p>We can see that balancing the cross-entropy loss did not lead to any significant improvement in the model’s
performance. Despite the theoretical advantages of mitigating class imbalance, our findings
showed that the impact on metrics was significantly worse than expected. Conversely, implementing
a linear scheduling strategy for the learning rate yielded a notable enhancement in the model’s
performance. This approach permitted a more gradual adjustment of the learning rate, which in turn
facilitated better convergence and reduced overfitting, as evidenced by lower validation loss and higher
overall accuracy.</p>
        <p>The application of QLoRA demonstrated substantial improvements across all tested setups. QLoRA
consistently enhanced performance metrics, indicating its robustness. Notably, configurations utilizing
a rank of 16 and an alpha of 64 demonstrated superior results in every experimental setup. This indicates
that the combination of reduced parameter dimensionality with a strong scaling factor can efectively
capture essential features and nuances in the data, thereby enhancing the model’s performances.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. LLMs Results</title>
        <p>In our comparative analysis of LLMs (see Table 3), we observed that Llama2 consistently outperformed
both Llama3 and Mistral in their best setups. This finding underscores that the performance of LLMs is
influenced by factors beyond the mere chronological advancement of the model. Notwithstanding the
more recent architectural developments and potential enhancements in Llama3 and Mistral, Llama2’s
superior performance highlights the importance of specific optimizations and configurations
that can play a critical role in achieving better results.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Classification Results</title>
        <p>In the context of our classification task, we observed varying levels of difficulty among different classes.
The results are reported in Table 4 and Figure 2. In general, it can be observed that the different LLMs
exhibit comparable ease and difficulty across the various classes. This conclusion is supported by the
observation that the confusion matrices are also highly similar.</p>
        <p>It is notable that IR (Irony) and SC (Sarcasm) were challenging to diferentiate, given their subtle
distinctions and overlapping characteristics in textual expressions. Furthermore, EX (Exaggeration)
frequently posed confusion, being misclassified as either irony or sarcasm due to their nuanced nature.
Although instances of ambiguity between AID (Incongruity) and WS (Wit and Surprise) were less
common, they still presented some classification challenges. Interestingly, the overall performance for
WS (Wit and Surprise), SD (Self-Deprecating), and AID (Incongruity) was relatively robust. The high
accuracy in classifying WS (Wit and Surprise) can be attributed to its status as the majority class, which
naturally leads to a more substantial training set and better model performance. In contrast, the strong
results for SD (Self-Deprecating) and AID (Incongruity) were somewhat unexpected, particularly for
SD (Self-Deprecating), which is significantly underrepresented in the dataset. These findings suggest
that while certain classes exhibit inherent complexities leading to misclassification, others (even with
fewer examples) can achieve reliable identification when given appropriate model training.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Qualitative Results</title>
        <p>We have identified two common types of errors that are found regardless of the LLMs. The first type
consists of confusing WS (Wit and Surprise) with AID (Incongruity) when the text consists of a question
followed by an answer. Each text presents a question followed by a clever or unexpected answer,
often relying on homophones, similar-sounding words, or humorous reinterpretations of common
phrases. The humor is derived from the audience’s recognition of the pun or wordplay, resulting in a
light-hearted and amusing effect. The format is simple and straightforward, making it easy to deliver
and understand, typical of classic joke telling. Here are some examples:
• What do you call a fish wearing a crown? King Cod!
• What do you call a doctor who treats retired soldiers? A sawbone.
• Where do the pancakes live? In an apartment.
• What did the janitor say when he jumped out of the closet? Supplies!
• How does a penguin catch a fish? It just waddles down to the grocery store!</p>
        <p>The second type of common mistake is to wrongly predict as SD (Self-Deprecating) some texts
that use the first person. This is understandable, since most SD examples use the first
person, so the model is biased by this feature. Some examples below:
• My poo is green, how festive.
• I’m mad at myself for not taking karate sooner.
• My name is Bet. I am a cutter.
• I always pronounce one word wrong. Wrong.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Hidden Layer Analysis</title>
        <p>Although there is no clear association between the hidden-layer index and the score, it is evident that certain
classes, such as IR (Irony), are sensitive to the depth of the model. Indeed, the performance of
IR on low layers is less optimal for each model. These observations can also be made for EX (Exaggeration)
and SD (Self-Deprecating), although to a lesser extent. Conversely,
there are some classes, such as WS (Wit and Surprise) and AID (Incongruity), that appear to be relatively
stable regardless of the depth of the model. As might be expected, these results align with the overall
performance of each class.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Submission results</title>
        <p>All results submitted for the CLEF 2024 JOKER shared task 2 are presented in Table
8. The four submissions with the highest scores are ours, and all of our submissions are among the top
12. In summary, this indicates that the methodology employed is both effective and consistent. In terms
of macro F1-score, our highest-scoring submission achieved 0.70, while our lowest-scoring
submission achieved 0.604. The second-best approach, apart from ours, achieved
0.638. It is also noteworthy that other approaches using LLMs were submitted. While we lack the
information to compare them properly, it appears that the relatively simple approach we employed is
the most effective.</p>
        <p>The detailed results of our submissions are presented in Tables 5 and 7. The first observation that can
be made is that the performance of the strategies is consistent across LLMs. Specifically, the ensemble
strategy consistently outperforms the high strategy, which in turn outperforms the low strategy most
of the time. This observation is particularly noteworthy, as it suggests that the optimal training split
identified through cross-validation is also the most effective when evaluated on the final test set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, the application of LLM embeddings, a simple classification layer, and a cross-validation
strategy yielded optimal performance on this task. This outcome suggests the potential utility of these
embeddings in complex text classification tasks such as irony and humor categorization.</p>
      <p>Our findings demonstrate the importance of considering more nuanced factors beyond the mere
recency of the model. Llama2 emerges as the superior model, outperforming Llama3 and Mistral
across the majority of configurations. This indicates that factors beyond chronological advancement
influence the efficacy of the models, thus answering RQ2.</p>
      <p>Furthermore, the integration of QLoRA consistently enhanced performance, regardless of the base
model. The incorporation of lower ranks and higher alpha values yielded unexpected yet consistent
improvements, raising further questions with regard to RQ3. Notably, confusions persisted between IR
(Irony) and SC (Sarcasm). Qualitative analyses yielded intriguing insights, particularly regarding the
impact of coronavirus-related texts on correctness. This was observed in IR, SC, and EX, which
demonstrated sensitivity to such mentions. Fluctuations in correctness were observed, suggesting that
different classes exhibited varying degrees of performance.</p>
      <p>Further investigations into the different hidden layers revealed varying sensitivity among
classes to the depth at which representations are extracted. In particular, WS (Wit and Surprise) and AID (Incongruity) demonstrated
resilience, whereas others displayed sensitivity, partially answering RQ4.</p>
      <p>In essence, our study highlights the multifaceted nature of language model performance on complex
classification, underscoring the necessity for comprehensive evaluations encompassing both quantitative
metrics and qualitative considerations to elucidate underlying mechanisms and optimize efficacy in
text classification tasks. It also showed that the general knowledge embedded in LLMs facilitates more
accurate classification of irony and humor genres, even though it is still far from sufficient (RQ1).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a
scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as
well as other organizations (see https://www.grid5000.fr).</p>
      <p>S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan,
W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, B. Zoph,
GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[17] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[18] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. X. Song, J. Steinhardt, Measuring massive multitask language understanding, ArXiv abs/2009.03300 (2020).
[19] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? Try ARC, the AI2 reasoning challenge, ArXiv abs/1803.05457 (2018).
[20] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4791–4800. URL: https://aclanthology.org/P19-1472. doi:10.18653/v1/P19-1472.
[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI blog (2019).
[22] L. Reynolds, K. McDonell, Prompt programming for large language models: Beyond the few-shot paradigm, Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (2021).
[23] H. Cho, H. J. Kim, J. Kim, S.-W. Lee, S.-g. Lee, K. M. Yoo, T. Kim, Prompt-augmented linear probing: Scaling beyond the limit of few-shot in-context learners, ArXiv abs/2212.10873 (2022).
[24] J. E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, ArXiv abs/2106.09685 (2021).
[25] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, ArXiv abs/2305.14314 (2023).
[26] L. Ermakova, T. Miller, A.-G. Bosser, V. M. P. Preciado, G. Sidorov, A. Jatowt, Overview of JOKER - CLEF-2024 track on automatic humor analysis, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[27] L. Ermakova, A.-G. Bosser, T. Miller, T. Thomas-Young, V. M. P. Preciado, G. Sidorov, A. Jatowt, CLEF 2024 JOKER lab: Automatic humour analysis, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, Proceedings, Part VI, volume 14613 of Lecture Notes in Computer Science, Springer, Cham, 2024, pp. 36–43. doi:10.1007/978-3-031-56072-9_5.
[28] M. Bouazizi, T. O. Ohtsuki, A pattern-based approach for sarcasm detection on Twitter, IEEE Access 4 (2016) 5477–5488.
[29] H. Cramér, Mathematical Methods of Statistics (PMS-9), Volume 9, Princeton University Press, Princeton, 1946. URL: https://doi.org/10.1515/9781400883868. doi:10.1515/9781400883868.
[30] B. W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta 405(2) (1975) 442–451.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Shared Task results</title>
      <p>Shared task results per run, with per-class scores for the columns SC, SD, WS, and EX (an asterisk marks our ORPAILLEUR runs).
Run ID
* ORPAILLEUR_mistral-7b-ens
* ORPAILLEUR_mistral-7b-high
* ORPAILLEUR_llama2-7b-ens
* ORPAILLEUR_llama3-8b-ens
CYUT_llama3-fine-tuning
* ORPAILLEUR_llama2-7b-high
* ORPAILLEUR_llama3-8b-low
* ORPAILLEUR_llama2-7b-low
PunDerstand_DeBERTaSampled
* ORPAILLEUR_llama3-8b-high
PunDerstand_GuidedAnnotation
* ORPAILLEUR_mistral-7b-low
PunDerstand_DeBERTa
DadJokers_bert_base_uncased
NLPalma_BERTd
CodingRangers_bert_uncased
Code Rangers_roberta
Demonteam_BERTM
UAms_BERT_ft
NLPalma_PREDCNN
VayamSolveKurmaha_BERT
NaiveNeuron_fastText
NaiveNeuron_llama3:70b_rag-uae
VayamSolveKurmaha_BERT
NaiveNeuron_llama3:70b_rag
DadJokers_RandomForest_MLP_Ensemble
HumourInsights_Random Forest
PunDerstand_GPT4oFewShot
UBO_RubyAiYoungTeam
team1_Petra_and_Regina_LogisticRegression
Dajana&amp;Kathy_Joker_LogisticRegression
team1_FRANE_AND_ANDREA_LogisticRegression
Tomislav&amp;Rowan_SVM
AB&amp;DPV_MLP3000params
DadJokers_RandomForest
CYUT_GPT-4
Tomislav&amp;Rowan_LogisticRegression
AB&amp;DPV_DecisionTreeClassifier
CYUT_roBERTa-fine-tuning
AB&amp;DPV_RandomForestClassifier250
Tomislav&amp;Rowan_NaiveBayes
AB&amp;DPV_RandomForestClassifier500
AB&amp;DPV_GaussianNB
AB&amp;DPV_MLP2000
AB&amp;DPV_MLP3000</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Francesconi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Poletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>Error analysis in a hate speech detection task: The case of haspeede-tw at evalita 2018</article-title>
          , in:
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2481</volume>
          ,
          CEUR-WS
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guibon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sefih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Firsov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Noé-Bienvenu</surname>
          </string-name>
          ,
          <article-title>Multilingual fake news detection with satire</article-title>
          ,
          <source>in: International Conference on Computational Linguistics and Intelligent Text Processing</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Hempelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>SemEval-2017 task 7: Detection and interpretation of English puns</article-title>
          ,
          <source>in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seppi</surname>
          </string-name>
          ,
          <article-title>Humor detection: A transformer gets the last laugh</article-title>
          , ArXiv abs/1909.00252 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Unified humor detection based on sentence-pair augmentation and transfer learning</article-title>
          ,
          <source>in: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ziser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <article-title>Humor detection in product question answering systems</article-title>
          ,
          <source>Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Reyes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          ,
          <article-title>From humor recognition to irony detection: The figurative language of social media</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          <volume>74</volume>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0169023X12000237. doi:10.1016/j.datak.2012.02.005.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lefever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hoste</surname>
          </string-name>
          ,
          <article-title>SemEval-2018 task 3: Irony detection in English tweets</article-title>
          ,
          <source>in: Proceedings of the 12th international workshop on semantic evaluation</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pedrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Panizzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Marco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Scarlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bernardi</surname>
          </string-name>
          ,
          <article-title>EPIC: Multi-perspective annotation of a corpus of irony</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>13844</fpage>
          -
          <lpage>13857</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>