<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CLEF 2024 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ORPAILLEUR &amp; SyNaLP at CLEF 2024 Task 2: Good Old Cross Validation for Large Language Models Yields the Best Humorous Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre Epron</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaël Guibon</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Couceiro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INESC-ID, IST, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIPN, Université Sorbonne Paris Nord</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LORIA, Université de Lorraine</institution>
          ,
          <addr-line>CNRS</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>In the context of the JOKER 2024 Task 2 Challenge, this paper presents an approach that leverages the latent representations derived from different Large Language Models (LLMs) to drive a classification mechanism. Our methodology involves exploiting the "knowledge" encoded in LLMs to effectively discriminate humor genres. Experimental results are promising and demonstrate the effectiveness of our approach. However, inherent complexities remain, such as the proximity between certain classes and biases arising from the dataset distributions. These complexities warrant further investigation to refine the classification process and improve overall performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Humor genre classification</kwd>
        <kwd>Large language models</kwd>
        <kwd>Text classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>• RQ4: Which hidden layer depth of LLMs yields the best classification results? We
investigate whether certain layers of the network provide more relevant features for classification and if
this optimal depth is consistent across different LLMs and humor categories.</p>
<p>By addressing these questions, our study aims to deepen the understanding of how LLMs can be
effectively utilized for complex classification tasks and to identify the factors that contribute to their
performance. This research addresses the applicability of LLMs to the complex challenge of humor
classification.</p>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
        <p>
          The necessity for automatic humor detection arises from the increasing influence of conversational
agents and the omnipresence of social media platforms. In the digital realm, where interactions are
increasingly mediated by algorithms, discerning humor has become crucial. This imperative extends to
various applications, including chatbots, recommender systems, social media reputation management,
and the crucial task of identifying and combating fake news and hate speech [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
]. Early efforts in
humor detection primarily focused on the intricate dynamics of wordplay. The seminal evaluation
campaign explored tasks including pun detection, pun location, and pun interpretation [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However,
a significant challenge in this domain has been the scarcity of appropriate training data, particularly
evident for languages beyond English. Recent advances in automatic humor detection have been driven
by the development of contextualized embeddings, which have facilitated a broader recognition of
humor across diverse contexts [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Moreover, the development of multilingual models, which leverage
the pre-trained BERT architecture [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], has expanded the scope of humor recognition to languages such
as Chinese, Russian, and Spanish [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Additionally, there has been a notable shift towards addressing
domain-specific tasks, exemplified by endeavors to identify humorous queries within Q&amp;A systems [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
While the field of irony and sarcasm detection has received considerable attention [
          <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
          ], the area of
automatic humor detection remains a vital and evolving area of research with implications that extend
to various facets of human-computer interaction and online discourse [11, 12].
        </p>
        <p>The landscape of LLM development is undergoing a period of rapid evolution, with the introduction
of new models such as LLaMA2, Mistral, and the GPT family models, including GPT-3 and GPT-4.
Touvron et al. [13] introduced LLaMA2, which builds upon its predecessor by enhancing the model
architecture and training methodologies, thereby achieving improved performance across a range of
natural language processing tasks. Jiang et al. [14] presented Mistral, a model known for its efficiency
and effectiveness, particularly in low-resource settings, demonstrating impressive capabilities in several
benchmarks. Concurrently, the GPT family of models, developed by OpenAI, has made significant
contributions to the field. GPT-3, introduced by Brown et al. [15], set new standards with its 175
billion parameters, enabling unprecedented performance in generating human-like text and performing
complex language tasks with minimal prompt engineering. Building on this, GPT-4 [16] and Llama3 [17],
the latter detailed in its model card, further enhanced these capabilities by incorporating more sophisticated
training techniques and a larger training corpus, resulting in superior performance across a wider range
of applications. These models have played a pivotal role in advancing the state of the art in natural
language understanding and generation, solidifying their position as indispensable tools in the Natural
Language Processing community [18, 19, 20].</p>
<p>Zero-shot and few-shot learning methods enable LLMs to perform tasks with minimal task-specific
training data. Some studies [21, 22] introduced the concept of using LLMs for zero-shot learning,
demonstrating that models can generalize from pre-trained knowledge to new tasks without explicit
training examples, which further enhances their versatility and application scope. Probing and
feature-based fine-tuning involve using LLMs as classifiers by extracting and utilizing internal representations for
specific tasks. A study [23] presented a method where prompts are augmented to probe LLMs for specific
linguistic features, effectively turning them into classifiers for various natural language processing
tasks. This technique demonstrates the adaptability of LLMs in understanding and categorizing complex
linguistic patterns. LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) are
techniques designed to enhance the efficiency of fine-tuning LLMs by reducing the number of
trainable parameters. Hu et al. [24] introduced LoRA, which inserts trainable rank-decomposition
matrices into the layers of the transformer, significantly reducing the computational cost of fine-tuning.
Dettmers et al. [25] subsequently optimized this approach with QLoRA, which incorporates quantization
to further enhance efficiency while maintaining model performance.</p>
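The low-rank update introduced in [24] can be made concrete with a short sketch. The dimensions below are illustrative toys (a real transformer layer is much larger), and the code only demonstrates the idea of a frozen weight plus a trainable rank-r correction, not the authors' implementation.

```python
import numpy as np

# Sketch of the LoRA idea: instead of updating a full weight matrix W
# (d_out x d_in), train a low-rank update B @ A of rank r, scaled by
# alpha / r. B starts at zero, so the adapted layer initially equals
# the frozen pretrained layer.
d_in, d_out, r, alpha = 64, 64, 16, 64

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                # trainable, zero-initialized

def forward(x):
    # Base path plus scaled low-rank path: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# At initialization (B = 0) the adapted layer matches the frozen one,
# and the trainable parameter count (A + B) is far below that of W.
assert np.allclose(forward(x), W @ x)
```

Only A and B are updated during fine-tuning, which is what makes the approach cheap: here 2 · r · 64 = 2,048 trainable parameters against 4,096 in W, and the gap widens rapidly at realistic layer sizes.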
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
<p>The dataset comes from the JOKER 2024 shared task 2 [26, 27]. It consists of 1,742 humorous texts labelled
with 6 different categories.</p>
      <p>• IR - Irony relies on a gap between the literal meaning and the intended meaning, creating a
humorous twist or reversal.
• SC - Sarcasm involves using irony to mock, criticize, or convey contempt.
• EX - Exaggeration involves magnifying or overstating something beyond its normal or realistic
proportions.
• AID - Incongruity refers to the unexpected or contradictory elements that are combined in a
humorous way and Absurdity involves presenting situations, events, or ideas that are inherently
illogical, irrational, or nonsensical.
• SD - Self-deprecating humour involves making fun of oneself or highlighting one’s own flaws,
weaknesses, or embarrassing situations in a lighthearted manner.
• WS - Wit refers to clever, quick, and intelligent humour and Surprise in humour involves
introducing unexpected elements, twists, or punchlines that catch the audience off guard.</p>
      <p>The primary challenge from the dataset is the imbalance in the number of examples for specific classes.
For instance, the WS (Wit and Surprise) class contains 650 examples, while the EX (Exaggeration) class
contains only 122 examples. The complete class distribution is presented in Table 1. Another
significant challenge is the proximity of certain classes, such as irony and sarcasm. Some definitions of
irony include sarcasm as a form of irony [28]. For this corpus, the definition of irony aligns with the
prevailing understanding of situational irony.</p>
<p>The majority of texts have a length between 20 and 40 tokens (Figure 1). However, some texts are
very long, exceeding 500 tokens. Due to GPU memory limitations, we excluded examples with more
than 170 tokens from the training corpus. Additionally, the training set contains duplicated examples.
Some are quasi-duplicates, e.g., the same example appearing with and without quotation marks. We
removed the exact duplicates, but we could not reliably identify and remove the quasi-duplicates. After
cleaning, we retained 1,704 of the original 1,742 examples.</p>
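The cleaning step described above can be sketched as follows. The whitespace tokenizer is a stand-in (the paper counts tokens with the LLMs' own tokenizers), and the list-of-strings corpus is an illustrative simplification of the actual dataset format.

```python
# Sketch of the cleaning pipeline: drop training examples longer than
# 170 tokens, then remove exact duplicates while preserving order.
# Quasi-duplicates (e.g. quotation-mark variants) are deliberately kept,
# as in the paper.
def clean(examples, max_tokens=170):
    seen, kept = set(), []
    for text in examples:
        if len(text.split()) > max_tokens:  # stand-in for a real token count
            continue
        if text in seen:                    # exact duplicates only
            continue
        seen.add(text)
        kept.append(text)
    return kept

corpus = ["Why did the chicken cross the road?",
          "Why did the chicken cross the road?",  # exact duplicate
          "word " * 200]                          # 200 tokens, too long
assert clean(corpus) == ["Why did the chicken cross the road?"]
```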
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-0">
        <title>3.1. LLMs</title>
        <p>Our objective is to explore the potential of advanced LLMs within a consistent methodological framework.
We employed a 4-bit quantized version of three distinct LLMs: Llama2-7b (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf),
Mistral-7b (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), and Llama3-8b (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
These models were selected for their varying characteristics, which allowed for a comparative analysis.
The distinctions between the Mistral and Llama2 models are primarily due to several advanced
optimization techniques employed by Mistral.</p>
        <p>• Sliding Window Attention improves the efficiency of the attention mechanism by focusing
on a moving window of tokens, rather than attending to all tokens at once. This reduces the
computational complexity and enhances the model’s ability to handle longer sequences effectively.
• Rolling Buffer Cache maintains a fixed-size cache of recently processed data, enabling faster retrieval
and processing of these data chunks, thus improving overall model performance.
• Pre-fill and Chunking involve pre-processing and breaking down input data into manageable
chunks, which can be processed more efficiently by the model, leading to better performance in
terms of both speed and accuracy.</p>
        <p>A significant difference between Llama2 and Llama3 lies in their tokenization strategies and
vocabulary sizes.</p>
        <p>• Llama2 and Mistral: Both models utilize the same tokenizer, which is based on Byte-Pair
Encoding (BPE) and implemented using the sentencepiece approach. They share a vocabulary
size of 32,000 tokens. Sentencepiece is a data-driven method that segments text into subword
units, ensuring a balance between word-level and character-level tokenization.
• Llama3: This model employs a new tokenizer with a vastly increased vocabulary size of 128,256
tokens. While it also employs BPE, it uses the tiktoken approach developed by OpenAI for their
GPT models. The key distinction between tiktoken and other tokenizers is its capacity to bypass
the BPE algorithm when a token already exists in the vocabulary, potentially enhancing efficiency
and tokenization speed.</p>
        <p>The difference in vocabulary size cannot be explained by the tokenizer implementation alone.
The larger vocabulary in Llama3 likely reflects differences in the scope and variety of its training
data, although specific details about these datasets are not publicly available.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Classification</title>
        <p>The primary approach involved the addition of a Feed Forward (FF) layer on the final token of the LLM
representation, situated atop the last hidden state. This method sought to leverage the LLMs’ capacity
to generate rich contextual embeddings for downstream tasks. The Causal Language Model (CLM) is
extended in the following manner.</p>
        <p>Consider a sequence of tokens x = (x_1, x_2, …, x_T), where x_t represents the token at position t in
the sequence. The CLM models the probability of the next token given the previous tokens:</p>
        <p>P(x) = ∏_{t=1}^{T} P(x_t | x_1, x_2, …, x_{t−1}).</p>
        <p>The CLM architecture we use in this paper can be streamlined as follows:
• Input Embedding Layer: converts tokens into dense vector representations: e_t = Embed(x_t).
• Positional Encoding: adds positional information to the token embedding to retain the order
of tokens: p_t = PosEnc(t). The input to the model at position t is h_t^0 = e_t + p_t.
• Attention Mechanism: utilizes masked self-attention to ensure that the prediction for position
t only depends on positions 1 to t − 1. The masked attention weights α_{tj} are computed by
α_{tj} = exp(e_{tj}) / ∑_{k=1}^{t−1} exp(e_{tk}),
where e_{tj} is the compatibility function (e.g., dot product of queries and keys) and the mask ensures
that j ≤ t − 1.
• Feed-Forward Layer: applied after the attention mechanism to introduce non-linearity:
h_t^l = FFN(Attn(h^{l−1})_t).
• Final Hidden Layer: the last hidden state at position t is denoted h_t^L. To adapt
the CLM for classification, we use the final hidden state of the last token, h_T^L, as the input to a
classification feed-forward layer.
• Classification Feed-Forward Layer: projects the last hidden state to the class logits:
z = W h_T^L + b,
where z is the logit vector representing unnormalized scores for each class c.
• Output Layer: applies a softmax function to obtain the class probabilities:
P(y = c | x) = softmax(z)_c.
• Training Objective for Classification: the training objective is to minimize the cross-entropy
loss between the predicted class probabilities and the true class labels:
L = −∑_{n=1}^{N} ∑_{c=1}^{C} y_{n,c} log P(y = c | x_n),
where N is the number of training samples, C is the number of classes, and y_{n,c} is a binary indicator
(0 or 1) of whether class c is the correct label for sample n.
• Inference for Classification: during inference, the model predicts the class label by selecting
the class with the highest probability: ŷ = argmax_c P(y = c | x).</p>
        <p>In addition, we experimented with integrating a QLoRA adapter into the query and value components
of the attention heads within the LLMs. QLoRA adapters enable parameter-efficient fine-tuning by
training low-rank updates on top of the quantized base model.</p>
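The classification head on top of the last hidden state can be sketched numerically. The hidden size below is a toy (a 7B model's hidden size is 4096), and the random weights stand in for a trained feed-forward layer; the sketch only traces the logits → softmax → cross-entropy → argmax path described above.

```python
import numpy as np

# Sketch of the classification head: a feed-forward projection of the
# last token's final hidden state h_T^L to logits over the 6 humor
# classes, followed by softmax, cross-entropy, and argmax inference.
hidden, n_classes = 8, 6
rng = np.random.default_rng(0)

h_last = rng.normal(size=hidden)                # h_T^L from the (frozen) LLM
W = rng.normal(size=(n_classes, hidden)) * 0.1  # classification FF weights
b = np.zeros(n_classes)

z = W @ h_last + b                    # logits (unnormalized class scores)
p = np.exp(z - z.max())               # numerically stable softmax
p /= p.sum()

y_true = 2                            # e.g. the gold class index
loss = -np.log(p[y_true])             # cross-entropy for a single sample
pred = int(np.argmax(p))              # inference: highest-probability class
assert np.isclose(p.sum(), 1.0)
```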
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Cross-validation</title>
        <p>Given the limited number of examples in the training set, we used the same data for both training and
validation rather than holding out a fixed validation set. We implemented a stratified 5-fold
cross-validation: each experiment was run 5 times, using 4 folds for training and the remaining fold
for evaluation. This method ensures that each class is proportionally represented in both the training and
validation sets across all folds, thus providing a comprehensive assessment of model performance.</p>
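The stratified protocol can be sketched with a simple round-robin assignment per class; the experiments presumably use a library implementation, so this stdlib version only illustrates how stratification preserves each class's ratio in every fold.

```python
from collections import defaultdict

# Deal each class's examples round-robin into k folds so that every
# fold keeps (approximately) the class proportions of the full set.
# Each fold then serves once as the validation split while the other
# k - 1 folds train the model.
def stratified_folds(labels, k=5):
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    for indices in by_class.values():
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds

# Toy corpus with a 2:1 class imbalance, mimicking WS vs. EX.
labels = ["WS"] * 10 + ["EX"] * 5
folds = stratified_folds(labels, k=5)
for val_fold in folds:
    train = [i for f in folds if f is not val_fold for i in f]
    assert len(train) == 12
    # Every validation fold preserves the 2:1 WS/EX ratio.
    assert sum(labels[i] == "WS" for i in val_fold) == 2
    assert sum(labels[i] == "EX" for i in val_fold) == 1
```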
        <p>During the validation phase, our primary monitoring metric was the Matthews Correlation Coefficient
(MCC) [29, 30]. It is a performance metric for classification that considers all four components of the
confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
It is defined by:</p>
        <p>MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).</p>
        <p>The MCC ranges from -1 to 1, with 1 indicating a perfect prediction, 0 indicating no better than a
random prediction, and -1 indicating total disagreement between the prediction and the actual values.
The MCC is particularly valuable for evaluating models on unbalanced datasets, as it is less susceptible
to the limitations of traditional metrics like accuracy or F-Score, which may give high scores to models
that are biased towards the majority class. The MCC, however, accounts for the balance ratios between
the classes and the quality of the predictions for both classes. By considering all aspects of the confusion
matrix, it provides a more comprehensive evaluation of the classifier’s performance, making it robust
against class imbalance and more reflective of the true predictive capability of the model.</p>
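The binary form of the formula is easy to compute directly from the confusion-matrix counts; the experiments monitor the multiclass generalization from [29, 30], but the binary case below already exhibits the −1 / 0 / +1 behavior described above.

```python
import math

# Binary Matthews Correlation Coefficient computed from the four
# confusion-matrix counts. Returns 0.0 when any marginal is empty,
# a common convention for the degenerate denominator.
def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

assert mcc(10, 10, 0, 0) == 1.0    # perfect prediction
assert mcc(0, 0, 10, 10) == -1.0   # total disagreement
assert mcc(5, 5, 5, 5) == 0.0      # no better than random
```

Because all four counts enter both the numerator and the denominator, a classifier that simply predicts the majority class scores near 0 rather than near its (inflated) accuracy, which is exactly why MCC suits this imbalanced dataset.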
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Training Parameters</title>
        <p>We used a batch size of 16 and a maximum of 10 epochs for training. These parameters were chosen to
balance training efficiency and model performance. Two different learning rates were used: 1e-3 for
the Feed Forward layer and 1.5e-4 for the QLoRA adapter when it was included. The learning rates
were selected based on preliminary experiments and common practices in fine-tuning LLMs. The FF
layer typically benefits from a higher learning rate due to its role in direct task-specific adaptation, while
the QLoRA adapter requires a more conservative rate to ensure stability and effective integration. To
further refine the training process, we experimented with a linear learning rate scheduler and gradient
clipping, techniques known to enhance training stability and performance, especially in the
context of large models. We also conducted an experiment to determine whether class weighting could
improve performance, especially in scenarios where the data is imbalanced.</p>
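The class-weighting experiment mentioned above can be sketched with inverse-frequency weights, a common scheme for weighted cross-entropy; the paper does not state which weighting formula was used, so this is an illustrative assumption.

```python
from collections import Counter

# One common weighting scheme for an imbalanced loss: weight each class
# by total / (n_classes * count), so each class contributes equally to
# the weighted cross-entropy regardless of its frequency.
def inverse_frequency_weights(labels):
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# Toy imbalance mirroring the dataset: 650 WS examples vs. 122 EX.
labels = ["WS"] * 650 + ["EX"] * 122
w = inverse_frequency_weights(labels)
assert w["EX"] > w["WS"]   # the minority class is weighted up
```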
        <p>Two configurations of the QLoRA adapter were tested: rank 64 with alpha 16, and rank 16 with alpha 64.
These configurations were chosen because the original LoRA paper reported that a minimal
rank with a high alpha performs better [24], whereas the QLoRA paper reported that a high rank
with a low alpha performs better [25]. Both configurations incorporated a dropout rate of 0.1 to prevent
overfitting.</p>
        <p>Finally, we trained the classifier on each hidden layer of each LLM to see whether there was a
relationship between class performance and depth.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Additional Details</title>
        <p>We submitted 9 test results using three distinct strategies to assess model performance: ensemble,
high, and low. In the ensemble strategy, we employed a majority vote among the five models derived
through cross-validation. For the high and low strategies, we selected the optimal and sub-optimal
splits obtained from cross-validation.</p>
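The ensemble strategy reduces to a majority vote over the five cross-validation models' predictions. The sketch below uses `Counter`, which breaks ties in favor of the first-seen label; the paper does not specify its tie-breaking rule, so that detail is an assumption.

```python
from collections import Counter

# Majority vote across models: for each test example, take the label
# predicted by the most of the five cross-validation models.
def majority_vote(per_model_preds):
    # per_model_preds: one list of predicted labels per model,
    # all aligned on the same test examples.
    ensembled = []
    for votes in zip(*per_model_preds):
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled

preds = [["IR", "WS", "EX"],
         ["SC", "WS", "EX"],
         ["IR", "WS", "AID"],
         ["IR", "SD", "EX"],
         ["SC", "WS", "EX"]]
assert majority_vote(preds) == ["IR", "WS", "EX"]
```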
        <p>The 36 experiments conducted as part of this study were executed on the Grid’5000 platform, with an
average runtime between 2h00 and 2h30 on an Nvidia A100 GPU (40 GiB). While these values should
be interpreted with caution, they provide a general indication of the cost and resources required to
perform this type of research.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we first look at the results obtained from our various experiments. All these results
should be interpreted with caution. The variance of the results across the 5 splits does not always allow
us to conclude that one model performs better than another, as is often the case with Llama2 and Llama3.
Furthermore, the small size of the dataset and the absence of a test set also limit the interpretation of
the results. Secondly, we look at the results obtained on the test set provided by the Joker shared task’s
organizers and evaluated independently.</p>
      <sec id="sec-4-1">
        <title>4.1. Parameters Results</title>
        <p>The results of our parameter-focused experiments are reported in Table 2. A subset of the Llama2
experiments is reported here, with the full set of results being accessible within the GitHub repository4.
The observations presented below apply equally to the other LLMs we experimented with.</p>
        <p>We can see that balancing the cross-entropy loss did not lead to any significant improvement in the model’s
performance. Despite the theoretical advantages of mitigating class imbalance, our findings
showed that the impact on metrics was significantly worse than expected. Conversely, implementing
a linear scheduling strategy for the learning rate yielded a notable enhancement in the model’s
performance. This approach permitted a more gradual adjustment of the learning rate, which in turn
facilitated better convergence and reduced overfitting, as evidenced by lower validation loss and higher
overall accuracy.</p>
        <p>The application of QLoRA demonstrated substantial improvements across all tested setups. QLoRA
consistently enhanced performance metrics, indicating its robustness. Notably, configurations utilizing
a rank of 16 and an alpha of 64 demonstrated superior results in every experimental setup. This indicates
that the combination of reduced parameter dimensionality with a strong scaling factor can efectively
capture essential features and nuances in the data, thereby enhancing the model’s performances.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. LLMs Results</title>
        <p>In our comparative analysis of LLMs (see Table 3), we observed that Llama2 consistently outperformed
both Llama3 and Mistral in their best setups. This finding underscores that the performance of LLMs is
influenced by factors beyond the mere chronological advancement of the model. Notwithstanding the
more recent architectural developments and potential enhancements in Llama3 and Mistral, Llama2’s
superior performance highlights the importance of specific optimizations and configurations
that can play a critical role in achieving better results.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Classification Results</title>
        <p>In the context of our classification task, we observed varying levels of difficulty among different classes.
The results are reported in Table 4 and Figure 2. In general, it can be observed that the different LLMs
exhibit comparable ease and difficulty across the various classes. This conclusion is supported by the
observation that the confusion matrices are also highly similar.</p>
        <p>It is notable that IR (Irony) and SC (Sarcasm) were challenging to diferentiate, given their subtle
distinctions and overlapping characteristics in textual expressions. Furthermore, EX (Exaggeration)
frequently posed confusion, being misclassified as either irony or sarcasm due to their nuanced nature.
Although instances of ambiguity between AID (Incongruity) and WS (Wit and Surprise) were less
common, they still presented some classification challenges. Interestingly, the overall performance for
WS (Wit and Surprise), SD (Self-Deprecating), and AID (Incongruity) was relatively robust. The high
accuracy in classifying WS (Wit and Surprise) can be attributed to its status as the majority class, which
naturally leads to a more substantial training set and better model performance. In contrast, the strong
results for SD (Self-Deprecating) and AID (Incongruity) were somewhat unexpected, particularly for
SD (Self-Deprecating), which is significantly underrepresented in the dataset. These findings suggest
that while certain classes exhibit inherent complexities leading to misclassification, others (even with
fewer examples) can achieve reliable identification when given appropriate model training.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Qualitative Results</title>
        <p>We have identified two common types of errors that are found regardless of the LLMs. The first type
consists of confusing WS (Wit and Surprise) with AID (Incongruity) when the text consists of a question
followed by an answer. Each text presents a question followed by a clever or unexpected answer,
often relying on homophones, similar-sounding words, or humorous reinterpretations of common
phrases. The humor is derived from the audience’s recognition of the pun or wordplay, resulting in a
light-hearted and amusing effect. The format is simple and straightforward, making it easy to deliver
and understand, typical of classic joke telling. Here are some examples:
• What do you call a fish wearing a crown? King Cod!
• What do you call a doctor who treats retired soldiers? A sawbone.
• Where do the pancakes live? In an apartment.
• What did the janitor say when he jumped out of the closet? Supplies!
• How does a penguin catch a fish? It just waddles down to the grocery store!</p>
        <p>The second type of common mistake is to wrongly predict as SD (Self-Deprecating) some texts
that use the first person. This is understandable, since most SD examples use the first
person, so the model is biased by this feature. Some examples below:
• My poo is green, how festive.
• I’m mad at myself for not taking karate sooner.
• My name is Bet. I am a cutter.
• I always pronounce one word wrong. Wrong.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Hidden Layer Analysis</title>
        <p>Although there is no clear association between the hidden-layer index and the score, it is evident that certain
classes, such as IR (Irony), are sensitive to the depth of the model. Indeed, the performance of
IR on low layers is less optimal for each model. These observations can also be made for EX (Exaggeration)
and SD (Self-Deprecating), although to a lesser extent. Conversely,
there are some classes, such as WS (Wit and Surprise) and AID (Incongruity), that appear to be relatively
stable regardless of the depth of the model. As might be expected, these results align with the overall
performance of each class.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Submission results</title>
        <p>All results submitted for the CLEF 2024 JOKER shared task 2 are presented in Table
8. The four submissions with the highest scores are ours, and all of our submissions are among the top
12. In summary, this indicates that the methodology employed is both effective and consistent. In terms
of macro F1-score, our highest-scoring submission achieved 0.70, while our lowest-scoring
submission achieved 0.604. The second-best approach, apart from ours, achieved
0.638. It is also noteworthy that other approaches using LLMs were submitted. While we lack the
information to compare them properly, it appears that the relatively simple approach we employed is
the most effective.</p>
        <p>The detailed results of our submissions are presented in Tables 5 and 7. The first observation that can
be made is that the performance of the strategies is consistent across LLMs. Specifically, the ensemble
strategy consistently outperforms the high strategy, which in turn outperforms the low strategy most
of the time. This observation is particularly noteworthy, as it suggests that the optimal training split
identified through cross-validation is also the most effective when evaluated on the final test set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, the application of LLM embeddings, a simple classification layer, and a cross-validation
strategy yielded optimal performance on this task. This outcome suggests the potential utility of these
embeddings in complex text classification tasks such as irony and humor categorization.</p>
      <p>Our findings demonstrate the importance of considering more nuanced factors beyond the mere
recency of the model. Llama2 emerges as the superior model, outperforming Llama3 and Mistral
across the majority of configurations. This indicates that factors beyond chronological advancement
influence the efficacy of the models, thus answering RQ2.</p>
      <p>Furthermore, the integration of QLoRA consistently enhanced performance, regardless of the base
model. The incorporation of lower ranks and higher alpha values yielded unexpected yet consistent
improvements, raising further questions with regard to RQ3. Notably, confusions persisted between IR
(Irony) and SC (Sarcasm). Qualitative analyses yielded intriguing insights, particularly regarding the
impact of coronavirus-related texts on correctness. This was observed in IR, SC, and EX, which
demonstrated sensitivity to such mentions. Fluctuations in correctness were observed, suggesting that
different classes exhibited varying degrees of performance.</p>
      <p>Further investigations into the different hidden layers revealed varying sensitivity among
classes to the depth at which representations are extracted. In particular, WS (Wit and Surprise) and AID (Incongruity) demonstrated
resilience, whereas others displayed sensitivity, partially answering RQ4.</p>
      <p>In essence, our study highlights the multifaceted nature of language model performance on complex
classification, underscoring the necessity for comprehensive evaluations encompassing both quantitative
metrics and qualitative considerations to elucidate underlying mechanisms and optimize efficacy in
text classification tasks. It also showed that the general knowledge embedded in LLMs facilitates more
accurate classification of irony and humor genres, even though it is still far from sufficient (RQ1).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a
scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as
well as other organizations (see https://www.grid5000.fr).</p>
      <p>S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan,
W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, B. Zoph,
GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[17] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[18] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. X. Song, J. Steinhardt, Measuring massive multitask language understanding, ArXiv abs/2009.03300 (2020).
[19] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? Try ARC, the AI2 reasoning challenge, ArXiv abs/1803.05457 (2018).
[20] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4791–4800. URL: https://aclanthology.org/P19-1472. doi:10.18653/v1/P19-1472.
[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI blog (2019).
[22] L. Reynolds, K. McDonell, Prompt programming for large language models: Beyond the few-shot paradigm, Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (2021).
[23] H. Cho, H. J. Kim, J. Kim, S.-W. Lee, S.-g. Lee, K. M. Yoo, T. Kim, Prompt-augmented linear probing: Scaling beyond the limit of few-shot in-context learners, ArXiv abs/2212.10873 (2022).
[24] J. E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, ArXiv abs/2106.09685 (2021).
[25] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, ArXiv abs/2305.14314 (2023).
[26] L. Ermakova, T. Miller, A.-G. Bosser, V. M. P. Preciado, G. Sidorov, A. Jatowt, Overview of JOKER - CLEF-2024 track on automatic humor analysis, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[27] L. Ermakova, A.-G. Bosser, T. Miller, T. Thomas-Young, V. M. P. Preciado, G. Sidorov, A. Jatowt, CLEF 2024 JOKER lab: Automatic humour analysis, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, Proceedings, Part VI, volume 14613 of Lecture Notes in Computer Science, Springer, Cham, 2024, pp. 36–43. doi:10.1007/978-3-031-56072-9_5.
[28] M. Bouazizi, T. O. Ohtsuki, A pattern-based approach for sarcasm detection on Twitter, IEEE Access 4 (2016) 5477–5488.
[29] H. Cramér, Mathematical Methods of Statistics (PMS-9), Volume 9, Princeton University Press, Princeton, 1946. URL: https://doi.org/10.1515/9781400883868. doi:10.1515/9781400883868.
[30] B. W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta 405(2) (1975) 442–451.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Shared Task results</title>
      <p>Shared task results per run, with per-class scores for the columns SC, SD, WS, and EX (an asterisk marks our ORPAILLEUR runs).
Run ID
* ORPAILLEUR_mistral-7b-ens
* ORPAILLEUR_mistral-7b-high
* ORPAILLEUR_llama2-7b-ens
* ORPAILLEUR_llama3-8b-ens
CYUT_llama3-fine-tuning
* ORPAILLEUR_llama2-7b-high
* ORPAILLEUR_llama3-8b-low
* ORPAILLEUR_llama2-7b-low
PunDerstand_DeBERTaSampled
* ORPAILLEUR_llama3-8b-high
PunDerstand_GuidedAnnotation
* ORPAILLEUR_mistral-7b-low
PunDerstand_DeBERTa
DadJokers_bert_base_uncased
NLPalma_BERTd
CodingRangers_bert_uncased
Code Rangers_roberta
Demonteam_BERTM
UAms_BERT_ft
NLPalma_PREDCNN
VayamSolveKurmaha_BERT
NaiveNeuron_fastText
NaiveNeuron_llama3:70b_rag-uae
VayamSolveKurmaha_BERT
NaiveNeuron_llama3:70b_rag
DadJokers_RandomForest_MLP_Ensemble
HumourInsights_Random Forest
PunDerstand_GPT4oFewShot
UBO_RubyAiYoungTeam
team1_Petra_and_Regina_LogisticRegression
Dajana&amp;Kathy_Joker_LogisticRegression
team1_FRANE_AND_ANDREA_LogisticRegression
Tomislav&amp;Rowan_SVM
AB&amp;DPV_MLP3000params
DadJokers_RandomForest
CYUT_GPT-4
Tomislav&amp;Rowan_LogisticRegression
AB&amp;DPV_DecisionTreeClassifier
CYUT_roBERTa-fine-tuning
AB&amp;DPV_RandomForestClassifier250
Tomislav&amp;Rowan_NaiveBayes
AB&amp;DPV_RandomForestClassifier500
AB&amp;DPV_GaussianNB
AB&amp;DPV_MLP2000
AB&amp;DPV_MLP3000</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Francesconi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Poletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>Error analysis in a hate speech detection task: The case of haspeede-tw at evalita 2018</article-title>
          , in:
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2481</volume>
          ,
          CEUR-WS
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guibon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sefih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Firsov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Noé-Bienvenu</surname>
          </string-name>
          ,
          <article-title>Multilingual fake news detection with satire</article-title>
          ,
          <source>in: International Conference on Computational Linguistics and Intelligent Text Processing</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Hempelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>SemEval-2017 task 7: Detection and interpretation of English puns</article-title>
          ,
          <source>in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seppi</surname>
          </string-name>
          ,
          <article-title>Humor detection: A transformer gets the last laugh</article-title>
          , ArXiv abs/1909.00252 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Unified humor detection based on sentence-pair augmentation and transfer learning</article-title>
          ,
          <source>in: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ziser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <article-title>Humor detection in product question answering systems</article-title>
          ,
          <source>Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Reyes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Buscaldi</surname>
          </string-name>
          ,
          <article-title>From humor recognition to irony detection: The figurative language of social media</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          <volume>74</volume>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0169023X12000237. doi:10.1016/j.datak.2012.02.005.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Hee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lefever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hoste</surname>
          </string-name>
          ,
          <article-title>SemEval-2018 task 3: Irony detection in English tweets</article-title>
          ,
          <source>in: Proceedings of the 12th international workshop on semantic evaluation</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pedrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Panizzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Marco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Scarlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bernardi</surname>
          </string-name>
          ,
          <article-title>EPIC: Multi-perspective annotation of a corpus of irony</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>13844</fpage>
          -
          <lpage>13857</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>