NLPalma Joker 2024: Yet, no Humor with Humorousness -
                                Task 2 Humour Classification According to Genre and
                                Technique⋆
                                Victor Manuel Palma-Preciado,1, 2∗,†, Carolina Palma-Preciado1,† and Grigori
                                Sidorov1

                                1 Instituto Politécnico Nacional (IPN), Centro de Investigacion en Computacion (CIC), Mexico City, Mexico

                                2 Université de Bretagne Occidentale, HCTI, France


                                                Abstract
                                                This work describes our team's participation in JOKER 2024, which focuses on
                                                developing methods for classifying texts that exhibit different humorous techniques
                                                and intentions. Such aspects of humor are often challenging even for human beings
                                                to understand. By classifying humor into these categories, we aim to establish more
                                                robust classification methods that can be applied across various fields of study.
                                                Current models offer high potential for training and fine-tuning on complex tasks
                                                like humor classification, ranging from traditional Convolutional Neural Networks
                                                (CNNs) to the widely used Transformer paradigm of BERT-like models. The results
                                                were mixed across the approaches chosen. Given their performance, we believe the
                                                models can still be optimized and their accuracy improved. Overall, the results are
                                                satisfactory for a first approach using a standard BERT-like model and embeddings
                                                such as USE with a CNN.

                                                Keywords
                                                BERT, Natural language processing, Humour classification, Humour, Wordplays, Jokes.


                                1. Introduction
                                The main objective of this work is to find robust methods to achieve good results in Task 2
                                “Classification According to Genre and Technique” of JOKER CLEF 2024 [5] for different
                                types of humor. In Task 2, the model must classify sentences containing a wide range of
                                humorous constructs into different classes. This task is based on the English dataset of
                                JOKER 2024 [2]. The primary goal is to accurately perform multiclass classification,
                                automatically categorizing text into the following classes: irony, sarcasm, exaggeration,
                                incongruity-absurdity, self-deprecating, and wit-surprise. The aim is to develop a model
                                capable of clearly identifying these classes.
                                   The study of humor is an underexplored topic, making resources such as corpora and
                                models trained in different kinds of humor difficult to obtain or non-existent. Humor is


                                CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                                ∗ Corresponding author.
                                † These authors contributed equally.

                                   victorpapre@gmail.com (A. 1); cpalmap2020@cic.ipn.mx (A. 2); sidorov@cic.ipn.mx (A. 3)
                                    0000-0001-8711-1106 (A. 1); 0000-0003-3253-4464 (A. 2); 0000-0003-3901-3522 (A. 3)
                                           © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


believed to be imbued with cultural characteristics, increasing the complexity of
understanding humorous expressions and making humor subjective and challenging to
tackle.
   Since humans find it difficult to generalize humor, certain features can help machines
understand it in a specific way, as their ability to compute similarities is stronger than that
of humans. Consequently, humor detection methods can infer some aspects of humor better
than humans, who interpret it subjectively. Access to a sufficiently robust dataset and
baseline provides a measure to scale studies in the field of humor [10].
   Adapting various methods and models to achieve the desired results for each class
should be the main approach. It is important to note that each class contains a different
number of examples. Taking this into account can ease the training process by maintaining
balance among the classes. Additionally, we need to consider whether there is sufficient
data to train the model. If not, we should increase the data using other methods to ensure
better results.
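One common way to take unequal class sizes into account during training is inverse-frequency class weighting. The sketch below is only an illustration of that idea, not the procedure used in this work: the label list is a synthetic placeholder, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return weight[c] = n_samples / (n_classes * count[c]) for each class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Synthetic placeholder labels (NOT the real JOKER class distribution).
toy_labels = ["WS"] * 6 + ["IR"] * 3 + ["EX"] * 1

weights = balanced_class_weights(toy_labels)
for cls, w in sorted(weights.items()):
    # Rarer classes receive larger weights.
    print(cls, round(w, 3))
```

Weights of this kind can typically be passed to a training loop so that errors on rare classes count more; Keras, for instance, accepts a `class_weight` dictionary keyed by integer class index in `fit`.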
   Classifying humorous sentences presents a unique challenge because some sentences
unintentionally resemble others, causing confusion. For example, irony and sarcasm can
often blur together, as can incongruity-absurdity and exaggeration, making them tricky to
interpret even for humans. Therefore, we need a method that excels at differentiating these
nuances. Understanding which kinds of sentences the model must tell apart across classes is
crucial, as the similarity between them can confuse the model and introduce noise that must
be addressed.

2. Materials and Methods
This section details the three key stages of the experiments. First, it describes how the
dataset for the classification task was composed and processed. Next, it outlines the selection
process, workflow, and configuration of the models. Finally, it details the resources used.

2.1. Dataset
The dataset provided for Task 2, as previously described, is a multiclass classification
dataset whose texts are labeled with one of six classes: irony (IR), sarcasm (SC), exaggeration
(EX), incongruity-absurdity (AID), self-deprecating (SD), and wit-surprise (WS).
    Since the aim is to compare the different approaches taken by the participants, the
organizers prepared separate training and test subsets for this purpose. The training
set contains 1,742 samples; their distribution among the classes is shown in
Figure 1, where class WS has the most samples and EX has the fewest.
   The models were trained on the English wordplay corpus, which was further divided into
internal training and validation sets with a 70:30 ratio (1,219:523) using stratified
hold-out validation. Each model was then evaluated on the test set provided
by JOKER, comprising 6,642 sentences.
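The stratified 70:30 hold-out split described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the labels are a synthetic placeholder that simply cycles through the six classes, and `stratified_holdout` and the seed are hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, val_fraction=0.30, seed=42):
    """Split sample indices into train/validation, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Synthetic placeholder labels: 1,742 items cycling through the six classes.
# This is NOT the real class distribution shown in Figure 1.
CLASSES = ["IR", "SC", "EX", "AID", "SD", "WS"]
labels = [CLASSES[i % len(CLASSES)] for i in range(1742)]

train_idx, val_idx = stratified_holdout(labels)
print(len(train_idx), len(val_idx))
```

Because the validation size is rounded per class, the totals can differ by a sample or so from the 1,219:523 split reported above. Libraries such as scikit-learn provide the same behavior through `train_test_split(..., stratify=labels)`.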


                       Figure 1: Train dataset frequency by classes.

2.2. Model selection
As part of model selection, different approaches were taken. The treatments are structured
as follows:

   •   Transformers Paradigm Models (BERT, BERT multilingual, DistilBERT, among
       others).
   •   CNN (Convolutional Neural Network) + Universal Sentence Encoder.

Experiments can be time-consuming, and they often need to be repeated to improve results.
Generating embeddings, in particular, can take a long time. More powerful hardware is
therefore needed to reduce computation times and push the models further, allowing
multiple experimental runs simultaneously. In this case, the Google Colab Pro environment
allows the utilization of GPU resources without runtime limits, facilitating the
experimentation stage.
   The best-performing results of our methods were taken for submission. The first
treatments were based on one of the best-known constructs, BERT-like transformers, and the
second on a quite simple CNN structure paired with a quite powerful embedding
representation, the Universal Sentence Encoder (USE).
   A multilingual model such as multilingual BERT, pre-trained on 104 languages, was
initially considered for training [8]. However, to find the best model during evaluation,
it was decided to keep this model and also train a separate one. As a
result, two models were trained: one specifically for English and one with multilingual
capabilities, both utilizing the same BERT architecture [6].
   The models were trained with BERT [4] and multilingual BERT [3], loaded from the
Hugging Face transformers module with the help of the Ktrain wrapper [1], which
facilitated the process of loading and fine-tuning the models quickly and simply.
   Since transformers do not require extensive preprocessing [9], the training process was
relatively straightforward. However, during the fine-tuning phase, it was necessary to
experiment with different parameters to achieve the best results in validation accuracy and
loss. The best models were saved in Keras H5 format for future reference. For the training
of the BERT-like models, the following parameters were utilized: a batch size of 32, 8
training epochs, and a learning rate of 5e-5.
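The reported hyperparameters translate into a fine-tuning setup roughly like the sketch below. It is an assumption-laden illustration rather than the authors' script: the `fine_tune` helper, `maxlen=128`, and the one-cycle schedule are not stated in the text, and ktrain is imported lazily so the arithmetic part runs without it.

```python
import math

# Hyperparameters reported in the text.
BATCH_SIZE = 32
EPOCHS = 8
LEARNING_RATE = 5e-5
N_TRAIN = 1219  # size of the internal training split

# Number of optimizer steps implied by these settings.
steps_per_epoch = math.ceil(N_TRAIN / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
print(steps_per_epoch, total_steps)  # 39 steps per epoch, 312 in total


def fine_tune(x_train, y_train, x_val, y_val, class_names,
              model_name="bert-base-uncased", maxlen=128):
    """Fine-tune a BERT-like classifier with ktrain.

    x_* are lists of strings; y_* are integer class indices into class_names.
    ASSUMPTIONS: maxlen=128 and fit_onecycle are illustrative choices only.
    """
    import ktrain
    from ktrain import text

    t = text.Transformer(model_name, maxlen=maxlen, class_names=class_names)
    trn = t.preprocess_train(x_train, y_train)
    val = t.preprocess_test(x_val, y_val)
    model = t.get_classifier()
    learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                                 batch_size=BATCH_SIZE)
    learner.fit_onecycle(LEARNING_RATE, EPOCHS)
    return learner, t
```

Passing `model_name="bert-base-multilingual-uncased"` would give the multilingual variant under the same settings.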
   For the CNN, Keras was used as a simple and easy way to deploy this structure. For the
embeddings (USE), due to changes on the TensorFlow Hub platform, Kaggle was used to obtain
the model, necessarily with a different checkpoint from the one on TensorFlow Hub. The 512
features produced by the USE model were fed to the network, and a wide variety of
optimizers was tried for the multiclass classification. The design of the network was
influenced by the work of [7], with our own modifications: fewer CNN blocks were used, and
both two and three stacked convolutional blocks were tested. After a few attempts with
different architectures, two blocks of a simple connected convolutional network yielded the
best results, using learning rates of 3e-1 and 5e-6. We obtained mixed results, but compared
with the BERT-like model they were not as good.
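A network of this kind can be sketched in Keras as below. Only the 512-dimensional input, the six output classes, the two convolutional blocks, and the 5e-6 learning rate come from the text; the filter count, kernel size, pooling, and the Adam optimizer are assumptions, and `build_cnn` and `sequence_length_after_blocks` are hypothetical names. TensorFlow is imported lazily so the shape check runs without it.

```python
CNN_CONFIG = {
    "embedding_dim": 512,   # USE features per sentence (from the text)
    "num_classes": 6,       # from the text
    "conv_blocks": 2,       # from the text
    "learning_rate": 5e-6,  # one of the two reported values
    "filters": 64,          # ASSUMPTION
    "kernel_size": 3,       # ASSUMPTION
}


def sequence_length_after_blocks(cfg=CNN_CONFIG):
    """Length of the feature axis after each 'valid' Conv1D + MaxPooling1D(2) block."""
    length = cfg["embedding_dim"]
    for _ in range(cfg["conv_blocks"]):
        length = (length - cfg["kernel_size"] + 1) // 2
    return length


def build_cnn(cfg=CNN_CONFIG):
    """Build a small 1D CNN over precomputed sentence embeddings."""
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential()
    model.add(keras.Input(shape=(cfg["embedding_dim"],)))
    # Treat the embedding vector as a 1D sequence with a single channel.
    model.add(layers.Reshape((cfg["embedding_dim"], 1)))
    for _ in range(cfg["conv_blocks"]):
        model.add(layers.Conv1D(cfg["filters"], cfg["kernel_size"], activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(cfg["num_classes"], activation="softmax"))
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=cfg["learning_rate"]),
        loss="sparse_categorical_crossentropy",  # integer class labels
        metrics=["accuracy"],
    )
    return model


print(sequence_length_after_blocks())
```

The model expects a float array of shape `(n_samples, 512)` holding the precomputed embeddings, together with integer labels in the range 0 to 5.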

2.3. Resources
Google's Colab environment was used to train and evaluate the models. This platform enables
Python programming and execution while also providing easier access to GPUs. GPU resources
allowed faster execution of the described tasks by enabling many simultaneous
computations. The server used has the following specifications: an NVIDIA GPU (NVIDIA-SMI
525.85.12), CUDA v12.0, and 25 GB of RAM. Even for BERT-like models that are less
demanding in resources, it is necessary to consider the data volume and the desired width
of tokenization.

3. Results
This section presents the results of the internal evaluation, which used a 30% split of the
training dataset for validation with both the CNN+USE and BERT models. The
evaluation of the test dataset is described in the task overview document provided by the
organizers.
3.1. CNN+USE
The results of the CNN together with USE text representations are presented in Figure 2.
Compared to the BERT-like model paradigm, the classifier has lower performance but still
achieves some acceptable results. The network's overall performance had a weighted average
accuracy of 47%, which is quite low in general.
   The model struggled to distinguish incongruity-absurdity from exaggeration and irony.
This suggests that the embeddings did not capture the subtle characteristics that
differentiate these types of humor.
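The text does not define how the weighted average is computed. One plausible reading, assumed here, is the support-weighted average of per-class recall read off the confusion matrix, which coincides with overall accuracy. The sketch below checks this on a synthetic three-class matrix; the numbers are placeholders, not results from this work.

```python
def per_class_recall(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def weighted_average(values, supports):
    """Average of values weighted by the number of true samples per class."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# Synthetic 3-class confusion matrix (placeholder, NOT results from this work).
toy_cm = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 1, 7],
]

supports = [sum(row) for row in toy_cm]
recalls = per_class_recall(toy_cm)
weighted_recall = weighted_average(recalls, supports)
accuracy = sum(toy_cm[i][i] for i in range(len(toy_cm))) / sum(supports)

print(round(weighted_recall, 4), round(accuracy, 4))  # both values are 0.75 here
```

Weighted precision or F1 would generally give different numbers, so the metric should be named explicitly when results are compared.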


                   Figure 2: Confusion matrix for the CNN+USE model.

3.2. BERT model
In the case of the BERT model, two runs were conducted: one with the base model and
another with the multilingual version. Both had similar results, with the multilingual model
showing only slight improvement.
   Figure 3 shows the results obtained, which exhibit behavior very similar to the CNN+USE
model. Exaggeration, irony, and sarcasm are the most conflicting classes, so these classes
are expected to have slightly lower performance. The wit-surprise class has similar
results in both models, with the best overall performance. The weighted average
accuracy of the model was 70%.
                      Figure 3: Confusion matrix for the BERT model.

3.3. Best Results
Table 1 showcases the best results achieved with the BERT model. It is noteworthy that the
CNN+USE network's performance was inferior to that of the BERT-like models, so its analysis
is not provided in this study. Each class is represented by a sentence that embodies its
distinct characteristics; to illustrate our findings, we use the top result for each class.
    The table presents the predicted probability for each class alongside the
corresponding text (joke). The most effective methods appear to favor longer, more
anecdotal humorous sentences, all of which have a confidence probability above 90%. This
observation is somewhat corroborated by Table 2, where the worst cases are presented: these
poorer results still maintain a relatively high probability, ranging from about 37% to
51%.
Table 1
Best-classified wordplay by class with BERT
 Text                                                    Class   Probability
 Yogi had a whiskey, water, and tea drink every          IR      0.9944
 night. He was a toddy bear.
 How did the hipster burn his mouth? He ate his          SC      0.9864
 pizza before it was cool.
 Covid 19 coronavirus: Women are claiming 'boobs         EX      0.9577
 get bigger' after having Pfizer jab
 Someone, please help me! I’m way too young to be        AID     0.9818
 this old already.
 I've always known #Bez is my spirit animal but          SD      0.9871
 seeing the mess he makes on confirms it 100%
 what a legend that man is.
 it's all about women in stem struggles. what about      WS      0.9732
 women in interactive media struggles: i no longer
 win against my friends in smash because they
 major in goddamn video game.

While some sentences vividly exemplify their class, others pose ambiguity. This variability
in humor structures poses a challenge for accurate sentence-joke identification. The model
may occasionally struggle to discern highly similar sentences, an issue that appears
somewhat neglected.

Table 2
Worst-classified wordplay by class with BERT
 Text                                                    Class   Probability
 I can't believe today is the last day we can be gay.    IR      0.4298
 I started a band called 999 megabytes. We haven’t       SC      0.5050
 gotten a gig yet.
 Children of Karen's don't get autism because they       EX      0.4932
 weren't vaccinated. They do however have
 hearing problems from listening to their moms
 scream at managers.
 I put my phone on vibrate. An hour later, I finally     AID     0.4845
 received a text message.
 I don’t know anything about Coronavirus other           SD      0.3680
 than if you have it; you get an undeniable urge to
 go the airport.
 The satellite went into orbit on January 1st            WS      0.3714
 causing a new year's revolution.
It appears that sentence length influences classification. Notably, poorly classified instances
tend to be concise one-liners. Hence, detailed descriptions and contextual information could
prove beneficial for each class. Additionally, BERT seems to utilize question marks and
exclamation points for differentiation.
    For instance, it's widely recognized that sentences employing exaggerated humor
typically feature magnifying adjectives. In the case of irony and sarcasm, the fine line
between them is determined by the underlying context. Thus, we posit that considering
these nuances could mitigate misclassifications and errors.
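The per-class best and worst cases reported in Tables 1 and 2 correspond to picking, for each class, the predictions with the highest and lowest probability. The sketch below shows one way to do this; it uses only the twelve (class, probability) pairs printed in the tables, whereas in practice the list would hold every prediction. `extremes_by_class` is a hypothetical helper name.

```python
# (class, probability) pairs copied from Tables 1 and 2.
predictions = [
    ("IR", 0.9944), ("IR", 0.4298),
    ("SC", 0.9864), ("SC", 0.5050),
    ("EX", 0.9577), ("EX", 0.4932),
    ("AID", 0.9818), ("AID", 0.4845),
    ("SD", 0.9871), ("SD", 0.3680),
    ("WS", 0.9732), ("WS", 0.3714),
]


def extremes_by_class(preds):
    """Return {class: (lowest_probability, highest_probability)}."""
    result = {}
    for cls, prob in preds:
        lo, hi = result.get(cls, (prob, prob))
        result[cls] = (min(lo, prob), max(hi, prob))
    return result


extremes = extremes_by_class(predictions)
for cls, (lo, hi) in extremes.items():
    print(f"{cls}: worst={lo:.4f} best={hi:.4f}")
```

Storing the sentence text alongside each probability would also allow sentence length to be compared between the two groups.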

4. Conclusions
The results obtained through the proposed classification approaches show promise, yet
further refinement could strengthen and improve them. Specifically, in the classification of
humorous classes, employing BERT-like models yielded favorable outcomes; however,
there remains room for improvement through more effective fine-tuning and exploration of
diverse variations in BERT-like architectures. In essence, this study has yielded encouraging
findings, demonstrating the potential of transformer-based models in multiclass
classification tasks.
   Although a detailed performance analysis with evaluation metrics is yet to be provided,
the analysis has demonstrated that the model exhibits strong confidence in classifying each
category, even when confronted with a heavily imbalanced original dataset. Addressing this
imbalance by augmenting the data or increasing the sample size for each class could
potentially enhance the model's performance.
   In conclusion, while this study primarily utilized deep learning models such as CNNs and
transformers like BERT, it is worth noting that other architectures remain unexplored and
warrant investigation. The field of machine learning continually evolves, and exploring
diverse models could lead to further insights and advancements.

References
[1] Arun S. Maiya (2020). ktrain: A Low-Code Library for Augmented Machine Learning.
     arXiv preprint arXiv:2004.10703.
[2] Ermakova, L., Miller, T., Bosser, A-G., Palma, V., Sidorov, G., and Jatowt, A. 2024. CLEF
     2024 JOKER Lab: Automatic Humour Analysis. In: Goharian, N., et al. Advances in
     Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14613.
     Springer, Cham. https://doi.org/10.1007/978-3-031-56072-9_5
[3] google-bert/bert-base-multilingual-uncased · Hugging Face. (2001, march 11).
     https://huggingface.co/google-bert/bert-base-multilingual-uncased
[4] google-bert/bert-base-uncased · Hugging Face. (2001, march 11).
     https://huggingface.co/google-bert/bert-base-uncased
[5] Liana Ermakova, Anne-Gwenn Bosser, Tristan Miller, Victor Manuel Palma Preciado,
     Grigori Sidorov, and Adam Jatowt. Overview of JOKER @ CLEF-2024: Automatic humour
     analysis. In Lorraine Goeuriot, Philippe Mulhem, Georges Quénot, Didier Schwab, Laure
     Soulier, Giorgio Maria Di Nunzio, Petra Galuščáková, Alba García Seco de Herrera,
     Guglielmo Faggioli, and Nicola Ferro, editors, Experimental IR Meets Multilinguality,
     Multimodality, and Interaction: Proceedings of the Fifteenth International Conference
     of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Cham, 2024.
     Springer.
[6] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep
     Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.
     Retrieved from http://arxiv.org/abs/1810.04805
[7] Peng-Yu Chen and Von-Wun Soo. 2018. Humor Recognition Using Deep Learning.
     In Proceedings of the 2018 Conference of the North American Chapter of the Association
     for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers),
     pages 113–117, New Orleans, Louisiana. Association for Computational Linguistics.
[8] Pires, T., Schlinger, E., & Garrette, D. (2019). How Multilingual is Multilingual BERT?
     https://doi.org/10.18653/v1/p19-1493
[9] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., &
     Polosukhin, I. (2017). Attention is All you Need. In Advances in Neural Information
     Processing Systems (Vol. 30, pp. 5998-6008). https://arxiv.org/pdf/1706.03762v5
[10] Victor Manuel Palma Preciado, Grigori Sidorov, Carolina Palma Preciado. Assessing
     WordplayPun classification from JOKER dataset with pretrained BERT humorous
     models. JokeR: Automatic Wordplay and Humour Translation, pages 1828-1833,
     CLEF (2022).