<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">NLPalma Joker 2024: Yet, no Humor with Humorousness - Task 2 Humour Classification According to Genre and Technique</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Victor</forename><forename type="middle">Manuel</forename><surname>Palma-Preciado</surname></persName>
							<email>victorpapre@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Centro de Investigacion en Computacion (CIC)</orgName>
								<orgName type="institution">Instituto Politécnico Nacional (IPN)</orgName>
								<address>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Université de Bretagne Occidentale</orgName>
								<address>
									<settlement>HCTI</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Carolina</forename><surname>Palma-Preciado</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Centro de Investigacion en Computacion (CIC)</orgName>
								<orgName type="institution">Instituto Politécnico Nacional (IPN)</orgName>
								<address>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Grigori</forename><surname>Sidorov</surname></persName>
							<email>sidorov@cic.ipn.mx</email>
							<affiliation key="aff0">
								<orgName type="department">Centro de Investigacion en Computacion (CIC)</orgName>
								<orgName type="institution">Instituto Politécnico Nacional (IPN)</orgName>
								<address>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">NLPalma Joker 2024: Yet, no Humor with Humorousness - Task 2 Humour Classification According to Genre and Technique</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">128DEC90B9F433462073B44D6869A221</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>BERT</term>
					<term>Natural language processing</term>
					<term>Humour classification</term>
					<term>Humour</term>
					<term>Wordplays</term>
					<term>Jokes</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following work describes our team's participation in JOKER 2024, which focuses on developing various methods for classifying texts that exhibit different techniques and humorous intentions. Understanding such aspects of humor can often be challenging even for human beings. By classifying humor into these categories, we aim to establish more robust classification methods, which can be applied across various fields of study. Current models offer high potential for training and fine-tuning on complex tasks like humor classification, ranging from the traditional use of Convolutional Neural Networks (CNNs) to the widely used modern Transformer paradigm of BERT-like models. The results were mixed, as different approaches were chosen. We believe that, given their performance, the models can still be optimized and their accuracy improved. Overall, the results are satisfactory for a first approach using the usual BERT-like models and embeddings such as USE with a CNN.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The main objective of this work is to find robust methods to achieve good results in Task 2, "Classification According to Genre and Technique", of JOKER CLEF 2024 <ref type="bibr" target="#b4">[5]</ref> for different types of humor. In Task 2, the model must classify sentences containing a wide range of humorous constructs into different classes. This task is based on the English dataset of JOKER 2024 <ref type="bibr" target="#b1">[2]</ref>. The primary goal is to accurately perform multiclass classification, automatically categorizing text into the following classes: irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating, and wit-surprise. The aim is to develop a model capable of clearly identifying these classes.</p><p>The study of humor is an underexplored topic, making resources such as corpora and models trained on different kinds of humor difficult to obtain or non-existent. Humor is believed to be imbued with cultural characteristics, which increases the complexity of understanding humorous expressions and makes humor subjective and challenging to tackle.</p><p>Since humans find it difficult to generalize humor, certain features can help machines understand it in a specific way, as their ability to compute similarities is stronger than that of humans. Consequently, humor detection methods can infer some aspects of humor better than humans, who interpret it subjectively. Access to a sufficiently robust dataset and baseline provides a measure by which to scale studies in the field of humor <ref type="bibr" target="#b9">[10]</ref>.</p><p>Adapting various methods and models to achieve the desired results for each class should be the main approach. It is important to note that each class contains a different number of examples. Taking this into account can ease the training process by maintaining balance among the classes. Additionally, we need to consider whether there is sufficient data to train the model. If not, we should augment the data using other methods to ensure better results.</p><p>Classifying humorous sentences presents a unique challenge because some sentences unintentionally resemble others, causing confusion. For example, irony and sarcasm can often blur together, as can incongruity-absurdity and exaggeration, making them tricky to interpret even for humans. Therefore, we need a method that excels at differentiating these nuances. Understanding the types of sentences our model needs to discern between each class is crucial, as the similarity between them can hinder and confuse the model, introducing noise that must be addressed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Materials and Methods</head><p>This section details the three key stages of the experiments. First, it describes how the dataset for the classification task was composed and processed. Next, it outlines the selection process, workflow, and configuration of the models. Finally, it details the resources used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Dataset</head><p>The dataset provided for Task 2, as previously described, is a multiclass dataset, as it contains text labeled with six different class labels: irony (IR), sarcasm (SC), exaggeration (EX), incongruity-absurdity (AID), self-deprecating (SD), and wit-surprise (WS).</p><p>Since the aim is to compare the different approaches taken by the participants, the organizers prepared training and test subsets for this purpose. The training dataset comprises 1,742 samples, whose distribution among the classes can be seen in Figure <ref type="figure" target="#fig_0">1</ref>, where class WS has the most samples and EX has the fewest.</p><p>The models were trained on the English corpus of wordplays, which was further divided into an internal training and validation set at a 70:30 ratio (1,219:523) for stratified hold-out validation. Each model was then used to evaluate the test set provided by JOKER, comprising 6,642 sentences. </p></div>
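The stratified 70:30 hold-out split described above can be sketched in pure Python as follows. This is a minimal sketch, not the authors' actual code: the class tags match the paper, but the sample texts and split function are illustrative.

```python
import math
import random
from collections import defaultdict

def stratified_holdout(samples, labels, train_frac=0.7, seed=42):
    """Split (sample, label) pairs into train/validation sets while
    preserving per-class proportions (stratified hold-out)."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    rng = random.Random(seed)
    train, valid = [], []
    for label, items in by_class.items():
        rng.shuffle(items)                      # shuffle within each class
        cut = math.floor(len(items) * train_frac)
        train += [(s, label) for s in items[:cut]]
        valid += [(s, label) for s in items[cut:]]
    return train, valid

# Toy data using the six JOKER class tags (IR, SC, EX, AID, SD, WS).
texts = [f"joke {i}" for i in range(100)]
tags = [["IR", "SC", "EX", "AID", "SD", "WS"][i % 6] for i in range(100)]
train, valid = stratified_holdout(texts, tags)
print(len(train), len(valid))   # 66 34
```

Because the split is done per class, each humour category keeps roughly the same share of samples in both subsets, which is what the 1,219:523 partition aims for.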
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Model selection</head><p>As part of the model selection, different approaches were taken; the treatments are structured as follows:</p><p>• Transformers Paradigm Models (BERT, multilingual BERT, DistilBERT, among others). • CNN (Convolutional Neural Network) + Universal Sentence Encoder.</p><p>Since experiments can be time-consuming, it is often necessary to repeat them to improve results. In particular, generating embeddings can take a long time. Therefore, using increasingly powerful computers becomes a necessity to reduce computation times and push the models further, allowing multiple experiments to run simultaneously. In this case, the Google Colab Pro environment allows the utilization of GPU resources without runtime limits, facilitating the experimentation stage.</p><p>The best-performing methods were taken for submission: the first treatments were based on one of the best-known constructs, BERT-like transformers, and the second on a fairly simple CNN structure paired with a powerful embedding representation (USE).</p><p>A multilingual model such as multilingual BERT, pre-trained on 104 languages, was initially considered for training <ref type="bibr" target="#b7">[8]</ref>. However, with the aim of finding the best model during evaluation, it was decided to keep the model but apply a separate approach. 
As a result, two models were trained: one specifically for English and one with multilingual capabilities, both utilizing the same BERT architecture <ref type="bibr" target="#b5">[6]</ref>.</p><p>The models were trained with BERT <ref type="bibr" target="#b3">[4]</ref> and multilingual BERT <ref type="bibr" target="#b2">[3]</ref>, loaded from the Hugging Face transformers module with the help of the Ktrain wrapper <ref type="bibr" target="#b0">[1]</ref>, which made loading and fine-tuning the models quick and simple.</p><p>Since transformers do not require extensive preprocessing <ref type="bibr" target="#b8">[9]</ref>, the training process was relatively straightforward. However, during the fine-tuning phase, it was necessary to experiment with different parameters to achieve the best validation accuracy and loss. The best models were saved in Keras H5 format for future reference. For training the BERT-like models, the following parameters were used: a batch size of 32, 8 training epochs, and a learning rate of 5e-5.</p><p>In the case of the CNN, Keras was used, as a simple and easy way to deploy this structure. For the embeddings (USE), due to changes on the TensorFlow Hub platform, Kaggle was used to obtain them, with a different checkpoint from the one on TensorFlow Hub; in this case, 512 features were taken from the USE model, and a wide variety of optimizers were tried for the multiclass classification. The design of the network was influenced by the work of <ref type="bibr" target="#b6">[7]</ref>, with our own twist: fewer CNN blocks. After attempts with two and three stacked convolutional blocks, two blocks of a simply connected convolutional network yielded the best results, using learning rates of 3e-1 and 5e-6. We obtained mixed results, but in comparison with the BERT-like model they were not as good.</p></div>
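The BERT fine-tuning workflow described in this section can be sketched with the ktrain wrapper. This is a hedged sketch, not the authors' script: the hyperparameter constants are the ones reported above (batch size 32, 8 epochs, learning rate 5e-5), while `MAX_LEN`, the function name, and the output filename are assumptions for illustration. The heavy ktrain import happens inside the function, so the block itself only defines the configuration.

```python
# Hyperparameters reported in the paper for the BERT-like models.
BATCH_SIZE = 32
EPOCHS = 8
LEARNING_RATE = 5e-5
MAX_LEN = 128          # assumed tokenization width; not stated in the paper
CLASSES = ["IR", "SC", "EX", "AID", "SD", "WS"]

def fine_tune_bert(x_train, y_train, x_val, y_val,
                   model_name="bert-base-uncased"):
    """Fine-tune a Hugging Face BERT checkpoint via ktrain.

    Sketch only: assumes `pip install ktrain` and lists of texts/labels.
    Swap model_name for "bert-base-multilingual-uncased" to get the
    multilingual run.
    """
    import ktrain                     # lazy import: heavy dependency
    from ktrain import text

    t = text.Transformer(model_name, maxlen=MAX_LEN, class_names=CLASSES)
    trn = t.preprocess_train(x_train, y_train)
    val = t.preprocess_test(x_val, y_val)

    model = t.get_classifier()
    learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                                 batch_size=BATCH_SIZE)
    learner.fit_onecycle(LEARNING_RATE, EPOCHS)

    predictor = ktrain.get_predictor(learner.model, preproc=t)
    predictor.model.save("bert_joker.h5")   # Keras H5 format, as in the paper
    return predictor
```

The same function covers both runs (English and multilingual), matching the paper's decision to keep one architecture and vary only the checkpoint.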
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Resources</head><p>Among the resources used to train and evaluate the models, Google's Colab environment was employed. This platform enables Python programming and execution, while also providing easier access to GPUs. The use of GPU resources allowed for faster execution of the described tasks by enabling multiple simultaneous computations. The server used has the following specifications: GPU with NVIDIA-SMI 525.85.12, CUDA v12.0, and 25 GB of RAM. Even with BERT-like models that are less demanding in their use of resources, it is necessary to consider the data volume and the desired tokenization width.</p></div>
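The CNN+USE treatment from Section 2.2 can be sketched in Keras. This is a sketch under stated assumptions, not the authors' network: the 512-dimensional USE input, six output classes, two stacked convolutional blocks, and the 5e-6 learning rate come from the paper, while the filter counts, kernel size, and pooling choices are illustrative. TensorFlow is imported lazily inside the builder.

```python
# Dimensions from the paper: 512-d USE sentence embeddings, six classes.
EMBED_DIM = 512
NUM_CLASSES = 6

def build_cnn_classifier(filters=(64, 32), kernel_size=3, lr=5e-6):
    """Two stacked Conv1D blocks over USE embeddings (the configuration
    the paper found best). Assumes TensorFlow/Keras is installed;
    filter counts and kernel size are hypothetical."""
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        # USE yields one 512-d vector per sentence; reshape so Conv1D
        # can slide over the feature axis.
        layers.Reshape((EMBED_DIM, 1), input_shape=(EMBED_DIM,)),
        layers.Conv1D(filters[0], kernel_size, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(filters[1], kernel_size, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Embeddings would be produced once by the USE checkpoint (obtained from Kaggle, as described above) and fed to this classifier, keeping the expensive embedding step out of the training loop.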
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head><p>This section presents the results of the internal evaluation, which used a 30% split of the training dataset for validation with both the CNN+USE and BERT models. The evaluation of the test dataset is described in the task overview document provided by the organizers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">CNN+USE</head><p>The results of the CNN together with USE for text representations are presented in Figure <ref type="figure" target="#fig_1">2</ref>. Compared to the BERT-like model paradigm, this classifier has lower performance but still achieves some acceptable results. The network's overall performance had a weighted average accuracy of 47%, which is quite low in general.</p><p>The model struggled to distinguish incongruity-absurdity from exaggeration and irony. This suggests that the embeddings did not capture the subtle characteristics that differentiate these types of humor. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">BERT model</head><p>In the case of the BERT model, two runs were conducted: one with the base model and another with the multilingual version. Both yielded similar results, with the multilingual model showing only a slight improvement.</p><p>Figure <ref type="figure" target="#fig_2">3</ref> shows the results obtained, exhibiting behavior very similar to the CNN+USE model. Exaggeration, irony, and sarcasm are the most conflicting classes, so it is expected that these classes have slightly lower performance. The wit-surprise class shows similar results in both models and has the best overall performance. The weighted average accuracy of the model was 70%. </p></div>
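The weighted averages quoted in these subsections can be recovered from confusion matrices like those in Figures 2 and 3. A small pure-Python sketch with made-up counts (not the paper's actual numbers) shows the computation:

```python
def per_class_recall(conf):
    """Diagonal / row-sum for each class of a confusion matrix
    (rows = true class, columns = predicted class)."""
    return [row[i] / sum(row) for i, row in enumerate(conf)]

def weighted_accuracy(conf):
    """Support-weighted average of per-class recall; for this
    row-oriented layout it equals overall accuracy (trace / total)."""
    total = sum(sum(row) for row in conf)
    return sum(conf[i][i] for i in range(len(conf))) / total

# Made-up 3-class matrix, purely for illustration.
conf = [
    [8, 1, 1],   # true class 0
    [2, 6, 2],   # true class 1
    [1, 1, 8],   # true class 2
]
print(per_class_recall(conf))   # [0.8, 0.6, 0.8]
print(weighted_accuracy(conf))  # 0.7333...
```

Weighting by class support matters here because the JOKER training data is imbalanced: a strong majority class (wit-surprise) can mask weak performance on the smaller ones.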
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Best Results</head><p>Table <ref type="table">1</ref> showcases the best results achieved with the BERT model. It is noteworthy that the CNN+USE network's performance was inferior to that of the BERT-like models, so its analysis is not provided in this study. Each class is represented by a sentence that embodies its distinct characteristics; to illustrate our findings, we use the top result for each one.</p><p>As depicted below, the table presents probabilities for each class alongside the corresponding text (joke). It seems that the most effective methods yield results favoring longer, more anecdotal humorous sentences, as all of them have a confidence probability above 90%. This observation is somewhat corroborated by comparison with Table <ref type="table" target="#tab_0">2</ref>, where the worst cases are presented; even the majority of these poorer results maintain a relatively high probability, typically around 40%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Best classified wordplay by class with BERT. While some sentences vividly exemplify their class, others pose ambiguity. This variability in humor structures poses a challenge for accurate sentence-joke identification. The model may occasionally struggle to discern highly similar sentences, an issue that appears somewhat neglected. For instance, it is widely recognized that sentences employing exaggerated humor typically feature magnifying adjectives. In the case of irony and sarcasm, the fine line between them is determined by the underlying context. Thus, we posit that considering these nuances could mitigate misclassifications and errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>The results obtained through the proposed classification approaches show promise, yet further refinement could strengthen and improve them. Specifically, in the classification of humorous classes, employing BERT-like models yielded favorable outcomes; however, there remains room for improvement through more effective fine-tuning and exploration of diverse variations in BERT-like architectures. In essence, this study has yielded encouraging findings, demonstrating the potential of transformer-based models in multiclass classification tasks.</p><p>Although a detailed performance analysis with evaluation metrics is yet to be provided, the analysis has demonstrated that the model exhibits strong confidence in classifying each category, even when confronted with a heavily imbalanced original dataset. Addressing this imbalance by augmenting the data or increasing the sample size for each class could potentially enhance the model's performance.</p><p>In conclusion, while this study primarily utilized deep learning models such as CNNs and transformer architectures like BERT, it is worth noting that other architectures remain unexplored and warrant investigation. The field of machine learning continually evolves, and exploring diverse models could lead to further insights and advancements.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Train dataset frequency by classes.</figDesc><graphic coords="3,148.10,155.98,312.52,249.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Confusion matrix for the CNN+USE model.</figDesc><graphic coords="5,126.60,206.46,355.69,283.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Confusion matrix for the BERT model.</figDesc><graphic coords="6,119.50,85.05,355.69,283.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Worst classified wordplay by class with BERT. It appears that sentence length influences classification. Notably, poorly classified instances tend to be concise one-liners. Hence, detailed descriptions and contextual information could prove beneficial for each class. Additionally, BERT seems to utilize question marks and exclamation points for differentiation.</figDesc><table><row><cell>Text</cell><cell>Class</cell><cell>Probability</cell></row><row><cell>Yogi had a whiskey, water, and tea drink every</cell><cell>IR</cell><cell>0.9944</cell></row><row><cell>night. He was a toddy bear.</cell><cell></cell><cell></cell></row><row><cell>How did the hipster burn his mouth? He ate his</cell><cell>SC</cell><cell>0.9864</cell></row><row><cell>pizza before it was cool.</cell><cell></cell><cell></cell></row><row><cell>Covid 19 coronavirus: Women are claiming 'boobs</cell><cell>EX</cell><cell>0.9577</cell></row><row><cell>get bigger' after having Pfizer jab</cell><cell></cell><cell></cell></row><row><cell>Someone, please help me! I'm way too young to be</cell><cell>AID</cell><cell>0.9818</cell></row><row><cell>this old already.</cell><cell></cell><cell></cell></row><row><cell>I've always known #Bez is my spirit animal but</cell><cell>SD</cell><cell>0.9871</cell></row><row><cell>seeing the mess he makes on confirms it 100%</cell><cell></cell><cell></cell></row><row><cell>what a legend that man is.</cell><cell></cell><cell></cell></row><row><cell>it's all about women in stem struggles. 
what about</cell><cell>WS</cell><cell>0.9732</cell></row><row><cell>women in interactive media struggles: i no longer</cell><cell></cell><cell></cell></row><row><cell>win against my friends in smash because they</cell><cell></cell><cell></cell></row><row><cell>major in goddamn video game.</cell><cell></cell><cell></cell></row><row><cell>Text</cell><cell>Class</cell><cell>Probability</cell></row><row><cell>I can't believe today is the last day we can be gay.</cell><cell>IR</cell><cell>0.4298</cell></row><row><cell>I started a band called 999 megabytes. We haven't gotten a gig yet.</cell><cell>SC</cell><cell>0.5050</cell></row><row><cell>Children of Karen's don't get autism because they</cell><cell></cell><cell></cell></row><row><cell>weren't vaccinated. They do however have hearing problems from listening to their moms</cell><cell>EX</cell><cell>0.4932</cell></row><row><cell>scream at managers.</cell><cell></cell><cell></cell></row><row><cell>I put my phone on vibrate. An hour later, I finally received a text message.</cell><cell>AID</cell><cell>0.4845</cell></row><row><cell>I don't know anything about Coronavirus other</cell><cell></cell><cell></cell></row><row><cell>than if you have it; you get an undeniable urge to</cell><cell>SD</cell><cell>0.3680</cell></row><row><cell>go the airport.</cell><cell></cell><cell></cell></row><row><cell>The satellite went into orbit on January 1st causing a new year's revolution.</cell><cell>WS</cell><cell>0.3714</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">ktrain: A Low-Code Library for Augmented Machine Learning</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Maiya</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.10703</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">CLEF 2024 JOKER Lab: Automatic Humour Analysis</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ermakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A-G</forename><surname>Bosser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Palma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jatowt</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-56072-9_5</idno>
		<ptr target="https://doi.org/10.1007/978-3-031-56072-9_5" />
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval. ECIR 2024</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">N</forename><surname>Goharian</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">14613</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<ptr target="https://huggingface.co/google-bert/bert-base-multilingual-uncased" />
		<title level="m">google-bert/bert-base-multilingual-uncased • Hugging Face</title>
			<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<ptr target="https://huggingface.co/google-bert/bert-base-uncased" />
		<title level="m">google-bert/bert-base-uncased • Hugging Face</title>
			<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Overview of JOKER @ CLEF-2024: Automatic humour analysis</title>
		<author>
			<persName><forename type="first">Liana</forename><surname>Ermakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anne-Gwenn</forename><surname>Bosser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tristan</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Victor</forename><surname>Manuel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Palma</forename><surname>Preciado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Grigori</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Jatowt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">Lorraine</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Philippe</forename><surname>Mulhem</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Georges</forename><surname>Quénot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Didier</forename><surname>Schwab</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Laure</forename><surname>Soulier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Giorgio</forename><surname>Maria</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Di</forename><surname>Nunzio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Petra</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Alba</forename><surname>García Seco De Herrera</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Guglielmo</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Nicola</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno>CoRR, abs/1810.04805</idno>
		<ptr target="http://arxiv.org/abs/1810.04805" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Humor Recognition Using Deep Learning</title>
		<author>
			<persName><forename type="first">Peng-Yu</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Von-Wun</forename><surname>Soo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="113" to="117" />
		</imprint>
	</monogr>
	<note>Short Papers</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">How Multilingual is Multilingual BERT</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pires</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schlinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garrette</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/p19-1493</idno>
		<ptr target="https://doi.org/10.18653/v1/p19-1493" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Attention is All you Need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://arxiv.org/pdf/1706.03762v5" />
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
		<respStmt>
			<orgName>Cornell University ; Cornell University</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">En arXiv</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">M</forename><surname>Palma Preciado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Palma Preciado</surname></persName>
		</author>
		<title level="m">Assessing Wordplay-Pun classification from JOKER dataset with pretrained BERT humorous models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1828" to="1833" />
		</imprint>
	</monogr>
	<note>CLEF</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
