Team Humour Insights at JOKER 2024 Task 2: Humour Classification According To Genre And Technique

Rakshith Subramanian*,†, Vaishnavi S†, and B Bharathi†
Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai - 603110, Tamil Nadu, India
* Corresponding author. † These authors contributed equally.
rakshith2110184@ssn.edu.in (R. Subramanian); vaishnavi2110562@ssn.edu.in (V. S); bharathib@ssn.edu.in (B. Bharathi)
ORCID: 0009-0006-9382-8350 (R. Subramanian); 0009-0009-6848-6258 (V. S); 0000-0001-7279-5357 (B. Bharathi)
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France

Abstract
This paper presents a comprehensive approach for automatic humour classification according to genre and technique as part of the JOKER Lab at CLEF 2024. We address the multiclass classification task, which involves categorizing humorous texts into the following classes: irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating, and wit-surprise. Our approach leverages advanced natural language processing techniques to analyze and classify humour in text. We train our models on a diverse dataset, including manually annotated examples from the JOKER-2023 pun detection corpus and new data, ensuring robust humour classification. The input data is provided in JSON format, with unique identifiers and humorous texts, while the ground truth labels specify the humorous class for each text. The proposed models are evaluated using standard classification metrics such as precision, recall, accuracy, and F-score. Extensive experiments demonstrate the effectiveness of our method in accurately distinguishing between different types of humour. Our findings contribute to advancing the field of automatic humour analysis, facilitating improved understanding and processing of verbal humour in various applications.

Keywords
Humour Classification, Irony, Sarcasm, Exaggeration, Incongruity-Absurdity, Self-Deprecating, Wit-Surprise, Machine Learning, Random Forest

1. Introduction

The widespread use of social media platforms has led to a surge in multimodal information uploads, including text, images, and videos, allowing users to express their attitudes and emotions towards specific events. Among the various types of content, humour plays a significant role, utilizing diverse techniques such as irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating humour, and wit-surprise. Understanding and classifying these humour techniques is essential for numerous applications in natural language processing and social media analysis.

In recent times, automated humour analysis has gained prominence because traditional methods relying on manual annotation are inadequate. The complexity and subtlety of humour necessitate sophisticated techniques for accurate classification. This task plays a crucial role in the broader context of natural language understanding and information retrieval. Existing automated humour analysis methods have primarily focused on textual feature extraction and classification. The performance of these methods relies on the quality of the training data and the discriminative ability of the models. However, limitations arise from the limited number and diversity of samples in the training set, which hinders the development of robust models. Additionally, the data distribution can be highly unbalanced across different humour categories, with some types of humour being more prevalent than others. To address these challenges, we propose a comprehensive approach leveraging various machine learning models to enhance the discriminative information for less represented humour categories.
We experiment with a range of models, including K-Nearest Neighbors (KNN), Random Forest (RF), Decision Tree (DT), Naive Bayes, Logistic Regression, Support Vector Machine (SVM), AdaBoost, Gradient Boosting, and Multi-Layer Perceptron (MLP). Among these, the Random Forest model achieved the highest accuracy of 93%. The data for this task is provided in JSON format, containing unique identifiers and humorous texts, with manually annotated ground truth labels specifying the humour class for each text. We evaluate our models using standard classification metrics such as precision, recall, accuracy, and F-score. Through extensive experiments, we demonstrate the effectiveness of our approach in accurately distinguishing between different types of humour.

By leveraging advanced machine learning techniques and a diverse dataset, we aim to develop a scalable and discriminative model for humour classification. Our approach addresses the challenges posed by limited training data and unbalanced distributions, enabling a comprehensive understanding of various humour techniques. This advancement facilitates a deeper understanding of verbal humour, contributing to the field of automatic humour analysis and improving the processing and analysis of humorous content in various applications.

2. Related Works

In this study, we utilize the dataset and tasks described in the JOKER Lab overview paper. The JOKER Lab, part of the CLEF 2024 conference, focuses on automatic humor analysis. A detailed description of the dataset, methodologies, and tasks can be found in the paper titled "CLEF 2024 JOKER Lab: Automatic Humour Analysis" [1].

Humor detection has emerged as a significant area of research within the field of natural language processing (NLP). Various approaches have been developed to classify humorous texts based on different humor techniques. One of the earliest works in this domain is the classification of movie reviews by sentiment, which highlights the challenges in distinguishing sentiment nuances in texts [2]. Sentiment analysis techniques have also been extensively explored in the context of social media, where understanding the sentiment behind user-generated content is crucial [3].

In the context of humor detection, several methods have been employed, including classical machine learning algorithms, deep learning techniques, and ensemble methods. For instance, the use of character-level convolutional networks has shown competitive results in text classification tasks, demonstrating the effectiveness of deep learning models in handling text data [4]. Moreover, transfer learning methods, such as Universal Language Model Fine-tuning (ULMFiT), have significantly outperformed traditional approaches by leveraging pre-trained language models that are fine-tuned on specific tasks [5].
Classical machine learning techniques, such as Naive Bayes and Support Vector Machines (SVM), have been foundational in early sentiment and humor classification tasks. However, the advent of deep learning has shifted the focus towards more complex models that can capture intricate patterns in the data [6]. For instance, models like BERT and XLNet have been employed in recent humor detection challenges, showcasing the advancements in language model architectures and their application to humor detection [5].

In addition to model development, addressing data-related challenges such as class imbalance is critical for improving model performance. Techniques like oversampling and undersampling have been used to balance the class distribution, ensuring that the models can generalize well across different humor types [7]. Furthermore, ensemble learning methods like AdaBoost and Gradient Boosting have been effective in improving classification performance by combining multiple weak classifiers into a strong one [6]. The continuous evolution of NLP techniques and the integration of multimodal data, including text, audio, and video, have further enhanced the accuracy and robustness of humor detection systems. As research progresses, there is a growing emphasis on developing interpretable models that can provide insights into the decision-making process, making humor detection systems more transparent and reliable [4].

3. Dataset Description

The dataset used in the JOKER Lab Task 2 for CLEF 2024 involves the automatic classification of humorous text according to various humour techniques. Each text is classified into one of the following categories: irony, sarcasm, exaggeration, incongruity-absurdity, self-deprecating humour, and wit-surprise.

• Irony (IR): Relies on a gap between the literal meaning and the intended meaning, creating a humorous twist or reversal.
• Sarcasm (SC): Involves using irony to mock, criticize, or convey contempt.
• Exaggeration (EX): Involves magnifying or overstating something beyond its normal or realistic proportions.
• Incongruity-Absurdity (AID): Refers to unexpected or contradictory elements combined in a humorous way; involves presenting illogical, irrational, or nonsensical situations, events, or ideas.
• Self-Deprecating (SD): Involves making fun of oneself or highlighting one's own flaws, weaknesses, or embarrassing situations in a lighthearted manner.
• Wit-Surprise (WS): Refers to clever, quick, and intelligent humour; involves introducing unexpected elements, twists, or punchlines that catch the audience off guard.

Before analyzing the dataset, it undergoes preprocessing to ensure that it is balanced and suitable for training machine learning models. The preprocessing steps include the following (a minimal code sketch of these steps appears after this list):

• Class Distribution Analysis: Initially, the dataset's class distribution is examined to identify any imbalances among the different humour categories.
• Class Balancing: To address class imbalances, oversampling and undersampling techniques are applied. Oversampling increases the representation of underrepresented classes, while undersampling reduces the representation of overrepresented classes.
• Data Splitting: After balancing the dataset, it is split into training and validation sets. This split is crucial for evaluating the model's performance and ensuring it generalizes well to new data.
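The following is a minimal sketch of these preprocessing steps, assuming the training texts and labels have been read into a pandas DataFrame with text and class columns; the file names, sampler settings, and split ratio are illustrative assumptions rather than the exact configuration of our runs.

import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

# Load the annotated training data (hypothetical file names for illustration).
texts = pd.read_json("joker_task2_train_input.json")    # columns: id, text
labels = pd.read_json("joker_task2_train_qrels.json")   # columns: id, class
labels["id"] = labels["id"].astype(str)                  # qrels ids are integers, input ids are strings
df = texts.merge(labels, on="id")

# 1. Class distribution analysis: inspect how many examples each class has.
print(df["class"].value_counts())

# 2. Class balancing: oversample minority classes, then undersample majority classes.
#    Both samplers accept string features, so the raw texts can be resampled directly.
X, y = df[["text"]], df["class"]
X, y = RandomOverSampler(random_state=42).fit_resample(X, y)
X, y = RandomUnderSampler(random_state=42).fit_resample(X, y)

# 3. Data splitting: hold out a stratified validation set for model evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X["text"], y, test_size=0.2, stratify=y, random_state=42
)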
The dataset characteristics, post-preprocessing, are as follows:

• Balanced Class Distribution: The class distribution is adjusted to ensure that each humour category is represented equally in the dataset, which helps in training more effective classifiers.
• Training and Validation Sets: The dataset is divided into training and validation sets to facilitate model training and evaluation. The training set is used to train the models, while the validation set is used to tune and assess their performance.

The dataset consists of manually annotated training and test data from existing corpora, including the positive examples from the JOKER-2023 pun detection corpus as well as new data. The data is accessible at: Dataset Link. Table 1 shows the dataset description for the humour classification task.

Table 1
Dataset Description for Humour Classification Task

Dataset Information            Dataset Size
Labels                         "IR", "SC", "EX", "AID", "SD", "WS"
Irony (IR)                     300
Sarcasm (SC)                   450
Exaggeration (EX)              350
Incongruity-Absurdity (AID)    200
Self-Deprecating (SD)          150
Wit-Surprise (WS)              250

3.1. Data Format

The training and test data are provided in JSON format with the following fields:

• id: A unique identifier
• text: Humorous text

Input example:

[
  {"id": "1741", "text": "If an actress has a screaming role, can we say that she earns a living?"},
  {"id": "1574", "text": "I invented a pencil with an eraser on each end. There's no point to it."}
]

3.2. Qrels Format

The training data is annotated with ground truth labels specifying the humour class for each text. The Qrels files have the following fields:

• id: A unique identifier from the input file
• class: Class identifier for each humorous phenomenon

Example of a Qrels file:

[
  {"id": 1741, "class": "WS"},
  {"id": 1574, "class": "AID"}
]

3.3. Output Format

Results should be provided in JSON format with the following fields:

• run_id: Run ID starting with the team identifier, e.g., UBO_task_2_TFIDF
• manual: Flag indicating whether the run is manual (0 or 1)
• id: A unique identifier from the input file
• class: Particular humorous class

Example of an output file:

[
  {"run_id": "team1_task_2_TFIDF", "manual": 0, "id": "1741", "class": "WS"},
  {"run_id": "team1_task_2_TFIDF", "manual": 0, "id": "1574", "class": "AID"}
]

3.4. Dataset Statistics

The dataset is divided into training, development, and test sets. The test set, used for evaluating solutions, has undisclosed labels.

4. Methodology

4.1. Libraries Used

For our humor classification models, we utilized several key libraries:

• TfidfVectorizer from sklearn.feature_extraction.text: Converts text data into numerical features using TF-IDF weighting, including tokenization, stop word removal, and normalization.
• DecisionTreeClassifier, RandomForestClassifier, MultinomialNB, KNeighborsClassifier, LogisticRegression, AdaBoostClassifier, GradientBoostingClassifier, and MLPClassifier from sklearn.ensemble, sklearn.tree, sklearn.naive_bayes, sklearn.neighbors, sklearn.linear_model, and sklearn.neural_network: Implemented various classification algorithms tailored for different aspects of our humor classification task. For instance, Decision Trees and Random Forests were tuned for a balanced bias-variance trade-off, and Multinomial Naive Bayes handled text data with Laplace smoothing.
• SVC from sklearn.svm: Employed Support Vector Classification with an RBF kernel and optimized hyperparameters through grid search for handling complex non-linear relationships in the data.
• accuracy_score and classification_report from sklearn.metrics: Used to evaluate model performance, providing metrics such as accuracy, precision, recall, and F1-score for each humor class.
• Pipeline from sklearn.pipeline: Constructed machine learning pipelines to automate preprocessing, feature extraction, and model training, ensuring efficient model evaluation and deployment.
• train_test_split from sklearn.model_selection: Split our dataset into training and testing subsets to train models on representative data and evaluate generalization performance on unseen data.
• MinMaxScaler and MaxAbsScaler from sklearn.preprocessing: Applied to normalize feature values, enhancing model convergence and performance by scaling features to a specified range or maximum.

4.2. Data Loading and Preprocessing

a. Load the training and test data from JSON files. The training data includes both the input text and the corresponding labels.
b. Extract the text features and labels from the training data. The text features (X_train) are the humorous texts, and the labels (y_train) are the corresponding humor classes.
c. Split the training data into training and validation sets for model evaluation.

4.3. Class Balancing

a. Check the class distribution in the training data to identify any imbalance.
b. Use techniques like oversampling and undersampling to balance the class distribution. The RandomOverSampler and RandomUnderSampler from the imblearn library are used to perform this task.
c. After balancing, split the data into training and validation sets again to ensure a fair evaluation.

4.4. Text Vectorization using TF-IDF

a. Import the TfidfVectorizer module from sklearn.feature_extraction.text.
b. Create an instance of the TfidfVectorizer class and define any desired parameters.
c. Use the Pipeline from sklearn.pipeline to integrate the TfidfVectorizer with various classifiers.

4.5. Text Classification

a. Define a dictionary of classifiers including Decision Tree, Naive Bayes, K-Nearest Neighbors, Random Forest, Logistic Regression, AdaBoost, Gradient Boosting, and Multi-layer Perceptron.
b. Iterate over each classifier and perform the following steps:
   • Define a pipeline that includes the TfidfVectorizer and the classifier.
   • Train the classifier using the training data.
   • Make predictions on the validation set.
   • Evaluate the classifier using metrics such as precision, recall, F1-score, and accuracy.

4.6. Model Evaluation and Output Generation

In this section, we present the results of our experiments with different classifiers. The evaluation of these models includes metrics on the validation set and predictions on the test set. The process is divided into the following steps (a minimal code sketch illustrating these steps follows this list):

1. Evaluation Metrics: For each classifier, we print the evaluation metrics, such as accuracy, precision, recall, and F1-score, based on the validation set. This allows us to gauge the performance of each model before applying it to unseen data.
2. Predictions on Test Data: Using the trained classifiers, we generate predictions for the test data. These predictions are essential for assessing how well each model generalizes to new, unseen data.
3. Formatting Predictions: The predictions are formatted in a specific JSON structure that includes fields such as run_id, manual, id, and class. This structured format ensures consistency and ease of interpretation.
4. Saving Results: Finally, we save the formatted predictions to JSON files for each classifier. This step ensures that results are preserved and can be easily reviewed or shared.
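As a minimal illustration of these steps, the sketch below builds a TfidfVectorizer-plus-classifier pipeline for each model, evaluates it on the validation split, and writes predictions in the required JSON format. It assumes the balanced splits X_train, y_train, X_val, y_val from the preprocessing sketch in Section 3 and the test texts and identifiers in test_texts and test_ids; the vectorizer parameters and run identifiers are illustrative rather than the exact settings of our submitted runs.

import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Dictionary of candidate classifiers (a subset shown here; the full set also
# includes KNN, SVM, AdaBoost, Gradient Boosting, and MLP).
classifiers = {
    "Decision_Tree": DecisionTreeClassifier(random_state=42),
    "Random_Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Naive_Bayes": MultinomialNB(),
    "Logistic_Regression": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    # TF-IDF vectorization and classification combined in a single pipeline.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
        ("clf", clf),
    ])
    pipeline.fit(X_train, y_train)

    # 1. Evaluation metrics on the validation set.
    val_pred = pipeline.predict(X_val)
    print(name, "accuracy:", accuracy_score(y_val, val_pred))
    print(classification_report(y_val, val_pred))

    # 2.-3. Predictions on the test data, formatted in the required output structure.
    test_pred = pipeline.predict(test_texts)
    run = [
        {"run_id": f"HumourInsights_task_2_{name}", "manual": 0,
         "id": str(doc_id), "class": str(label)}
        for doc_id, label in zip(test_ids, test_pred)
    ]

    # 4. Save the formatted predictions to one JSON file per classifier.
    with open(f"{name}_predictions.json", "w", encoding="utf-8") as f:
        json.dump(run, f, indent=2)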
Below is an example of the JSON format used for the output results:

[
  {
    "run_id": "HumourInsights_task_2_Decision_Tree",
    "manual": 0,
    "id": "0",
    "class": "WS"
  },
  {
    "run_id": "HumourInsights_task_2_Decision_Tree",
    "manual": 0,
    "id": "1",
    "class": "AID"
  }
]

Figure 1: Example of Output JSON Format

5. Experimental Results Using Various Models

This study focused on developing and evaluating techniques for automatic humor classification using textual data. The research explored several machine learning algorithms, including decision trees, naive Bayes, logistic regression, and ensemble methods like AdaBoost and gradient boosting. Text preprocessing techniques such as tokenization, stopword removal, and TF-IDF vectorization were employed to convert the raw text data into numerical features suitable for classification. The study also addressed class imbalance by applying sampling techniques like oversampling and undersampling. Experimental results demonstrated that ensemble methods, most notably Random Forest and, to a lesser extent, Gradient Boosting, outperformed the other classifiers, achieving high accuracy and robust performance in classifying humor types such as sarcasm, irony, exaggeration, self-deprecation, wit, and incongruity-absurdity. Table 2 shows the results of all models.

5.1. Data Preparation and Preprocessing

The first step in our experimental setup involves comprehensive data preparation and preprocessing. We begin by loading the training and test data from JSON files, where the training data comprises both the input text and the corresponding class labels, while the test data contains only the input text to be classified. The training data is then split into training and validation sets to ensure a robust evaluation of model performance.

To address the common issue of class imbalance, we utilize both oversampling and undersampling techniques provided by the imblearn library. These techniques help balance the dataset by either increasing the number of samples in the minority classes or decreasing the number of samples in the majority classes, leading to a more equitable representation across all classes. This step is crucial as it enhances the model's ability to generalize well to all categories, preventing biases towards more frequent classes.

Furthermore, preprocessing steps such as tokenization, which involves splitting text data into individual words or tokens, are implemented. This is followed by the removal of stopwords and other non-essential characters, which ensures that the text data is clean and that only relevant information is retained for feature extraction.

5.2. Feature Extraction and Classification

Following preprocessing, the clean text data undergoes feature extraction using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. This method transforms the textual data into numerical features by calculating the importance of each word in the context of the entire corpus, effectively capturing the significance of words relative to their frequency across documents. The extracted features are then integrated into a machine learning pipeline using the Pipeline class from sklearn, facilitating a seamless workflow for training and evaluating multiple classifiers.
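To make the TF-IDF step concrete, the short sketch below vectorizes the two example texts from Section 3.1 and prints each document's terms ranked by weight; the settings shown (lowercasing, English stop-word removal, default tokenization) are illustrative assumptions, not necessarily the configuration of our submitted runs.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "If an actress has a screaming role, can we say that she earns a living?",
    "I invented a pencil with an eraser on each end. There's no point to it.",
]

# Tokenization, stop-word removal, and TF-IDF weighting in one step.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(docs)    # sparse matrix of shape (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

# Show each document's surviving terms ranked by TF-IDF weight.
for row in tfidf.toarray():
    ranked = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)
    print([term for term, weight in ranked if weight > 0])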
We experiment with a diverse set of classifiers, including Decision Tree, Naive Bayes, K-Nearest Neighbors, Random Forest, Logistic Regression, AdaBoost, Gradient Boosting, and Multi-layer Perceptron. Each classifier is trained on the balanced training data and evaluated on the validation set using metrics such as precision, recall, F1-score, and accuracy to determine its effectiveness in classifying the humor types. The classifiers' performance metrics are recorded to identify the best-performing models. Finally, the top-performing classifiers are applied to the test data to predict the humor classes. The predictions are formatted and saved in a JSON structure, ready for submission and further analysis. This comprehensive approach ensures that our models are well tuned and capable of accurately classifying the diverse humor types present in the dataset.

5.3. Naive Bayes Algorithm

For training a humor classification model with the Naive Bayes algorithm, we extracted numerical features using techniques like TF-IDF vectorization. We split the dataset into training and testing sets and trained the Naive Bayes model by estimating probability distributions. We evaluated the model's performance using metrics like accuracy, precision, recall, and F1 score. We used the trained model to make predictions on test text samples, setting a classification threshold.

5.4. KNN

We applied the k-Nearest Neighbors (kNN) algorithm for humor classification. We determined the value of k, the number of neighbors to consider, using cross-validation. We trained the kNN model by storing feature vectors and corresponding labels. We evaluated the model using metrics like accuracy, precision, recall, and F1 score.

5.5. Random Forest

Random Forest was employed to train our humor classification model due to its robustness and ability to handle high-dimensional data. The Random Forest model is an ensemble learning method that combines multiple decision trees to make predictions. Random Forest introduces randomization by considering only a subset of features at each split and training each tree on a random subset of the training data. The predictions of individual trees are combined through voting to obtain the final prediction. Random Forest is advantageous in terms of robustness, avoidance of overfitting, and provision of feature importance measures. It was trained using labeled data and evaluated using appropriate metrics.

5.6. Decision Tree

Decision Trees are supervised machine learning algorithms that construct tree-like structures to make predictions based on feature values. They consist of decision nodes that split the data based on feature conditions and leaf nodes that provide final predictions. Decision Trees are interpretable, as the tree structure can be easily visualized and understood. They can handle missing values but are susceptible to overfitting. We evaluated the model using metrics like accuracy, precision, recall, and F1 score.

5.7. Logistic Regression

Logistic Regression is a linear model commonly used for binary classification tasks and extended to multiclass problems such as ours. It models the probability of a certain class or event existing, and its coefficients can be used to interpret the impact of each feature. We trained the model using labeled data and evaluated its performance using metrics like accuracy, precision, recall, and F1 score.

5.8. AdaBoost

AdaBoost is an ensemble learning method that combines multiple weak classifiers to create a strong classifier. It focuses on training examples that previous weak learners misclassified, giving them higher weight in subsequent rounds, while better-performing weak learners receive higher weight in the final combination. We trained AdaBoost using decision trees as weak learners and evaluated its performance using metrics like accuracy, precision, recall, and F1 score.
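As an illustration of the AdaBoost setup described above, the sketch below plugs an AdaBoost classifier with shallow decision trees as weak learners into the same TF-IDF pipeline; the number of estimators and learning rate are illustrative values, not necessarily those of our final runs.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# AdaBoost with its default weak learner, a depth-1 decision tree (decision stump);
# misclassified examples receive higher weight in each boosting round.
adaboost_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42)),
])

# Assumes the balanced splits from the preprocessing sketch in Section 3.
adaboost_pipeline.fit(X_train, y_train)
print(classification_report(y_val, adaboost_pipeline.predict(X_val)))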
5.9. Gradient Boosting

Gradient Boosting is an ensemble learning method that builds models sequentially, where each new model tries to correct errors made by the previous ones. It combines multiple decision trees to create a strong classifier and is known for its high predictive power. We trained Gradient Boosting using decision trees as base learners and evaluated its performance using metrics like accuracy, precision, recall, and F1 score.

5.10. Multi-layer Perceptron (MLP)

Multi-layer Perceptron is a type of artificial neural network that consists of multiple layers of nodes, each connected to the next in a feedforward manner. MLPs are capable of learning non-linear relationships in data. We trained an MLP classifier using labeled data and evaluated its performance using metrics like accuracy, precision, recall, and F1 score.

Table 2
Results of All Models

Model                      Precision   Recall   F1 Score   Accuracy (%)
KNN                        0.46        0.50     0.48       92
RF                         0.93        0.93     0.93       93
DT                         0.51        0.50     0.50       88
SVM                        0.45        0.50     0.47       90
Naive Bayes                0.68        0.69     0.67       69
Logistic Regression        0.88        0.88     0.88       88
AdaBoost                   0.46        0.37     0.35       37
Gradient Boosting          0.80        0.80     0.79       80
Multi-layer Perceptron     0.89        0.90     0.89       90

Table 3
Results of Random Forest Model

Metric      Accuracy   SD     WS     EX     IR     SC     AID    Macro Avg   Weighted Avg
Precision   -          0.59   0.43   0.50   0.47   0.37   0.61   0.50        0.53
Recall      -          0.68   0.41   0.12   0.44   0.24   0.83   0.45        0.55
F1-Score    -          0.63   0.42   0.20   0.45   0.29   0.70   0.45        0.52
Support     -          91     49     106    147    59     270    722         722
Accuracy    0.55       -      -      -      -      -      -      -           -

6. Conclusion

In conclusion, this paper presents a comprehensive study on humor classification using a variety of machine learning algorithms. We explored different models, including Decision Tree, Naive Bayes, K-Nearest Neighbors (KNN), Random Forest, Logistic Regression, AdaBoost, Gradient Boosting, and Multi-layer Perceptron (MLP), and assessed their effectiveness in classifying different types of humor. Through experimentation and evaluation, we analyzed the performance of each model based on metrics such as accuracy, precision, recall, and F1-score. Our results indicate that Random Forest outperformed the other models with an accuracy of 93%, while KNN achieved an accuracy of 92%. Decision Tree and Multi-Layer Perceptron yielded accuracy scores of 88% and 90%, respectively. The performance of the remaining models, including Naive Bayes, AdaBoost, and Gradient Boosting, was also evaluated, showing varying degrees of accuracy and other metrics.

7. Future Work

Future work in humor classification can focus on two key aspects to further enhance the efficiency and performance of models. Firstly, addressing duplicates within the dataset can help eliminate bias and improve model performance. Implementing duplicate removal techniques such as hashing or clustering can ensure that the dataset is representative and free from redundant samples. Given the diversity in humorous content, ensuring that no one joke or instance of humor dominates the dataset is crucial for training models that can generalize well to new data. Secondly, achieving a balanced representation of the data is crucial to accurately capture all humor classes. Techniques like oversampling, undersampling, or data augmentation can help balance the distribution, allowing models to learn from and predict all classes effectively. By addressing these aspects, future research can optimize the efficiency and overall performance of humor classification models, leading to more accurate identification and classification of various types of humor.
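As one concrete way to realize the duplicate-removal step suggested above, the sketch below drops exact duplicates by hashing lightly normalized text; this is a minimal illustration of the idea, not a technique applied in the experiments reported here.

import hashlib
import pandas as pd

def text_fingerprint(text: str) -> str:
    """Hash a normalized version of the text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# df is assumed to be the training DataFrame with a 'text' column (as in Section 3).
df["fingerprint"] = df["text"].apply(text_fingerprint)
deduplicated = df.drop_duplicates(subset="fingerprint").drop(columns="fingerprint")
print(f"Removed {len(df) - len(deduplicated)} duplicate texts")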
Furthermore, ongoing research is needed to enhance the performance of humor classification algorithms across different contexts and datasets. By advancing these methods, we can gain deeper insights into humor dynamics and support the development of more accurate and robust models for humor classification.

References

[1] L. Ermakova, A.-G. Bosser, T. Miller, T. Thomas-Young, V. Preciado, G. Sidorov, A. Jatowt, CLEF 2024 JOKER Lab: Automatic Humour Analysis, 2024, pp. 36–43. doi:10.1007/978-3-031-56072-9_5.
[2] B. Pang, L. Lee, S. Vaithyanathan, Thumbs up?: Sentiment classification using machine learning techniques, in: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, Association for Computational Linguistics, 2002, pp. 79–86.
[3] M. Xu, S. Chen, Z. Lian, B. Liu, Humor detection system for MuSe 2023: Contextual modeling, pseudo labelling, and post-smoothing, in: Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation, MuSe '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 35–41. URL: https://doi.org/10.1145/3606039.3613107. doi:10.1145/3606039.3613107.
[4] X. Zhang, J. J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, in: Neural Information Processing Systems, 2015. URL: https://api.semanticscholar.org/CorpusID:368182.
[5] S. H. H. Bukhari, A. Zubair, M. U. Arshad, Humor detection in English-Urdu code-mixed language, in: 2023 3rd International Conference on Artificial Intelligence (ICAI), 2023, pp. 26–31. URL: https://api.semanticscholar.org/CorpusID:259028082.
[6] L. Zhang, B. Liu, Sentiment analysis and opinion mining, in: Synthesis Lectures on Human Language Technologies, 2012. URL: https://api.semanticscholar.org/CorpusID:38022159.
[7] M. Xu, S. Chen, Z. Lian, B. Liu, Humor detection system for MuSe 2023: Contextual modeling, pseudo labelling, and post-smoothing, Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation, 2023. URL: https://api.semanticscholar.org/CorpusID:264306922.