HAHA@IberLEF2021: Humor Analysis using Ensembles of Simple Transformers

Karish Grover (1)* and Tanishq Goel (2)*
(1) Indraprastha Institute of Information Technology, Delhi, India. karish19471@iiitd.ac.in
(2) International Institute of Information Technology, Hyderabad, India. tanishq.goel@research.iiit.ac.in
* All authors contributed equally and are mentioned alphabetically.

Abstract. This paper describes the system submitted to the Humor Analysis based on Human Annotation (HAHA) task at IberLEF 2021. The system achieves the winning F1 score of 0.8850 in the main binary classification task (Task 1), using an ensemble of a pre-trained multilingual BERT, a pre-trained Spanish BERT (BETO), RoBERTa, and a Naive Bayes classifier. We also achieve second place, with macro F1 scores of 0.2916 and 0.3578, in the multi-class and multi-label classification tasks respectively, and third place, with an RMSE of 0.6295, in the regression task.

Keywords: Natural Language Processing · Ensemble Learning · Humor Classification · Pre-trained Models

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Humor Analysis based on Human Annotation (HAHA) 2021 [1] is a challenge that aims to classify Spanish tweets as humorous or not, and to further analyze humor by determining the characteristics of a tweet that contribute to it. The challenge proposes four tasks: classifying tweets as humorous or not, rating the humor present in a tweet, multi-class classification to identify the humor mechanism, and multi-label classification to identify the humor target.

2 Related Work

2.1 Humor Recognition and Rating

Deep learning approaches to humor recognition have become ubiquitous, as in (Chen and Soo, 2018) [2] and (Wang et al., 2020) [3]. (Weller and Seppi, 2019) [4] first proposed the use of transformers for humor detection. Ismailov [5] and Annamoradnejad [6] extended the use of BERT models to humor classification.

2.2 Voting and Ensemble Learning

Voting ensembles combine the predictions of several contributing models, typically by majority vote or averaging. Such ensembles have been applied in domains ranging from early diabetes and heart disease prediction [7] to NLP tasks such as Named Entity Recognition.

3 Data

We were provided with a corpus of crowd-annotated tweets separated into three subsets: training (24,000 tweets), development (6,000 tweets), and testing (6,000 tweets). The columns of the corpus used for training and testing are as follows:

– text: text of the tweet.
– is-humor: binary value (0 or 1) indicating whether the tweet is humorous.
– humor-rating: real value (between 1 and 5) representing the average score the annotators gave to the tweet.
– humor-mechanism: label for the humor mechanism. Only a subset of the tweets have the humor mechanism annotated.
– humor-target: zero or more labels for the humor target, separated by ";".

4 Task Description

This challenge (https://www.fing.edu.uy/inco/grupos/pln/haha/) proposes four sub-tasks:

Humor Detection: classify whether a tweet is humorous.
Funniness Score Prediction: a regression task that rates a tweet in terms of humor.
Humor Mechanism Classification: a multi-class classification task whose primary goal is to predict the mechanism by which the tweet conveys humor.
Humor Target Classification: a multi-label classification task that explores the content of the joke based on its target.
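To make the mapping between the corpus columns and the four sub-tasks concrete, the following is a minimal loading sketch in Python. It assumes the subsets are distributed as CSV files; the file name is illustrative, and the column identifiers from Section 3 are written with underscores instead of hyphens for convenient pandas access.

```python
import pandas as pd

# Assumption: the training subset is a CSV file with this (illustrative) name.
train = pd.read_csv("haha_2021_train.csv")

# Columns from Section 3 (hyphens written as underscores here):
# text, is_humor, humor_rating, humor_mechanism, humor_target
texts = train["text"]

y_task1 = train["is_humor"]       # Task 1: binary humor label (0 or 1)
y_task2 = train["humor_rating"]   # Task 2: average funniness rating in [1, 5]

# Task 3: the humor mechanism is annotated only for a subset of the tweets.
task3 = train.dropna(subset=["humor_mechanism"])

# Task 4: zero or more humor targets per tweet, separated by ";".
targets = train["humor_target"].fillna("").str.split(";")
```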
5 Methodology

We have released our code and experiments for easy replication (https://github.com/TanishqGoel/HAHA-IberLEF2021_Jocoso). All of the following models were fine-tuned using the AdamW optimizer with a learning rate of 4e-5 and a batch size of 8, and were trained on an NVIDIA Tesla T4 GPU.

Fig. 1. Final Ensemble Model for Binary Classification Task (Task 1)

5.1 Task 1: Binary Classification

The results for this task are summarized in Table 2. The baseline provided by the organizers, Naive Bayes with TF-IDF features for binary classification of tweets, achieves an F1 score of 0.6619 on the test corpus. For our final solution, we experimented with a series of ensembles of pre-trained models (Table 1). We use the Simple Transformers ClassificationModel, which wraps a pre-trained model for binary classification.

Table 1. Ensemble Models (Experimentation)

Ensemble ID   Models Used
Jocoso(1)     sBERT + mBERT + BETO + RoBERTa + NB
Jocoso(2)     sBERT + mBERT + ALBERT + BETO + NB + RoBERTa
Jocoso(3)     sBERT + mBERT + BETO + NB
Jocoso(4)     sBERT + mBERT + ALBERT + BETO + NB
Jocoso(5)     mBERT + BETO + sBERT + DeBERTa [8] + NB
Jocoso(6)     mBERT + BETO + ALBERT + sBERT

The final model is based on hard voting over an ensemble of five components: multilingual cased BERT (mBERT) [9], pre-trained on 104 languages including Spanish; BETO [10], a BERT model pre-trained on a large Spanish corpus; a variant of BETO fine-tuned for sentiment analysis (sBETO, listed as sBERT in Table 1), trained on the TASS 2020 corpus (around 5,000 tweets) covering several dialects of Spanish; RoBERTa base, pre-trained on a large English corpus in a self-supervised fashion; and a Multinomial Naive Bayes classifier using TF-IDF features. Some of the other ensembles we experimented with (Table 1) additionally include ALBERT, which was pre-trained on English with a masked language modeling (MLM) objective, and DeBERTa [8]. We use the TensorFlow implementations available on HuggingFace (https://huggingface.co/models). All models were fine-tuned for 3 epochs, and the complete training process took approximately 18 to 20 minutes per model.

Table 2. Task 1 (Results and Experimentation)

Ensemble       F1      Precision  Recall  Accuracy
Jocoso(1)      0.8850  0.9198     0.8526  0.8891
Jocoso(2)      0.8826  0.9194     0.8486  0.8871
Jocoso(3)      0.8822  0.9157     0.8509  0.8863
Jocoso(4)      0.8791  0.9176     0.8436  0.8840
Jocoso(5)      0.8777  0.9221     0.8373  0.8833
Jocoso(6)      0.8758  0.9215     0.8343  0.8816
Second Place   0.8716  -          -       -
Third Place    0.8696  -          -       -
BETO           0.8687  0.9044     0.8356  0.8736
mBERT          0.8561  0.9137     0.8053  0.8646
Baseline       0.6619  -          -       -

While training our models on the given 24,000 tweets, we observed that BETO outperforms all the other pre-trained models. We experimented with various hard-voting ensembles of these pre-trained models, using a 90:10 split of the training corpus without any preprocessing, since we observed that preprocessing reduces the F1 score. We solve this task with a classification voting ensemble, predicting results based on the majority vote of the contributing models (preference is given to BETO and multilingual BERT, which have high individual F1 scores).
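The following is a minimal sketch of this hard-voting scheme, assuming the Simple Transformers and scikit-learn libraries. The number of epochs, learning rate, and batch size follow the paper; the HuggingFace checkpoint names and the tie-breaking rule are illustrative assumptions, and the sentiment-tuned BETO variant (sBETO) from the winning ensemble would be added to the member list in the same way.

```python
import numpy as np
from simpletransformers.classification import ClassificationModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# train_df / test_df are DataFrames with columns ["text", "labels"], labels in {0, 1}.
args = {
    "num_train_epochs": 3,
    "learning_rate": 4e-5,
    "train_batch_size": 8,
    "overwrite_output_dir": True,  # allow several members to train in one run
}

# Checkpoint names are assumptions; BETO is commonly published under dccuchile.
members = [
    ("bert", "bert-base-multilingual-cased"),            # mBERT
    ("bert", "dccuchile/bert-base-spanish-wwm-cased"),    # BETO
    ("roberta", "roberta-base"),                          # RoBERTa
]

votes = []
for model_type, model_name in members:
    model = ClassificationModel(model_type, model_name, num_labels=2, args=args)
    model.train_model(train_df)                           # fine-tune on the 90% split
    preds, _ = model.predict(test_df["text"].tolist())
    votes.append(np.asarray(preds))

# Multinomial Naive Bayes over TF-IDF features, as in the paper.
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(train_df["text"], train_df["labels"])
votes.append(nb.predict(test_df["text"]))

# Hard (majority) voting; ties fall to the non-humorous class here (illustrative).
votes = np.vstack(votes)
final = (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)
```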
5.2 Task 2: Regression

The results for this task are summarized in Table 3. Here the baseline is an SVM with TF-IDF features, which achieves an RMSE of 0.6704 on the test corpus.

Fig. 2. Final Ensemble Model for Regression Task (Task 2)

Table 3. Task 2 (Results and Experimentation)

Ensemble                                               RMSE
First Place Solution                                   0.6226
Second Place Solution                                  0.6246
mBERT + ALBERT + RoBERTa + DistilBERT + BETO + XLNet   0.6295
BETO + mBERT + ALBERT                                  0.6378
BETO + ALBERT                                          0.6391
BETO + DistilBERT                                      0.6397
BETO + XLNet                                           0.6400
BETO + mBERT                                           0.6412
Fourth Place Solution                                  0.6587
Baseline                                               0.6704

In this task, we tried a series of ensembles of pre-trained models, and results are predicted using regression voting ensembles. Each model is combined with a regression head. Our ensemble comprises six pre-trained models, namely multilingual cased BERT (mBERT), ALBERT base v2, RoBERTa base, DistilBERT base cased [11], BETO [10], and the XLNet base cased model [12], followed by regression voting. All models were fine-tuned for 3 epochs, and the complete training process took approximately 10 minutes per model.

5.3 Task 3: Multi-Class Classification

The results of Task 3 are summarized in Table 4. The baseline provided by the organizers for Task 3, Naive Bayes with TF-IDF features, achieves a macro F1 score of 0.1001 over the training corpus.

Table 4. Task 3 (Results and Experimentation)

Models Used                     Macro F1 Score
First Place Solution            0.3396
BETO - Cased                    0.2916
BETO - Cased + BETO - Uncased   0.2636
Third Place Solution            0.2522
Baseline                        0.1001

Our model, with a macro F1 score of 0.2916, uses BETO [10] to solve this multi-class classification problem. We fine-tuned the model on the training corpus for this task, which comprises approximately 4,800 tweets. All models were fine-tuned for 3 epochs, and the complete training process took approximately 4 to 5 minutes per model. When combining BETO cased and uncased, each classifier outputs softmax probabilities over all classes; we take the three highest-probability classes from each model and, from these six candidates, choose the class that appears most often as the final prediction.

5.4 Task 4: Multi-Label Classification

Table 5. Task 4 (Results and Experimentation)

Models Used                      Macro F1 Score
First Place Solution             0.4228
BETO - Cased, Not Preprocessed   0.3578
BETO - Cased, Preprocessed       0.3569
Third Place Solution             0.3225
Baseline                         0.0527

Table 5 summarizes the results achieved by our models in Task 4. The baseline, based on the frequencies of words related to each tag, lies in the range 0.05-0.06. Preprocessing here includes cleaning, tokenizing, and parsing URLs, hashtags, mentions, reserved words (RT, FAV), emojis, and smileys; a sample preprocessor is available at https://pypi.org/project/tweet-preprocessor/. We use the MultiLabelClassificationModel from Simple Transformers for this task. Our final system is a pre-trained cased Spanish BERT model, fine-tuned for 4 epochs on approximately 2,000 tweets; the complete training process took approximately 5 minutes per model.
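As an illustration of this multi-label setup, here is a minimal sketch using Simple Transformers, assuming the ";"-separated humor-target strings are first binarized. The checkpoint name, the variable names (task4_df, test_texts), and the label handling are assumptions for illustration rather than details taken verbatim from the paper.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from simpletransformers.classification import MultiLabelClassificationModel

# task4_df holds the ~2,000 annotated tweets; "humor_target" is a ";"-separated label string.
mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform(task4_df["humor_target"].str.split(";"))

train_df = pd.DataFrame({
    "text": task4_df["text"].tolist(),
    "labels": [row.tolist() for row in label_matrix],  # one binary vector per tweet
})

# Checkpoint name is an assumption; BETO cased is commonly published under dccuchile.
model = MultiLabelClassificationModel(
    "bert",
    "dccuchile/bert-base-spanish-wwm-cased",
    num_labels=len(mlb.classes_),
    args={
        "num_train_epochs": 4,
        "learning_rate": 4e-5,
        "train_batch_size": 8,
        "overwrite_output_dir": True,
    },
)
model.train_model(train_df)

# Predicted binary vectors are mapped back to ";"-joined target labels.
preds, _ = model.predict(test_texts)  # test_texts: list of tweet strings
decoded = [";".join(mlb.classes_[i] for i, v in enumerate(p) if v) for p in preds]
```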
6 Conclusion

This paper describes the winning solution for Task 1, the second-place solutions for Task 3 and Task 4, and the third-place solution for Task 2 in the evaluation phase of the Humor Analysis based on Human Annotation (HAHA) challenge at the Iberian Languages Evaluation Forum (IberLEF) 2021. During the development phase, our models achieved first place in all four tasks. The combined results for both phases are shown in Table 6.

Table 6. Results for both Phases

Phase               Task 1   Task 2   Task 3   Task 4
Development Phase   0.8278   0.6262   0.2760   0.3389
Evaluation Phase    0.8850   0.6296   0.2916   0.3578

In all the tasks, we exploited the power of voting ensembles to obtain strong results. For Task 1, six of our ensemble models outperform the second- and third-place solutions. Similarly, in the other tasks, our models outperform the next-best solutions by a large margin. Further work can be done on preprocessing the Spanish tweets to analyze the effect of various preprocessing methods on humor prediction. An interesting direction is translating the Spanish tweets to English and back to Spanish (back-translation) as a preprocessing method, which remains open for further experimentation and research.

References

[1] Luis Chiruzzo, Santiago Castro, Santiago Góngora, Aiala Rosá, J. A. Meaney, and Rada Mihalcea. Overview of HAHA at IberLEF 2021: Detecting, Rating and Analyzing Humor in Spanish. Procesamiento del Lenguaje Natural, 67(0), 2021.
[2] Peng-Yu Chen and Von-Wun Soo. Humor recognition using deep learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 113-117, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
[3] Minghan Wang, Hao Yang, Ying Qin, Shiliang Sun, and Yao Deng. Unified humor detection based on sentence-pair augmentation and transfer learning. In EAMT, 2020.
[4] Orion Weller and Kevin Seppi. Humor detection: A transformer gets the last laugh. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3621-3625, Hong Kong, China, November 2019. Association for Computational Linguistics.
[5] Adilzhan Ismailov. Humor analysis based on human annotation challenge at IberLEF 2019: First-place solution. In IberLEF@SEPLN, 2019.
[6] Issa Annamoradnejad and Gohar Zoghi. ColBERT: Using BERT sentence embedding for humor detection, 2021.
[7] Khalid Raza. Chapter 8 - Improving the prediction accuracy of heart disease with ensemble learning and majority voting rule. In Nilanjan Dey, Amira S. Ashour, Simon James Fong, and Surekha Borra, editors, U-Healthcare Monitoring Systems, Advances in Ubiquitous Sensing Applications for Healthcare, pages 179-196. Academic Press, 2019.
[8] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations, 2021.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[10] José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, and Jorge Pérez. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020, 2020.
[11] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.
[12] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding, 2020.