Humor Analysis in Spanish Tweets with Multiple Strategies

Lianxi Wang1,2, Xiaotian Lin1, Nankai Lin1, Yingwen Fu1, Kaiying Wu1 and Jiajun Wu1

1 School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
2 Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou, China
neakail@outlook.com

Abstract. In this article, we report the solution of team BERT 4EVER for the Humor Analysis based on Human Annotation task at IberLEF 2021, which aims to identify humorous content from a computational perspective. We adopt a BERT model to tackle the problem and leverage several strategies, including pseudo-label technology, task-adaptive pre-training and ensemble learning, to improve its generalization capability. Experimental results, together with our first place in two of the subtasks on the task leaderboard, demonstrate the effectiveness of our method.

Keywords: Humor Analysis, Multiple Strategies, BERT.

1 Introduction

Humor, a complex phenomenon in human communication that results in amusement or laughter, not only serves to exchange information or share implicit meaning, but also builds a relationship between those exposed to the funny message. However, while humor has historically been studied from psychological, cognitive and linguistic standpoints, there have been only a few attempts to create computational models for humor recognition or generation. Moreover, existing research mainly focuses on high-resource languages such as Chinese and English, and a characterization of humor that allows its automatic recognition and generation is far from being specified.

Fortunately, HAHA @IberLEF 2021 proposes the task "Humor Analysis based on Human Annotation" [1], which aims to gain better insight into what is humorous and what causes laughter, and to go further in the direction of analyzing humor structure and content. Four subtasks are proposed: (1) Humor Detection: determining whether a tweet is a joke or not (i.e., whether humor was intended by the author). (2) Funniness Score Prediction: predicting a funniness score for a tweet on a 5-star scale, assuming it is a joke. (3) Humor Mechanism Classification: for a humorous tweet, predicting the mechanism by which the tweet conveys humor from a set of classes such as irony, wordplay, hyperbole or shock. (4) Humor Target Classification: for a humorous tweet, predicting the content of the joke based on its target (what it is making fun of) from a set of classes such as racist jokes, sexist jokes, dark humor, dirty jokes, etc.

Our team, BERT 4EVER, participated in this task and achieved good results, taking first place in two of the subtasks. In this report, we review our solution: a BERT model aided by pseudo-label technology, task-adaptive pre-training and a teacher-student network with an MSE loss function.

2 Related Work

To the best of our knowledge, existing research on humor detection mainly focuses on identifying whether a text is humorous or not and on rating humor. Previous work mostly casts humor recognition as a classification problem.
Barbieri et al. [2] proposed to train classifiers with a rich set of features and representations, casting irony and humour detection as a classification problem. Chen et al. [3] presented a Convolutional Neural Network (CNN) for humor recognition concentrating on lexical cues. Further, Zhang et al. [4] designed several simple but effective features to capture the emotionality and subjectivity of humorous texts, enabling the model to make full use of contextual knowledge.

Although humor recognition has commonly been regarded as a binary classification task, recent work has moved toward treating humor detection as a relative ranking task. SemEval-2017 Task 6 [5] asked participants to predict the ranking given by a comedy program's audience and producers for humorous tweets submitted to the program. To better identify funnier captions, Shahaf et al. [6] proposed to analyze caption pairs, and further found significant differences between funnier and less funny captions. Regarding the Spanish language, Castro et al. [7] constructed a corpus of 27,000 tweets labeled as humorous or not, with a funniness score from 1 to 5, for Spanish humor recognition tasks. Ortega-Bueno et al. [8] proposed to combine linguistic features with an attention-based recurrent neural network, where the attention layer helps to calculate the contribution of each term towards the targeted humorous classes.

3 Method

3.1 Humor Detection

BERT. BERT [9], a language model based on a bidirectional Transformer encoder, is designed to pretrain deep bidirectional representations from unlabeled text via two unsupervised subtasks (namely Masked Language Model and Next Sentence Prediction), jointly conditioning on both left and right context in all layers. It can therefore be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as text classification, without substantial task-specific architecture modifications. Based on this, we choose BERT, shown in Figure 1, as our language model and apply our various strategies on top of it.

Fig. 1. BERT model.

Task-Adaptive Pre-training. Following Gururangan et al. [10], our approach to task-adaptive pretraining is straightforward: we continue pretraining BERT on the unlabeled training set provided by HAHA @IberLEF 2021. Specifically, we select the data whose labels are marked as 1 (humorous) from the HAHA @IberLEF 2021 training set. Compared with BERT without task-adaptive pre-training, or with pre-training on all of the training data, this uses a smaller but much more task-relevant pretraining corpus, which further improves performance on the task.

MSE Strategy. Hinton et al. [11] pointed out that crudely using one-hot encoded labels may discard the information carried by the relative probabilities of different labels in classification tasks, and proposed using the probabilities output by a teacher network to instruct the re-training of a student network. Inspired by this, we train a teacher model on the training data and use its output probabilities as soft targets: an MSE loss against the teacher probabilities is combined with the cross-entropy loss on the gold labels to train the student model on the same training data.

Five-fold Cross-validation Model Fusion. In our experiments, in order to increase the robustness of the model, we leverage 5-fold cross-validation: we divide the whole dataset into 5 parts, train on 4 of them and validate on the remaining one, and fuse the resulting models into an ensemble with better generalization performance. Afterwards, we take the average of the five models' results as an estimate of the effectiveness of each strategy.
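The description above leaves the relative weighting of the two loss terms unspecified; the following is a minimal PyTorch sketch of the teacher-student objective, where the function name, the batch layout and the balancing weight `alpha` are our own assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def teacher_student_loss(student_logits, teacher_probs, labels, alpha=0.5):
    # Cross-entropy against the gold (hard) labels.
    ce = F.cross_entropy(student_logits, labels)
    # MSE pulling the student's probabilities towards the teacher's soft targets.
    mse = F.mse_loss(F.softmax(student_logits, dim=-1), teacher_probs)
    # alpha is a hypothetical balancing weight, not given in the paper.
    return alpha * ce + (1.0 - alpha) * mse

# Sketch of one training step, assuming `teacher` is the already fine-tuned
# BERT classifier and `student` is a fresh model being re-trained:
# with torch.no_grad():
#     teacher_probs = F.softmax(teacher(**batch).logits, dim=-1)
# loss = teacher_student_loss(student(**batch).logits, teacher_probs, batch["labels"])
# loss.backward()
```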
3.2 Funniness Score Prediction

For this task, our method combines three models, each trained with five-fold cross-validation: XGBoost with a word frequency matrix, LightGBM with TF-IDF features, and BERT.

XGBoost. XGBoost [12], a highly effective and widely used machine learning method, achieves state-of-the-art results on many machine learning challenges and regression tasks. Given its scalability and algorithmic optimizations, we choose this model with a word frequency matrix as text features.

LightGBM. LightGBM [13], a gradient boosting model in the same family as XGBoost, outperforms XGBoost in terms of accuracy and speed: to address XGBoost's shortcomings, it introduces two effective methods, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). In this paper, we use this model with TF-IDF text features for the Funniness Score Prediction task.

Multiple Model Fusion. Different models usually focus on different information about the same task, which leads to performance differences on regression tasks. Therefore, in order to further increase the robustness of the model, we merge these three models, each with five-fold cross-validation, for the Funniness Score Prediction task (a possible fusion scheme is sketched in Section 4.2).

3.3 Humor Mechanism Classification

Although manually annotating a dataset is expensive, it is relatively easy to collect massive amounts of unlabeled data in the target domain. Hence, it is desirable to improve the generalization of the model by leveraging unlabeled data 𝐷𝑈 together with limited labeled data 𝐷𝐿 [14, 15]. In this competition, only 4800 samples were annotated with a humor mechanism tag; these are regarded as the limited labeled data 𝐷𝐿, and the rest of the data is treated as the unlabeled data 𝐷𝑈.

In this task, we also use the task-adaptive pretraining model from the humor detection task. We first use the data 𝐷𝐿 to train a humor mechanism classification model 𝑀1. We then use 𝑀1 to predict labels for the unlabeled data 𝐷𝑈 and keep the samples whose label probability is greater than 0.8. Through this strategy, we obtain a total of 1940 pseudo-labeled samples 𝐷𝑃. We merge the labeled data 𝐷𝐿 and the pseudo-labeled data 𝐷𝑃 to train a new model 𝑀2 with stronger generalization capability. In the final evaluation phase, we use 𝑀2 to predict the test data.

3.4 Humor Target Classification

Similar to the humor mechanism classification task, we use pseudo-label technology to solve this task; however, here we do not use the task-adaptive pretraining model from the humor detection task. Since humor target classification is a multi-label classification task, i.e., a sample may have zero or more humor target labels, we add a "None" label to the original label set. In addition, since many of the 4800 annotated samples have no labels, when generating the pseudo-labeled dataset we only select samples whose prediction results contain one or more labels. The resulting pseudo-labeled dataset contains a total of 774 samples.
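As a concrete illustration of the pseudo-labeling step described in Sections 3.3 and 3.4, the sketch below assumes a `predict_proba`-style wrapper around the fine-tuned model 𝑀1; the wrapper interface and variable names are ours, not from the paper.

```python
import numpy as np

def build_pseudo_labels(model, unlabeled_texts, threshold=0.8):
    """Keep only the unlabeled samples the model is confident about;
    `model.predict_proba` is assumed to return one softmax row per text."""
    probs = np.asarray(model.predict_proba(unlabeled_texts))  # (n, n_classes)
    confidence = probs.max(axis=1)
    keep = confidence > threshold          # the 0.8 threshold from Section 3.3
    texts = [t for t, k in zip(unlabeled_texts, keep) if k]
    labels = probs.argmax(axis=1)[keep]
    return texts, labels                   # the pseudo-labeled set D_P

# D_P is then merged with the labeled set D_L to train the new model M2.
```

For the multi-label humor target task, the same idea would apply per label: a sample is kept only if at least one label's probability exceeds the threshold.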
4 Experiment

4.1 Experiment Settings

We use the Transformers library (https://github.com/huggingface/transformers) with PyTorch (https://github.com/pytorch/pytorch) as backend to build the BERT-based models, and scikit-learn (https://github.com/scikit-learn/scikit-learn) to build the machine learning models. We use BETO [16] (https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased) as our base model. The hyperparameters are shown in Table 1.

Table 1. Hyperparameters.

Parameter                          Value
Learning Rate                      5e-5
Batch Size                         16
Epoch                              15
Optimizer                          Adam
Task-Adaptive Pre-training Epoch   3

4.2 Results

To fairly compare the effectiveness of the different strategies, all experiments use the 5-fold cross-validation ensemble described in Section 3.1.

The experimental results for the humor detection task are shown in Table 2. They demonstrate the effectiveness of both the task-adaptive pre-training strategy and the MSE strategy. The model that uses the two strategies at the same time performs best, with an F1 value of 86.45% on the final test set.

Table 2. Results for Humor Detection Task.

Method                                             F1-validation   F1-test
BERT                                               83.06%          -
BERT with pretrained                               85.35%          85.00%
Teacher-student network with pretrained and MSE*   86.32%          86.45%

* We only used the fusion of four folds as the final submission result, since the fifth fold did not perform well.

Table 3. Results for Funniness Score Prediction Task.

Method                 RMSE-validation   RMSE-test
BERT                   0.6471            -
BERT with pretrained   0.6470            0.6673
XGBoost                0.6470            0.6615
LightGBM               0.6532            0.6643
Merge                  -                 0.6587

The results in Table 3 show that, on the funniness score prediction task, XGBoost performs best among the three base models, with 0.6470 and 0.6615 on the validation and test sets, respectively. The multiple model fusion strategy brought a further improvement: its test-set result of 0.6587 ranked fourth among all teams.
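The paper does not state how the three models' five-fold predictions are combined in the "Merge" row of Table 3; the sketch below assumes a simple unweighted average over all folds and all models, where `model_factories` is a list of callables returning fresh regressors with a scikit-learn style fit/predict interface (the BERT component would need a thin wrapper of this form).

```python
import numpy as np
from sklearn.model_selection import KFold

def fused_predictions(model_factories, X_train, y_train, X_test, n_splits=5):
    """Train every model on every fold and average all test-set
    predictions (an assumed fusion scheme, not confirmed by the paper)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    preds = []
    for train_idx, _ in kf.split(X_train):
        for make_model in model_factories:
            model = make_model()  # fresh XGBoost / LightGBM / BERT wrapper
            model.fit(X_train[train_idx], y_train[train_idx])
            preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)  # the "Merge" prediction
```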
Table 4. Results for Humor Mechanism Classification Task.

Method                                  F1-validation   F1-test
BERT                                    26.50%          29.99%
BERT with pretrained                    28.50%          32.27%
BERT with pretrained and pseudo-label   33.67%          33.96%

Table 5. Results for Humor Target Classification Task.

Method                   F1-validation   F1-test
BERT                     26.50%          37.20%
BERT with pretrained     23.47%          29.42%
BERT with pseudo-label   29.31%          42.28%

We achieved first place on the leaderboard in both the humor mechanism classification task and the humor target classification task. On both the validation and test sets, pseudo-label technology brings significant improvements to the model (Tables 4 and 5).

5 Conclusion

For the humor analysis tasks on Spanish tweets at HAHA @IberLEF 2021, we adopt a monolingual pre-trained Spanish BERT model as our base model and fine-tune it on the labeled tweets. In addition, for the different subtasks, we leverage different strategies to enhance the classic fine-tuned model. Experimental results demonstrate the effectiveness of our method. In the future, we will try further strategies to achieve better results on humor analysis for Spanish tweets.

Acknowledgements

This work was supported by the National Social Science Foundation of China (No. 17CTQ045), the Soft Science Research Project of Guangdong Province (No. 2019A101002108), the Science and Technology Program of Guangzhou (No. 202002030227), the National Natural Science Foundation of China (No. 61572145) and the Key Field Project for Universities of Guangdong Province (No. 2019KZDZX1016). The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

References

1. Chiruzzo, L., Castro, S., Góngora, S., Rosá, A., Meaney, J. A., Mihalcea, R.: Overview of HAHA at IberLEF 2021: Detecting, Rating and Analyzing Humor in Spanish. Procesamiento del Lenguaje Natural, 67 (2021).
2. Barbieri, F., Saggion, H.: Automatic Detection of Irony and Humour in Twitter. In: Proceedings of the International Conference on Computational Creativity, pp. 155-162 (2014).
3. Mihalcea, R., Strapparava, C.: Making computers laugh: Investigations in automatic humor recognition. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 531-538 (2005).
4. Zhang, D., Song, W., Liu, L., Du, C., et al.: Investigations in automatic humor recognition. In: 2017 10th International Symposium on Computational Intelligence and Design, pp. 272-275. IEEE (2017).
5. Potash, P., Romanov, A., Rumshisky, A.: SemEval-2017 Task 6: #HashtagWars: Learning a sense of humor. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 49-57. Association for Computational Linguistics, Vancouver, Canada (2017).
6. Shahaf, D., Horvitz, E., Mankoff, R.: Inside jokes: Identifying humorous cartoon captions. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1065-1074 (2015).
7. Castro, S., Chiruzzo, L., Rosá, A., et al.: A crowd-annotated Spanish corpus for humor analysis. CoRR (2017).
8. Ortega-Bueno, R., Muniz-Cuza, C. E., Pagola, J. E. M.: UO UPV: Deep linguistic humor detection in Spanish social media. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages, co-located with the 34th Conference of the Spanish Society for Natural Language Processing, pp. 204-213 (2018).
9. Devlin, J., Chang, M. W., Lee, K., et al.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, pp. 4171-4186 (2019).
10. Gururangan, S., Marasović, A., Swayamdipta, S., et al.: Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. In: Proceedings of ACL, pp. 8342-8360 (2020).
11. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. CoRR (2015).
12. Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794 (2016).
13. Ke, G., Meng, Q., Finley, T., et al.: LightGBM: A highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems 30, pp. 3146-3154 (2017).
14. Lee, D.: Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. In: ICML 2013 Workshop: Challenges in Representation Learning, pp. 1-6 (2013).
15. Shi, W., Gong, Y., Ding, C., et al.: Transductive semi-supervised deep learning using min-max features. In: Proceedings of the European Conference on Computer Vision, pp. 299-315 (2018).
16. Cañete, J., Chaperon, G., Fuentes, R., Ho, J., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: Proceedings of PML4DC at ICLR 2020 (2020).