<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HAHA@IberLEF2021: Humor Analysis using Ensembles of Simple Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karish Grover</string-name>
          <email>karish19471@iiitd.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanishq Goel</string-name>
          <email>tanishq.goel@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indraprastha Institute of Information Technology</institution>
          ,
          <addr-line>Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2021</year>
      </pub-date>
      <abstract>
<p>This paper describes the system submitted to the Humor Analysis based on Human Annotation (HAHA) task at IberLEF 2021. The system achieves the winning F1 score of 0.8850 in the main task of binary classification (Task 1), utilizing an ensemble of a pre-trained multilingual BERT, a pre-trained Spanish BERT (BETO), RoBERTa, and a Naive Bayes classifier. We also achieve second place with macro F1 scores of 0.2916 and 0.3578 in the multi-class classification and multi-label classification tasks, respectively, and third place with an RMSE score of 0.6295 in the regression task.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Ensemble Learning</kwd>
<kwd>Humor Classification</kwd>
        <kwd>Pre-trained Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Humor Analysis based on Human Annotation (HAHA) 2021 [
        <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a challenge
that aims to classify Spanish tweets as humorous or not, and to further analyze
the humor by determining the characteristics of the tweets that contribute
to it. The challenge proposes four tasks: classifying tweets as humorous or not,
rating the humor present in the tweets, multi-class classification to find the humor
mechanism, and multi-label classification to find the humor target.
      </p>
      <sec id="sec-1-1">
        <title>Voting and Ensemble Learning</title>
        <p>
          A voting ensemble combines the predictions of multiple machine learning models.
Such ensembles have been utilized in various domains, ranging from early diabetes
prediction and heart disease prediction [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to NLP tasks such as Named Entity Recognition.
        </p>
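        <p>As a minimal illustration of hard voting (a scikit-learn sketch with toy models,
not the transformer ensemble used in our system), each member casts one vote and
the majority label wins:</p>
        <preformat># Sketch: a hard-voting classification ensemble with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))</preformat>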
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>We were provided with a corpus of crowd-annotated tweets separated into three
subsets: training (24,000 tweets), development (6,000 tweets), and testing (6,000
tweets).</p>
      <p>The columns present in the corpus utilized for training and testing are as follows:</p>
      <p>- text: the text of the tweet.</p>
      <p>- is-humor: a binary value (0 or 1) indicating whether the tweet is humorous.</p>
      <p>- humor-rating: a real value (between 1 and 5) representing the average score
the annotators gave to the tweet.</p>
      <p>- humor-mechanism: a label for the humor mechanism. Only a subset of the
tweets have the humor mechanism annotated.</p>
      <p>- humor-target: zero or more labels for the humor target, separated by ";".</p>
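      <p>As a minimal sketch, such a corpus can be loaded and inspected with pandas (the
file name below is a placeholder assumption; the column names follow the task
description above):</p>
      <preformat># Sketch: loading and inspecting the crowd-annotated corpus.
# The file name "haha_train.csv" is a placeholder, not the official one.
import pandas as pd

train = pd.read_csv("haha_train.csv")
print(train[["text", "is-humor", "humor-rating"]].head())
print(train["is-humor"].value_counts())  # class balance of humorous vs. not</preformat>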
    </sec>
    <sec id="sec-3">
      <title>Task Description</title>
      <p>This challenge (https://www.fing.edu.uy/inco/grupos/pln/haha/) proposes four sub-tasks, which are as follows:</p>
      <p>Humor Detection: The main aim is to classify whether a tweet is humorous.</p>
      <p>Funniness Score Prediction: A regression task which aims to rate a tweet
in terms of humor.</p>
      <p>Humor Mechanism Classification: A multi-class classification task with
the primary goal of predicting the mechanism by which the tweet conveys humor.</p>
      <p>Humor Target Classification: A multi-label classification task which aims
at exploring the content of the joke based on its target.</p>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>We have released our code (https://github.com/TanishqGoel/HAHA-IberLEF2021_Jocoso)
and experiments for easy replication. All the
following models were fine-tuned using the AdamW optimizer, with a learning rate
of 4e-5 and a batch size of 8. The models were trained on an NVIDIA Tesla
T4 GPU.
The results for this task are summarized in Table 2. The baseline provided by
the organizers for this task uses Naive Bayes with TFIDF features for binary
classification of tweets, achieving an F1 score of 0.6619 over the testing corpus.</p>
      <p>In the final solution, we tried a series of ensembles of pre-trained models. We
use the Simple Transformers classification model, ClassificationModel, which
wraps a pre-trained model for this task of binary classification.</p>
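      <p>As a minimal sketch (the checkpoint name, toy data, and CPU usage are
illustrative assumptions), a Simple Transformers ClassificationModel can be
configured with the hyperparameters above as follows:</p>
      <preformat># Sketch: fine-tuning a binary classifier with Simple Transformers.
# Hyperparameters follow the paper: AdamW (the library's default optimizer),
# learning rate 4e-5, batch size 8, 3 epochs.
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Tiny stand-in for the 24,000-tweet training corpus.
train_df = pd.DataFrame(
    {"text": ["un tweet humoristico", "un tweet serio"], "labels": [1, 0]}
)

model_args = {
    "learning_rate": 4e-5,
    "train_batch_size": 8,
    "num_train_epochs": 3,
}
model = ClassificationModel(
    "bert",
    "bert-base-multilingual-cased",  # e.g., mBERT; other members are swapped in here
    num_labels=2,
    args=model_args,
    use_cuda=False,  # set True on the Tesla T4
)
model.train_model(train_df)
predictions, raw_outputs = model.predict(["otro tweet"])</preformat>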
      <sec id="sec-4-1">
        <title>Task 1 : Binary Classification</title>
        <p>
          The final model is based on hard voting in an ensemble of 5 models:
multilingual cased BERT (mBERT) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which was pre-trained on 104 languages
including Spanish; BETO [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which is a BERT model pre-trained on a large
Spanish corpus [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]; ALBERT, which was pre-trained on English
with a masked language modeling (MLM) objective; a variant of the BETO model
fine-tuned for sentiment analysis (sBETO), trained on the TASS 2020 corpus
(around 5,000 tweets) covering several dialects of Spanish; and RoBERTa base,
which is a model pre-trained on a large corpus of English data in a self-supervised
fashion. Finally, we use a Multinomial Naive Bayes classifier with TFIDF features.
We use the TensorFlow implementations available on the Hugging Face hub
(https://huggingface.co/models). All the models were fine-tuned for 3 epochs and took
approximately 18-20 minutes for the complete training process per model.
        </p>
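        <p>The Multinomial Naive Bayes member can be sketched as a standard scikit-learn
pipeline (vectorizer settings are default assumptions, not the tuned configuration):</p>
        <preformat># Sketch: the Multinomial Naive Bayes ensemble member with TFIDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["un tweet humoristico", "un tweet serio"]
labels = [1, 0]

nb_member = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_member.fit(texts, labels)
print(nb_member.predict(["otro tweet"]))  # 0/1 vote fed into the ensemble</preformat>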
        <p>While training our models on the given 24,000 tweets, we observed that
BETO outperforms all the other pre-trained models. We experimented with
various hard-voting ensembles built from these pre-trained models. We used a
90:10 split of the training corpus without any preprocessing, as we observed that
preprocessing reduces the F1 score. We solved this problem with a classification
voting ensemble, predicting the results based on the majority vote of the
contributing models (preference is given to BETO and multilingual BERT, which
have high individual F1 scores).</p>
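        <p>One plausible reading of this scheme is plain majority voting with ties deferred
to the strongest individual models; the tie-breaking rule below is an assumption,
not necessarily the exact preference mechanism used:</p>
        <preformat># Sketch: hard voting over per-model binary predictions, breaking ties
# in favor of BETO and mBERT (the models with the highest individual F1).
from collections import Counter

def hard_vote(preds_by_model, preferred=("beto", "mbert")):
    # preds_by_model: dict mapping model name to a list of 0/1 predictions.
    n_examples = len(next(iter(preds_by_model.values())))
    final = []
    for i in range(n_examples):
        votes = Counter(preds[i] for preds in preds_by_model.values())
        ranked = votes.most_common()
        if len(ranked) == 1 or ranked[0][1] != ranked[1][1]:
            final.append(ranked[0][0])  # clear majority
        else:
            # Tie: defer to the preferred (highest-F1) models.
            tie = Counter(preds_by_model[m][i] for m in preferred)
            final.append(tie.most_common(1)[0][0])
    return final

preds = {"beto": [1, 0], "mbert": [1, 1], "roberta": [0, 1], "nb": [0, 0]}
print(hard_vote(preds))</preformat>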
        <sec id="sec-4-1-1">
          <title>Task 2 : Regression</title>
          <p>The results for this task are summarized in Table 3. Here the baseline is SVM
with TFIDF features which achieves an RMSE of 0.6704 over the test corpus.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>5 https://huggingface.co/models</title>
        <p>2 We observe that preprocessing reduces the F1 score.</p>
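        <p>The organizers' baseline corresponds to a standard support-vector regression
over TFIDF features; a generic sketch of that idea (not the official baseline code)
is:</p>
        <preformat># Sketch: SVM-with-TFIDF regression baseline (general idea only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

texts = ["muy gracioso", "nada gracioso"]
scores = [4.5, 1.0]  # funniness ratings between 1 and 5

baseline = make_pipeline(TfidfVectorizer(), SVR())
baseline.fit(texts, scores)
print(baseline.predict(["otro tweet"]))</preformat>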
        <p>
          In this task, we tried a series of ensembles of pre-trained models, and the
results are predicted using regression voting. We combine each model with a
regression head. Our ensemble comprises six pre-trained models: multilingual
base cased BERT (mBERT), ALBERT base v2, RoBERTa base, DistilBERT base cased [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], BETO [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and XLNet
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] base cased, followed by regression voting. All the models were
fine-tuned for 3 epochs and took approximately 10 minutes for the complete training
process per model.
        </p>
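        <p>A minimal sketch of this setup (checkpoints, toy data, and the use of a plain
mean as the regression vote are assumptions): Simple Transformers exposes a
regression head by setting num_labels=1 with the regression flag, and the members'
predicted scores can then be averaged:</p>
        <preformat># Sketch: regression voting -- average the funniness scores predicted
# by each fine-tuned member of the ensemble.
import numpy as np
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame(
    {"text": ["muy gracioso", "nada gracioso"], "labels": [4.5, 1.0]}
)
args = {
    "regression": True,
    "learning_rate": 4e-5,
    "train_batch_size": 8,
    "num_train_epochs": 3,
}

members = [("bert", "bert-base-multilingual-cased"),
           ("distilbert", "distilbert-base-cased")]
member_preds = []
for model_type, model_name in members:
    model = ClassificationModel(
        model_type, model_name, num_labels=1, args=args, use_cuda=False
    )
    model.train_model(train_df)
    preds, _ = model.predict(["un tweet de prueba"])
    member_preds.append(preds)

final_score = np.mean(member_preds, axis=0)  # the regression vote
print(final_score)</preformat>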
        <sec id="sec-4-2-1">
          <title>Task 3 : Multi-Class Classi cation</title>
        <p>The results of Task 3 are summarized in Table 4. The baseline provided by the
organizers for Task 3 achieves a macro F1 score of 0.1001 over the training
corpus; it is based on Naive Bayes with TFIDF features.</p>
          <p>
            Our model, with a macro F1 score of 0.2916, utilizes BETO [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] to solve this
multi-class classification problem. We fine-tuned the model over the training
corpus, which comprises approximately 4,800 tweets for this task. All the models were
fine-tuned for 3 epochs and took approximately 4-5 minutes for the complete
training process per model.
          </p>
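          <p>A minimal sketch of the multi-class setup (the BETO checkpoint name and the
number of mechanism classes are assumptions for illustration):</p>
          <preformat># Sketch: multi-class humor-mechanism classification with BETO.
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame(
    {"text": ["juego de palabras", "situacion absurda"], "labels": [0, 1]}
)
model = ClassificationModel(
    "bert",
    "dccuchile/bert-base-spanish-wwm-cased",  # a BETO checkpoint on the Hugging Face hub
    num_labels=12,  # assumed number of mechanism classes, for illustration only
    args={"learning_rate": 4e-5, "train_batch_size": 8, "num_train_epochs": 3},
    use_cuda=False,
)
model.train_model(train_df)</preformat>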
        </sec>
        <sec id="sec-4-2-2">
          <title>Task 4 : Multi-Label Classi cation</title>
        <p>We use the MultiLabelClassificationModel from Simple Transformers for
this task. Our final system comprises a pre-trained Spanish BERT cased
model, which is fine-tuned for 4 epochs on approximately 2,000 tweets. It took
approximately 5 minutes for the complete training process per model. Various
ensembles and their results are listed in the above table.</p>
        <p>Combining BETO Cased and Uncased: the BETO model classifier outputs
softmax probabilities for all the classes. We choose the top 3 classes, i.e., the
classes with the highest probabilities, for both models. Next, from these 6 classes,
we choose the class which appears the maximum number of times as the final
prediction.</p>
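        <p>A minimal sketch of the multi-label setup and of the cased/uncased combination
described above (checkpoint name and label count are assumptions; the helper pools
the top-3 classes from each model and returns the most frequent one):</p>
        <preformat># Sketch: multi-label humor-target classification, plus the top-3
# pooling used to combine BETO cased and uncased predictions.
import numpy as np
from collections import Counter
from simpletransformers.classification import MultiLabelClassificationModel

model = MultiLabelClassificationModel(
    "bert",
    "dccuchile/bert-base-spanish-wwm-cased",  # cased BETO checkpoint (assumed)
    num_labels=15,  # assumed number of target labels, for illustration only
    args={"learning_rate": 4e-5, "train_batch_size": 8, "num_train_epochs": 4},
    use_cuda=False,
)

def combine_top3(probs_cased, probs_uncased):
    # probs_*: per-class probabilities from the cased and uncased models.
    top_cased = np.argsort(probs_cased)[-3:]
    top_uncased = np.argsort(probs_uncased)[-3:]
    pooled = Counter(list(top_cased) + list(top_uncased))
    return pooled.most_common(1)[0][0]  # class appearing the most times wins</preformat>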
        </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper describes the winning solution for Task 1, the second-place solutions
for Task 3 and Task 4, and the third-place solution for Task 2 in the evaluation
phase of the Humor Analysis based on Human Annotation (HAHA) challenge
at the Iberian Languages Evaluation Forum (IberLEF) 2021. During the
development phase, our models achieved first place in all four tasks. The combined
results for both phases are mentioned in Table 4.</p>
      <p>In all the tasks, we tried to exploit the power of voting ensembles to obtain
strong results. For Task 1, six of our ensemble models outperform the second-
and third-place solutions. Similarly, in the other tasks, our models outperform the
next-best solutions by a large margin.</p>
      <p>Further work can be done on preprocessing the Spanish tweets to analyze
the effects of various preprocessing methods on humor prediction; such preprocessing
includes cleaning, tokenizing, and parsing URLs, hashtags, mentions, reserved
words (RT, FAV), emojis, and smileys (a sample preprocessor can be found at
https://pypi.org/project/tweet-preprocessor/). An interesting approach is the
translation of Spanish tweets to English and back to Spanish (i.e., back-translation)
as a preprocessing method, which is a domain open for further experimentation
and research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Luis Chiruzzo, Santiago Castro, Santiago Góngora, Aiala Rosá, J. A. Meaney, and Rada Mihalcea. Overview of HAHA at IberLEF 2021: Detecting, Rating and Analyzing Humor in Spanish. Procesamiento del Lenguaje Natural, 67(0), 2021.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Peng-Yu Chen and Von-Wun Soo. Humor recognition using deep learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 113-117, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Minghan Wang, Hao Yang, Ying Qin, Shiliang Sun, and Yao Deng. Unified humor detection based on sentence-pair augmentation and transfer learning. In EAMT, 2020.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Orion Weller and Kevin Seppi. Humor detection: A transformer gets the last laugh. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3621-3625, Hong Kong, China, November 2019. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Adilzhan Ismailov. Humor analysis based on human annotation challenge at IberLEF 2019: First-place solution. In IberLEF@SEPLN, 2019.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Issa Annamoradnejad and Gohar Zoghi. ColBERT: Using BERT sentence embedding for humor detection, 2021.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Khalid Raza. Chapter 8 - Improving the prediction accuracy of heart disease with ensemble learning and majority voting rule. In Nilanjan Dey, Amira S. Ashour, Simon James Fong, and Surekha Borra, editors, U-Healthcare Monitoring Systems, Advances in Ubiquitous Sensing Applications for Healthcare, pages 179-196. Academic Press, 2019.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations, 2021.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Jose Can~ete, Gabriel Chaperon, Rodrigo Fuentes,
          <string-name>
            <surname>Jou-Hui</surname>
            <given-names>Ho</given-names>
          </string-name>
          , Hojin Kang, and
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Perez</surname>
          </string-name>
          .
          <article-title>Spanish pre-trained bert model and evaluation data</article-title>
          .
          <source>In PML4DC at ICLR</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding, 2020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>