<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Humor Analysis Based on Human Annotation Challenge at IberLEF 2019: First-place Solution</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>160</fpage>
      <lpage>164</lpage>
      <abstract>
<p>This paper describes the winning solution to the Humor Analysis based on Human Annotation (HAHA) task at IberLEF 2019. The main classification task is solved using an ensemble of a fine-tuned multilingual BERT (Bidirectional Encoder Representations from Transformers) model and a naive Bayes classifier. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain.</p>
      </abstract>
      <kwd-group>
<kwd>Natural Language Processing</kwd>
        <kwd>Deep Neural Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
The Humor Analysis based on Human Annotation (HAHA) challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposes
two tasks: to classify tweets in Spanish as humorous or not, and rate how funny
they are on a given scale. This paper describes the winning solution for both of
these tasks.
      </p>
      <p>
The main classification task is solved using an ensemble of a fine-tuned
multilingual BERT (Bidirectional Encoder Representations from Transformers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ])
model and a naive Bayes classifier. The solution achieves an F1 score of 0.821,
with the second-place team scoring 0.816.
      </p>
      <p>The regression task is also solved by fine-tuning a multilingual BERT model.
The final submission is a weighted average of the regression BERT model and
a LightGBM model (https://github.com/microsoft/LightGBM) estimated on
TFIDF features. The solution achieves a root-mean-square error (RMSE)
of 0.736, with the second-place team achieving an RMSE of 0.746.</p>
      <p>
        The rest of the paper is organised as follows: Section 2 describes the challenge, Section 3 describes the solution for each of the tasks, and Section 4 concludes.
      </p>
      <p>
        The challenge asks participants to classify tweets in Spanish as humorous or not, and to rate how funny they are on a scale from one (not humorous) to five. The dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
is a corpus of crowd-annotated Spanish-language tweets split into a train and a
test set. The train set consists of 24,000 tweets, of which 38.6% are considered
humorous, with an average rating of 2.05. The test set comprises 6,000 tweets for
which only the text is given. There are two tasks:
      </p>
      <p>Humour Detection: the goal is to classify tweets into jokes (humour intended
by the author) and non-jokes. Performance is measured using the F1 score.</p>
      <p>Funniness Score Prediction: the goal is to predict a funniness score (the average
of crowd-sourced ratings) for a tweet, supposing it is a joke. Performance is
measured using the root-mean-square error.</p>
    </sec>
    <sec id="sec-2">
      <title>Solution Description</title>
      <p>For every model the final predictions were obtained by averaging predictions
from each fold of a five-fold cross-validation scheme; in other words, by
averaging predictions from five models, each estimated on 80% of the data. The
validation scores reported below are F1 scores and RMSEs calculated on
out-of-fold predictions for the full train set.</p>
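<p>The averaging scheme can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the fit_predict callback is a hypothetical stand-in for training any of the models described below.</p>

```python
import numpy as np

def five_fold_average(X, y, X_test, fit_predict, n_folds=5, seed=0):
    """Average test-set predictions over models trained on each fold of a
    K-fold split, and collect out-of-fold predictions for the train set.

    fit_predict(X_train, y_train) must return a function that maps a
    feature matrix to predictions."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    oof = np.zeros(len(X))             # out-of-fold predictions, for validation scores
    test_pred = np.zeros(len(X_test))
    for val_idx in folds:
        train_idx = np.setdiff1d(idx, val_idx)        # each model sees 80% of the data
        predict = fit_predict(X[train_idx], y[train_idx])
        oof[val_idx] = predict(X[val_idx])
        test_pred += predict(X_test) / n_folds        # average over the five fold models
    return oof, test_pred
```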
      <p>Classification task. The results are summarised in Table 1. The baseline provided by the organisers
labels a tweet as a joke at random with probability 0.5.</p>
      <p>
        The main model used in the final solution is a fine-tuned multilingual cased
BERT model (12-layer, 768-hidden, 12-heads, 110M parameters) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that
supports 104 languages, including Spanish. We use the PyTorch implementation by
HuggingFace1, which also provides pretrained weights and the vocabulary, and build on
top of it. We tokenize the text with basic tokenization followed by WordPiece
tokenization, following the original implementation, and apply no other
preprocessing to the text.
      </p>
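<p>WordPiece tokenization splits each word by greedy longest-match-first lookup against the model vocabulary, prefixing continuation pieces with "##". A minimal sketch of the algorithm, using a toy vocabulary rather than BERT's actual 104-language vocabulary:</p>

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece: repeatedly take the longest
    vocabulary entry that matches a prefix of the remaining characters;
    pieces after the first are prefixed with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation piece
            if piece in vocab:
                cur = piece
                break
            end -= 1                      # shrink the candidate and retry
        if cur is None:
            return [unk]                  # no piece matched: emit unknown token
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary for illustration only
vocab = {"jug", "##ar", "##ando", "hola"}
```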
      <p>
        For the classification task we use a binary cross-entropy loss and a one-cycle
learning-rate schedule [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Given the small dataset and the large model capacity,
overfitting is a major issue; to combat it we follow [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and apply differential
learning rates across the layers, with the maximum learning rate for the first layer
set at half of the maximum learning rate for the last layer, the classification
head (set at 2e-5). The batch size is 32; we train the model on each fold
for four epochs and use the checkpoint from the epoch with the best validation
F1 score for the test-set predictions. The differential learning rates and the one-cycle
learning-rate schedule are implemented using the FastAI library2. This base model
achieves a score of 0.818 on cross-validation and 0.807 on the test set.
1 https://github.com/huggingface/pytorch-pretrained-BERT
2 https://docs.fast.ai/
      </p>
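<p>The two learning-rate devices can be illustrated in pure Python. The geometric spacing between layer groups is an assumption (the paper only pins the first layer at half the head's rate), and the linear warmup/decay shape is a simplification of the one-cycle policy of [4]:</p>

```python
def layer_lrs(n_groups, head_lr=2e-5, first_frac=0.5):
    """Per-group maximum learning rates, geometrically spaced so that the
    first (lowest) layer gets first_frac * head_lr and the head gets head_lr."""
    if n_groups == 1:
        return [head_lr]
    ratio = first_frac ** (1.0 / (n_groups - 1))
    return [head_lr * ratio ** (n_groups - 1 - i) for i in range(n_groups)]

def one_cycle_lr(step, total_steps, max_lr, pct_warmup=0.3):
    """Simplified one-cycle schedule: linear warmup to max_lr,
    then linear decay back towards zero."""
    warm = int(total_steps * pct_warmup)
    if step < warm:
        return max_lr * step / max(warm, 1)
    return max_lr * (total_steps - step) / max(total_steps - warm, 1)
```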
      <p>
        To further reduce overfitting we fine-tune the pretrained weights using
unsupervised learning on text data from both the train and test sets. The idea
comes from the ULMFiT paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where unsupervised fine-tuning on text
from the same domain as the target task before the classification step significantly
improved results. However, while in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] the unsupervised task is predicting the
next word, here we use the tasks used for training the original BERT language
model: a combination of masked language modelling and next-sentence prediction
losses, again by adapting the HuggingFace implementation to this task. We
fine-tune the language model on the domain for ten epochs, use the obtained
weights in place of the original pretrained weights for the classification task, and
repeat the steps above. This model achieves 0.829 on cross-validation and 0.815
on the test set, a significant improvement over the base model.
      </p>
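<p>The masked-language-modelling corruption follows the original BERT recipe: roughly 15% of positions are selected for prediction, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged. A self-contained sketch with a toy vocabulary (not the HuggingFace implementation actually used):</p>

```python
import random

MASK = "[MASK]"
VOCAB = ["hola", "que", "risa", "jaja", "tweet"]  # toy vocabulary for illustration

def mask_tokens(tokens, rng, mask_prob=0.15):
    """BERT-style MLM corruption. Returns the corrupted sequence and
    per-position labels (None where no prediction is required)."""
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # model must predict the original token
            r = rng.random()
            if r < 0.8:
                out.append(MASK)          # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                out.append(tok)           # 10%: keep the token unchanged
        else:
            labels.append(None)
            out.append(tok)
    return out, labels
```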
      <p>
        We then average these predictions with predictions from a naive Bayes model
following [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: a logistic regression3 estimated on uni- and bi-gram features
weighted using TFIDF.
      </p>
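<p>The naive Bayes component of [5] is built on the log-count ratio: a smoothed ratio of per-class feature counts whose log reweights the features before the logistic regression. A numpy sketch of that construction:</p>

```python
import numpy as np

def nb_log_count_ratio(X, y, alpha=1.0):
    """Naive-Bayes log-count ratio r from Wang & Manning (2012):
    smoothed feature counts for the positive vs. negative class are
    normalised and log-divided. Multiplying the feature matrix by r
    before fitting a logistic regression gives the 'NB-SVM' style model."""
    p = alpha + X[y == 1].sum(axis=0)   # smoothed positive-class counts
    q = alpha + X[y == 0].sum(axis=0)   # smoothed negative-class counts
    return np.log((p / p.sum()) / (q / q.sum()))
```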
      <p>The naive Bayes model alone achieves only 0.771 on cross-validation; however, an ensemble
of the BERT model above and the naive Bayes model, with weights 0.72 and 0.28
respectively, scores 0.833 on cross-validation and 0.821 on the leaderboard. The
optimal weights are derived from cross-validation. This ensemble was the final
solution for the classification task.</p>
      <p>Regression task. The results are summarised in Table 2. Here the baseline is a constant prediction
of 3 assigned to all items in the test set. For the regression task we also ensemble two
models: a fine-tuned BERT model and a LightGBM model based on
gradient-boosted trees.</p>
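<p>One way to derive the ensemble weights from cross-validation is a grid search over convex combinations of the two models' out-of-fold predictions, maximising F1. The exact search procedure is not specified in the paper, so the following is an assumption:</p>

```python
import numpy as np

def f1(y_true, y_prob, thresh=0.5):
    """F1 score for probability predictions binarised at a threshold."""
    y_pred = (y_prob >= thresh).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return 2 * tp / max(2 * tp + fp + fn, 1)

def best_blend_weight(y_true, oof_a, oof_b, grid=np.linspace(0, 1, 101)):
    """Pick the weight w for the blend w*a + (1-w)*b that maximises F1
    on out-of-fold predictions."""
    scores = [f1(y_true, w * oof_a + (1 - w) * oof_b) for w in grid]
    return grid[int(np.argmax(scores))]
```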
      <p>The only difference between the BERT model for the regression task and
the BERT model for the classification task is the loss function: for the former we
use a mean-squared loss, given that the challenge metric is RMSE. This model
achieves an RMSE of 0.726 on cross-validation and 0.746 on the test set.
3 Based on the code from https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline</p>
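<p>For reference, the RMSE metric used to score this task is simply the square root of the mean-squared loss the regression head is trained on:</p>

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error, the challenge metric for the funniness task."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```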
      <p>The second model is a LightGBM model estimated on the same features as
the naive Bayes model above. The loss function is also the mean-squared error, and
we set the bagging and feature-fraction parameters to 0.7 to add regularisation. This
model scores 0.795 on cross-validation.</p>
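<p>A hypothetical LightGBM parameter dictionary matching this description; bagging_fraction, bagging_freq, and feature_fraction are real LightGBM parameters, while the remaining choices (such as bagging_freq=1) are assumptions not stated in the paper:</p>

```python
# Sketch of the regularised LightGBM configuration described above.
lgb_params = {
    "objective": "regression",   # mean-squared-error loss
    "metric": "rmse",            # matches the challenge metric
    "bagging_fraction": 0.7,     # row subsampling, as described
    "bagging_freq": 1,           # assumption: bagging applied every iteration
    "feature_fraction": 0.7,     # column subsampling, as described
}
# Training would then look like (requires the lightgbm package):
# model = lightgbm.train(lgb_params, lightgbm.Dataset(X_tfidf, y))
```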
      <p>For the final solution we linearly average the predictions from these two models,
with a weight of 0.71 assigned to the prediction from the BERT model and 0.29
to the LightGBM model, the weights again coming from cross-validation. This
ensemble scores 0.712 on cross-validation and 0.736 on the test set.</p>
      <p>This paper describes the winning solution for both the classification and regression
tasks of the Humor Analysis based on Human Annotation challenge at IberLEF
2019, which consists of an ensemble of a fine-tuned BERT model and a
complementary model estimated on TFIDF features derived from uni- and bi-grams.</p>
      <p>Firstly, we can see that a high score can be achieved solely by appropriately
fine-tuning a BERT model. In fact, that model alone would have ranked second in this
competition, and with a longer hyperparameter search or longer language-model
fine-tuning, possibly first.</p>
      <p>Secondly, unsupervised learning can help reduce overfitting and improve
the score significantly when the dataset is small.</p>
      <p>And, finally, adding weaker but diverse models to the ensemble
helps boost the score, as demonstrated above.</p>
      <p>Further work can be done both on improving the quality of the predictions
and on studying how the models arrive at them. For instance, it would be interesting
to examine predictions from different layers of the neural network to study where
attention falls and which token combinations make the model decide whether
a tweet is humorous or not.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moncecchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A Crowd-Annotated Spanish Corpus for Humor Analysis</article-title>
          .
          <source>Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media</source>
          ,
          <fpage>7</fpage>
          -
          <lpage>11</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etcheverry</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prada</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          : Overview of HAHA at IberLEF 2019:
          <article-title>Humor Analysis based on Human Annotation</article-title>
          .
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ).
          <source>CEUR Workshop Proceedings</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . arXiv preprint arXiv:1801.06146 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Baselines and bigrams: Simple, good sentiment and topic classification</article-title>
          .
          <source>Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2</source>
          ,
          <fpage>90</fpage>
          -
          <lpage>94</lpage>
          . Association for Computational Linguistics. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>