Applying Pre-trained Models and Fine-tuning to Conduct Humor Analysis on Spanish Tweets

Yongyi Kui
Information Institute of Yunnan University, Yunnan, China
3964438@qq.com

Abstract. This paper describes in detail the four subtasks of HAHA@IberLEF 2021 [7]: Humor Analysis based on Human Annotation. Subtask 2 is a regression problem, and the other three subtasks are text classification problems. The data comes from the Twitter social platform, and the language is Spanish. The classification problems are solved mainly by combining the Multilingual Bert model and the LSTM model, and the regression problem is solved with the GPT-2 model. According to the official evaluation results, the method proposed in this paper ranks fourth, eighth, fifth, and sixth on the four subtasks, respectively. The code for this task has been uploaded to GitHub (kuiyongyi) for reference.

Keywords: Spanish Text Classification · Humor Analysis · Pre-trained Model · Fine-tuning.

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Humor is a very common phenomenon in human communication. It is relatively easy for humans to judge whether a text is humorous, but a computer can detect whether a text is humorous only after learning feature information from a large corpus [14]. Humor detection has been an active field for many years. SemEval-2015 Task 11 addressed the influence of figurative language such as metaphor and irony on sentiment analysis. SemEval-2017 Task 6 required predicting the ranking of humorous tweets about comedy shows. Both IberEVAL 2018 and IberLEF 2019 included two subtasks: humor detection and funniness score prediction. Castro et al. [6] built a corpus of annotated tweets, asking annotators to judge subjectively which tweets are humorous, and then trained a supervised humor classifier for Spanish tweets. Barbieri and Saggion [2] proposed a machine learning method based on a set of language-driven features. Radev et al. [15] described a method for humor detection in cartoon captions. Yang et al. [16] constructed different classifiers over feature sets to detect humor. In recent years, many systems have combined Pre-trained Models with fine-tuning [9] for humor detection.

The rest of the paper is organized as follows. Section 2 gives an overview of the task and the data. Section 3 describes the models used in this task. Section 4 describes the experiments for the four subtasks. Section 5 gives the test results, and Section 6 concludes.

2 Task and Data Description

2.1 Subtasks

HAHA@IberLEF 2021 includes four subtasks, which aim to predict the humor class or the degree of humor of a text. The four subtasks are defined as follows (a small illustration of their evaluation metrics follows the list):

Subtask 1: Given a Spanish tweet, predict whether it is humorous. This is a binary classification problem evaluated with the F1 score.

Subtask 2: Predict the degree of humor of the tweets judged to be humorous. This is a regression task evaluated with the root mean square error (RMSE).

Subtask 3: Predict, for each tweet, one humor mechanism category out of the twelve given mechanisms. This is a multi-class classification problem evaluated with the Macro-F1 score.

Subtask 4: Given tweets and fifteen humor target tags, predict the corresponding target tags for each tweet (at least zero and at most fifteen tags). This is a multi-label text classification task, also evaluated with Macro-F1.
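As a small illustration of these evaluation metrics, the sketch below computes them with scikit-learn on toy labels. The use of scikit-learn and the toy values are assumptions for illustration only; they are not part of the official evaluation scripts.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Subtask 1: binary F1 on the humorous class (toy labels).
f1 = f1_score([1, 0, 1, 1], [1, 0, 0, 1])

# Subtask 2: root mean square error on the humor ratings (toy values).
rmse = np.sqrt(mean_squared_error([2.5, 3.0, 4.2], [2.8, 3.1, 3.9]))

# Subtasks 3 and 4: macro-averaged F1 over the mechanism / target labels.
macro_f1 = f1_score([0, 2, 1, 2], [0, 1, 1, 2], average="macro")

print(f1, rmse, macro_f1)
```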
2.2 Dataset

The Dataset [8] provided by HAHA@IberLEF 2021 includes three subsets: haha 2021 train (24,000 tweets), haha 2021 dev (6,000 tweets), and haha 2021 test (6,000 tweets). Each record in haha 2021 train consists of twelve columns: the id, the tweet text corresponding to each id (including punctuation and emoticons), the number of non-humorous votes, the number of votes for each of the five humor levels, is humor, humor rating, humor mechanism, and humor target.

2.3 Data cleaning

Since the text in the HAHA Dataset comes from the Twitter social platform, the length of the texts varies, most texts are quite short, and the content and format of the tweets are informal (including spelling errors and repeated characters or words). It is necessary to clean the tweets in the Dataset so that the Language Model can better predict masked content from context and semantics. The data cleaning steps used in this paper are as follows (a small sketch follows the list):

– Replace characters or words repeated three or more times in a tweet with a single character or word;
– Delete the emoticons appearing in the text;
– Delete the HTML tags appearing in the text;
– Replace newline characters in the text with spaces.
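A minimal sketch of cleaning rules of this kind, implemented with regular expressions, is shown below; the exact patterns are illustrative assumptions rather than the code of the submitted system.

```python
import re

def clean_tweet(text: str) -> str:
    # Replace newline characters with spaces.
    text = text.replace("\n", " ")
    # Delete HTML tags such as <br> or <a ...>.
    text = re.sub(r"<[^>]+>", "", text)
    # Delete emoji / pictographic symbols (rough Unicode-range heuristic).
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    # Collapse a character repeated three or more times into a single one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Collapse a word repeated three or more times into a single occurrence.
    text = re.sub(r"\b(\w+)( \1){2,}\b", r"\1", text)
    # Normalize any leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()

# Example: returns "hola jaja".
print(clean_tweet("holaaaa jaja <br> \U0001F602"))
```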
3 System Description

In this section, I give an overview of the system that implements the four subtasks. The Humor Analysis based on Human Annotation task at IberLEF 2021 can be regarded as a set of text classification and regression problems: given an input text, predict one or more labels, or predict a humor score for the text.

One disadvantage of the LSTM model is that the input information is compressed into a vector whose length equals the number of LSTM memory units, so an LSTM cannot remember long texts well. However, the tweets in this task are very short, so I use an LSTM to further extract text features from the output of the Pre-trained Model.

3.1 Model

Since the text of the Dataset is Spanish, this paper uses three Cross-language [12] Pre-trained Models, Multilingual Bert [1], XLM, and XLM-RoBerta, as well as XLNet, Albert, and GPT-2, to extract text feature information. Each Pre-trained Model is briefly introduced below.

The Bert model uses Masked Language Modeling [11] to train bidirectional Transformers and generate deep bidirectional language representations. After the Pre-training phase, only an output layer needs to be added for fine-tuning. In this work, Bert-base-multilingual-uncased is used; it was pre-trained on a corpus of one hundred and two languages, including Spanish.

Albert uses the Transformer architecture with the GELU activation function. It employs techniques such as parameter sharing and embedding matrix factorization to reduce the number of model parameters, and it replaces the Next Sentence Prediction loss with a Sentence Order Prediction loss.

XLM is a Cross-language model. Like Bert, it is trained with Masked Language Modeling, and it additionally supports causal (next-token prediction) and translation language modeling objectives. In this task, XLM-mlm-tlm-xnli15-1024 is used; this model was trained on corpora in fifteen languages, including Spanish, for Cross-language sentence tasks.

XLM-RoBerta is a model trained on a corpus of one hundred different languages. Unlike XLM, it does not require lang tensors to know which language is used and should be able to determine the correct language from the input ids. In this task, XLM-RoBerta-base is used.

XLNet optimizes the Bert model in three main aspects: the Autoencoding Model [4] is replaced with an Autoregressive [10] Model, the Transformer used by Bert is replaced with Transformer-XL, and a dual-stream [13] attention mechanism is introduced. In the Pre-training phase, Bert's Next Sentence Prediction objective is discarded.

The GPT-2 model usually pads inputs on the right and was trained with a causal language modeling objective. GPT-2 and Bert are built from the decoder and encoder modules of the Transformer, respectively.

3.2 Parameter setting

In subtask 1, the optimizer is AdamW, and the model is trained for 50 epochs with a learning rate of 3e-5 and a batch size of 32. The weight decay is set to 1e-2, the maxlen parameter is set to 64, and the loss function is cross-entropy.

In subtask 2, the loss function is MSE loss, the learning rate is 1e-5, maxlen is 64, the batch size is 32, and validation is carried out after every sixty-four training steps.

In subtasks 3 and 4, the learning rate and weight decay are 5e-6 and 1e-2 for both; the number of epochs is 50 and 100, respectively; the loss functions are CategoricalCrossentropy and BinaryCrossentropy, respectively; and for both subtasks the dropout rate is set to 0.5 and the optimizer is Adam.

4 Experiments

4.1 Subtask 1

haha 2021 train is divided at a ratio of 4:1 into the Training Dataset and Validation Dataset of the models in subtask 1. The Training, Validation, and Testing Dataset sizes for subtask 1 are therefore 19,200, 4,800, and 6,000, respectively. The data is then cleaned, tokenized, and encoded before being input into the models.

In subtask 1, I first used three Pre-trained Models (Albert-base-v2, XLNet-base-cased, and Bert-base-multilingual-uncased) for binary text classification. The results show that the Multilingual Bert model obtains the highest score on this Spanish text classification problem, with an F1 score of 0.8712 on the Validation Dataset. I then added a 4-layer unidirectional LSTM network after this model to further extract text features and fed the LSTM output into a fully connected layer for classification (a rough sketch of this combined model follows Table 1). The combination achieves an F1 score of 0.8785 on the Validation Dataset, an increase of 0.0073 over the single Pre-trained Model. The predictions of this combined model are therefore the answer I finally submitted for subtask 1. The performance of these models on the Validation Dataset is shown in Table 1.

Table 1. Performance of the four models on the Validation Dataset of subtask 1, measured by F1 score.

Model                                    F1
XLNet-base-cased                         0.8611
Albert-base-v2                           0.8653
Bert-base-multilingual-uncased           0.8712
Bert-base-multilingual-uncased + LSTM    0.8785
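The combined model for subtask 1 can be sketched roughly as follows, assuming PyTorch and the Hugging Face transformers library; the LSTM hidden size and other details not stated in the paper are assumptions, so this illustrates the architecture rather than reproducing the submitted code.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertLstmClassifier(nn.Module):
    """Multilingual Bert encoder followed by a 4-layer unidirectional LSTM
    and a fully connected classification head (hidden size is an assumption)."""

    def __init__(self, num_labels: int = 2, lstm_hidden: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-multilingual-uncased")
        self.lstm = nn.LSTM(input_size=self.bert.config.hidden_size,
                            hidden_size=lstm_hidden,
                            num_layers=4,
                            batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # Token-level representations from the pre-trained encoder.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Further feature extraction with the LSTM; keep the final hidden state.
        _, (h_n, _) = self.lstm(hidden)
        return self.classifier(h_n[-1])

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = BertLstmClassifier()
batch = tokenizer(["Ejemplo de tuit"], padding="max_length", max_length=64,
                  truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```

The training setup of Section 3.2 (AdamW, learning rate 3e-5, batch size 32, cross-entropy loss) would then be applied on top of such a module.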
4.2 Subtask 2

In haha 2021 train, 9,253 records are available for subtask 2. The data is divided as in subtask 1, giving a Training Dataset of 7,402 records, a Validation Dataset of 1,851 records, and a Testing Dataset of 6,000 records.

In subtask 2, I compared the performance of three Cross-language Pre-trained Models and the GPT-2 model on the regression problem: Bert-base-multilingual-uncased, XLM-mlm-xnli15-1024, XLM-RoBerta-base, and GPT-2. The experimental results show that the GPT-2 model performs slightly better than the other three models on subtask 2, so I used the predictions of the GPT-2 model as the answer I finally submitted for subtask 2. The performance of these four models on the Validation Dataset is shown in Table 2.

Table 2. Performance of the four Pre-trained Language Models on the Validation Dataset of subtask 2, measured by root mean square error (RMSE).

Model                              RMSE
Bert-base-multilingual-uncased     0.6833
XLM-RoBerta-base                   0.6761
XLM-mlm-xnli15-1024                0.6719
GPT-2                              0.6683

4.3 Subtask 3 & 4

In haha 2021 train, the humor mechanism and humor target columns have 4,800 and 1,629 non-empty records, respectively. The data is divided as in subtask 1. The Training Dataset sizes for subtask 3 and subtask 4 are 3,840 and 1,303, respectively, the Validation Dataset sizes are 960 and 326, respectively, and the Testing Dataset size for both is 6,000.

After dividing the data, the labels of the Training Dataset and Validation Dataset in subtask 3 and subtask 4 are first encoded into one-hot vectors. Next, the Dataset is processed in the same way as in subtask 1. Then, the sequential function is used to build a two-layer BiLSTM [5]; the numbers of LSTM cells in the two layers are 192 and 64, and the return sequences parameter is set to True and False, respectively. Subtask 3 is a multi-class classification problem, while subtask 4 is a multi-label classification problem; the main difference between the two lies in the activation function of the last layer of the network, which is Softmax for subtask 3 and Sigmoid for subtask 4.

The results of subtask 1 show that the Multilingual Bert model performs slightly better than Albert and XLNet on Spanish Twitter text classification, so in subtask 3 and subtask 4 I also use the Multilingual Bert model as the basis. First, the processed data is input into the Multilingual Bert model; next, the output of the Bert model is passed through the two-layer BiLSTM network to further extract features; finally, the BiLSTM output is fed into the fully connected layer for classification (the fully connected layers of subtasks 3 and 4 have twelve and fifteen neurons, respectively). Subtask 3 outputs a twelve-dimensional vector; the element with the largest value is set to 1 and the remaining elements are set to 0 (in subtasks 3 and 4, 1 and 0 indicate that a category is or is not predicted). Subtask 4 outputs a fifteen-dimensional vector, and the threshold is set to 0.5: each element greater than the threshold is set to 1, otherwise it is set to 0 (a small sketch of this post-processing appears after Table 3). I used the predictions of the combined Multilingual Bert and BiLSTM model as the answers I finally submitted for the last two subtasks. The performance of my solution on the Validation Datasets of subtask 3 and subtask 4 is shown in Table 3.

Table 3. Performance of our solutions on the Validation Datasets of subtask 3 and subtask 4; the evaluation metric for both is Macro-F1.

Model                                      Subtask 3 (Macro-F1)   Subtask 4 (Macro-F1)
Bert-base-multilingual-uncased             0.2433                 0.2927
Bert-base-multilingual-uncased + BiLSTM    0.2516                 0.3044
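A minimal sketch of this output post-processing step, using NumPy; the random scores here merely stand in for the actual Softmax and Sigmoid outputs of the networks.

```python
import numpy as np

def postprocess_subtask3(scores: np.ndarray) -> np.ndarray:
    """Multi-class: set the highest-scoring of the twelve mechanisms to 1, the rest to 0."""
    one_hot = np.zeros_like(scores, dtype=int)
    one_hot[np.arange(len(scores)), scores.argmax(axis=1)] = 1
    return one_hot

def postprocess_subtask4(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Multi-label: a target tag is predicted whenever its score exceeds the threshold."""
    return (scores > threshold).astype(int)

# Toy scores for two tweets: 12 mechanism scores and 15 target scores.
print(postprocess_subtask3(np.random.rand(2, 12)))
print(postprocess_subtask4(np.random.rand(2, 15)))
```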
5 Results

Among all the teams participating in this task, my solution ranked fourth, eighth, fifth, and sixth on the four subtasks, respectively. Table 4 lists the scores of the best performing team on each of the four subtasks of the IberLEF 2021 HAHA competition, the scores of my method on the official Testing Dataset, and the baseline scores.

Table 4. Final evaluation results of my solution in the Humor Analysis based on Human Annotation challenge, together with the best result and the baseline score for each subtask.

Method              Subtask 1 (F1)   Subtask 2 (RMSE)   Subtask 3 (Macro-F1)   Subtask 4 (Macro-F1)
Winning approach    0.8850           0.6226             0.3142                 0.3787
Our proposal        0.8681           0.6797             0.2187                 0.2836
Baseline            0.6619           0.6704             0.1001                 0.0527

6 Conclusion

This paper has described the data processing, the comparison of Pre-trained Models, and the final model construction for the HAHA@IberLEF 2021 challenge. Although the proposed solution achieved good results, there is still considerable room for improvement. Due to time constraints, I have not yet performed an error analysis; in future work, I will conduct a detailed error analysis to understand the limitations of the system.

Several further improvements are possible. First, the emoticon information in tweets could be extracted instead of deleted; if both the emoticons and the text information can be exploited, humor classification should improve. Second, during the experiments I found that the models had overfitting problems, so I will try to use transfer learning [3] to mitigate this. Finally, the distribution of the categories in subtask 3 and subtask 4 is imbalanced, especially in subtask 4; I will therefore consider handling the imbalanced data by weighting the loss function.

References

1. Azhar, A.N., Khodra, M.L.: Fine-tuning pretrained multilingual BERT model for Indonesian aspect-based sentiment analysis (2021)
2. Barbieri, F., Saggion, H.: Automatic detection of irony and humour in Twitter. In: ICCC. pp. 155–162 (2014)
3. Bengio, Y.: Deep learning of representations for unsupervised and transfer learning. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning. pp. 17–36. JMLR Workshop and Conference Proceedings (2012)
4. Bi, B., Li, C., Wu, C., Yan, M., Wang, W., Huang, S., Huang, F., Si, L.: PALM: Pre-training an autoencoding & autoregressive language model for context-conditioned generation (2020)
5. Bohnet, B., McDonald, R., Simões, G., Andor, D., Maynez, J.: Morphosyntactic tagging with a meta-BiLSTM model over context sensitive token encodings. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018)
6. Castro, S., Cubero, M., Garat, D., Moncecchi, G.: Is this a joke? Detecting humor in Spanish tweets. In: Springer International Publishing (2016)
7. Chiruzzo, L., Castro, S., Góngora, S., Rosá, A., Meaney, J.A., Mihalcea, R.: Overview of HAHA at IberLEF 2021: Detecting, rating and analyzing humor in Spanish. Procesamiento del Lenguaje Natural 67(0) (2021)
8. Chiruzzo, L., Castro, S., Rosá, A.: HAHA 2019 dataset: A corpus for humor analysis in Spanish. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 5106–5112 (2020)
9. Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D.: CodeBERT: A pre-trained model for programming and natural languages (2020)
10. Gong, X.R., Jin, J.X., Zhang, T.: Sentiment analysis using autoregressive language modeling and broad learning system. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2019)
11. Goyal, N., Du, J., Ott, M., Anantharaman, G., Conneau, A.: Larger-scale transformers for multilingual masked language modeling (2021)
12. Li, J., He, R., Ye, H., Ng, H.T., Bing, L., Yan, R.: Unsupervised domain adaptation of a pretrained cross-lingual language model. arXiv preprint arXiv:2011.11499 (2020)
13. Li, R., Li, S.: Human behavior recognition based on attention mechanism. In: 2020 International Conference on Artificial Intelligence and Education (ICAIE). pp. 103–107. IEEE (2020)
14. Mihalcea, R., Strapparava, C.: Making computers laugh: Investigations in automatic humor recognition. In: Conference on Human Language Technology and Empirical Methods in Natural Language Processing (2005)
15. Radev, D., Stent, A., Tetreault, J., Pappu, A., Iliakopoulou, A., Chanfreau, A., de Juan, P., Vallmitjana, J., Jaimes, A., Jha, R., et al.: Humor in collective discourse: Unsupervised funniness detection in The New Yorker cartoon caption contest. arXiv preprint arXiv:1506.08126 (2015)
16. Yang, D., Lavie, A., Dyer, C., Hovy, E.: Humor recognition and humor anchor extraction. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015)