BERT and SHAP for Humor Analysis based on Human Annotation

Guillem García Subies, David Betancur Sánchez, Alejandro Vaca
Instituto de Ingeniería del Conocimiento, Francisco Tomás y Valiente st. 11, EPS, B Building, 5th floor, UAM Cantoblanco, 28049 Madrid, Spain
guillem.garcia@iic.uam.es, david.betancur@iic.uam.es, alejandro.vaca@iic.uam.es

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. This paper describes a system created for the HAHA 2021 shared task, framed within the IberLEF 2021 workshop [6]. We present an approach based mainly on fine-tuned and hyperparameter-optimized BERT models for binary, multiclass and multilabel classification. Our models far outperform the baselines and achieve results close to the state of the art. We also present a SHAP-values-based analysis to explain what the models consider humorous and what they do not.

Keywords: humor detection · BERT · Transformers · multiclass · multilabel · hyperparameter optimization

1 Introduction

Humor has historically been studied from different domains such as linguistics, history, literature and psychology. Machine learning and computational linguistics have been applied in several studies [2,11,14], but there is still much left to tackle. This article describes the use of a Spanish BERT [7] model for some of the proposed tasks: humor prediction, humor mechanism classification and humor target classification. For all the tasks, fine-tuning was performed for binary, multiclass and multilabel problems. Additionally, for the binary task, a hyperparameter optimization was carried out. After the predictions were made, explainability models were used to highlight the portions of the text that convey the humor, giving us some insight into why certain texts produce laughter.

2 Related Work

Most of the work on humor detection predates the appearance of Transformer [16] models. After this milestone, state-of-the-art models started to be based on Transformers.

In a generic way, Sun et al. [15] propose very interesting techniques for text classification, although they focus on longer texts. Among their conclusions are that the top layer of BERT is the most useful for text classification, that within-task and in-domain further pre-training can significantly boost performance, and that a preceding multi-task fine-tuning also helps single-task fine-tuning, although its benefit is smaller than that of further pre-training. Specifically for short and humorous texts, ColBERT [1] is a novel approach based on sentence embeddings that achieves the best state-of-the-art results for English data. For the Spanish language, we can highlight the results of the 2019 HAHA shared task [5], where the best systems were BERT-based models.

Hand-crafted features have also been evaluated for detecting humor in texts. For example, in [2] the authors used features such as animal presence, keywords and a binary variable indicating whether a tweet is a dialogue. This feature engineering, though helpful for classical ML models, is based on heuristics rather than on a deep understanding of language, so it cannot really tell us why something is funny.
A language model that takes the raw text as input, on the other hand, has to work with the actual meaning of the words in the concrete context of that input, so it may reach a deeper and more complete understanding of the role language plays in the construction of humor.

3 Tasks Description

The main corpus consists of 24000 texts in Spanish for training and 6000 for evaluation. For each text we predicted whether it was humorous or not, the mechanism of the joke and the target or targets of the joke.

The first task, Humor Detection, consisted of determining whether a tweet is a joke or not (i.e. whether the humor was intended by the author). Its performance was measured using the F1 score of the 'humorous' class. The second task, Funniness Score Prediction, consisted of assigning a rating from 1 to 5 for how funny a tweet is; here we did not build a competitive model. The third task, Humor Mechanism Classification, consisted of predicting the mechanism by which the tweet conveys humor from a set of classes such as irony, wordplay or exaggeration; only one class per tweet was allowed, and performance was measured using the macro-F1 score. The fourth and last task, Humor Target Classification, consisted of predicting the target of the joke (what it is making fun of) from a set of classes such as racist jokes, sexist jokes, etc.; its performance was also measured using the macro-F1 score.

As we can see in Table 1, the dataset is unbalanced. That is the main reason why the F-measure is used as the ranking metric.

Class             Nº Samples Task 1   Nº Samples Task 3
non-humor         14747               14747
humor             9253                -
wordplay          -                   701
reference         -                   578
exaggeration      -                   476
unmasking         -                   441
misunderstanding  -                   416
absurd            -                   566
irony             -                   371
analogy           -                   319
embarrassment     -                   301
parody            -                   255
stereotype        -                   230
insults           -                   146

Table 1. Distribution of samples

In the table below, we can see some illustrative examples of the data and their labels:

La realidad es dura pero se tiene que afrontar.  [non-humor]
#20CosasQueHacerAntesDeMorir: Enseñarles la diferencia entre: -Hay de haber -Ahí de lugar -Ay de exclamar - Ai se eu te pego.  [reference]
Te quiero pero #YoTan Twitter y tú tan Facebook.  [analogy]
Cambié mi contraseña de Twitter por "incorrecta", si se me olvida, twitter me la recordará: Su contraseña es "incorrecta". Soy una genio  [irony]
WhatsApp cayó varias veces en 2015 y vos todavía no caes que nadie te soporta.  [insults]
Soy virgen, lo juro por mis dos hijos!  [absurd]
—Bienvenido a los X-Men, ¿cuál es tu poder? —Creo regresaré con mi Ex —Muy bien, te llamaremos "Bestia".  [parody]
–¿Tiene pastillas para la diarrea? –No. –Ok, deme un rollo de papel higiénico :(  [embarrassment]
—Hola linda, ¿Por qué tan sola? —Es que me vine a tirar un pedo.  [unmasking]
—¿Y Thomas? —No, yo no tomo. —No, ¿Que si Thomas vino? —No me gusta el vino. —¡No! ¡¿Que si llegó Thomas?! —No, no tomaré ni aunque llegues.  [misunderstanding]
Teníamos una farmacia pero la cerramos porque no teníamos mas remedio. #fb  [wordplay]
Si yo fuera presidente haria pintar la casa de gobierno de celeste, porque soy varon #chistes #humor  [stereotype]
Doctor, ¿cuanto me queda de vida? - Diez... - Diez qué? - Diez, nueve, ocho, siete... #humortico  [exaggeration]

Table 2. Examples of the different classes

4 Models

4.1 Data Preprocessing

We performed a simple text preprocessing in which we substituted some expressions with a more normalized form, as sketched below:

– Every URL was replaced with the token "[URL]" so we do not get strange tokens when the tokenizer tries to process a URL. Furthermore, no semantic information about humor can be inferred from a URL; the only information relevant for the model is that there is a URL in that position.
– The hashtag characters ("#") were deleted ("#example" → "example") because the base language models we use are trained on generic text and might not understand their meaning. Furthermore, most hashtags are used in the same way as normal words.
– We replaced every username with the generic token "[USER]" because the exact name of a user does not really add any information about the humor. The only relevant feature is knowing whether someone was mentioned, not who.
– Finally, we normalized every laugh ("jasjajajajj" → "haha") to minimize the noise from misspellings, which are common in social networks.
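A minimal Python sketch of this normalization is shown below. The exact rules are not part of the paper, so the regular expressions here are only an approximation of the substitutions described above:

    import re

    # Illustrative normalization rules; the patterns are an approximation,
    # not the exact ones used for the shared task.
    URL_RE = re.compile(r"https?://\S+|www\.\S+")
    USER_RE = re.compile(r"@\w+")
    HASHTAG_RE = re.compile(r"#(\w+)")
    LAUGH_RE = re.compile(r"\b(?=\w*j)[jahs]{4,}\b", re.IGNORECASE)  # "jajaja", "jsjsjs", ...

    def normalize(text: str) -> str:
        text = URL_RE.sub("[URL]", text)      # a URL carries no humor-related semantics
        text = USER_RE.sub("[USER]", text)    # only the presence of a mention matters
        text = HASHTAG_RE.sub(r"\1", text)    # "#example" -> "example"
        text = LAUGH_RE.sub("haha", text)     # "jasjajajajj" -> "haha"
        return text

    print(normalize("@amigo mira esto jajajaja #ejemplo https://t.co/abc"))
    # [USER] mira esto haha ejemplo [URL]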
4.2 Baselines

The competition organizers provided some baselines to compare our models with. The baselines consisted of the following models:

– Task 1: Naive Bayes with tf-idf features (0.6493 F1 over the dev corpus).
– Task 2: SVM regression with tf-idf features (0.6532 RMSE over the dev corpus).
– Task 3: Naive Bayes with tf-idf features (0.1038 macro-F1 over the dev corpus).
– Task 4: assign label X if the tweet contains one of the "top" words for label X in the training corpus, where the top words were selected as the 50th to 60th most frequent words for the label (0.0595 F1 over the dev corpus).

4.3 Language Models

As our main language model we selected BETO [4], a BERT model trained with the Spanish Unannotated Corpora (SUC) [3] that has proven to be much better than the multilingual BERT model. The fine-tuning was performed separately for each task, varying the last layer of the architecture to make binary (Task 1), regression (Task 2), multiclass (Task 3) and multilabel (Task 4) predictions.

For Task 1, Task 2 and Task 3, the default loss was used. For Task 4, a custom loss was plugged into the Trainer class of the Transformers library [17] to handle multilabel data: BCEWithLogitsLoss from PyTorch, which computes the binary cross-entropy for each label and then averages the values.

In addition, for Task 1 we carried out a grid-search optimization over the main hyperparameters of the network: learning rate, batch size and dropout rate. The search was performed with a 5-fold stratified cross-validation over the following grid: learning rate in (1e-6, 1e-5, 3e-5, 5e-5, 1e-4); batch size in (8, 16, 32); and dropout rate in (0.08, 0.1, 0.12). The best parameters found were a learning rate of 1e-5, a batch size of 16 and a dropout rate of 0.1. For Tasks 2, 3 and 4, the default hyperparameters were used.

For the final Task 4 predictions, we computed the F1 metric at different thresholds on the validation set in order to convert the logits into classes. The threshold values evaluated were 0.2, 0.3 and 0.4; the best one on validation was 0.2, but on the test set the best result was obtained with a threshold of 0.4. Finally, the number of fine-tuning epochs was 5 for Task 1, 2 for Task 2, 7 for Task 3 and 8 for Task 4.

For explainability, SHAP values were calculated for each token of the sentences, and randomly chosen sentences were inspected to look for insights.
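As an illustration of the Task 4 setup, the sketch below shows one way to plug BCEWithLogitsLoss into the Transformers Trainer, in the spirit of the description above. The checkpoint id, the number of labels, the output directory and the dataset variables are assumptions made for the example, not details taken from the paper:

    from torch import nn
    from transformers import (AutoModelForSequenceClassification, Trainer,
                              TrainingArguments)

    # Assumed checkpoint id for BETO and an illustrative number of target labels.
    model = AutoModelForSequenceClassification.from_pretrained(
        "dccuchile/bert-base-spanish-wwm-cased", num_labels=12)

    class MultilabelTrainer(Trainer):
        """Trainer whose loss is a binary cross-entropy computed for every
        label and then averaged (BCEWithLogitsLoss), as described in the text."""
        def compute_loss(self, model, inputs, return_outputs=False):
            labels = inputs.pop("labels")       # float tensor, shape (batch_size, num_labels)
            outputs = model(**inputs)
            loss = nn.BCEWithLogitsLoss()(outputs.logits, labels.float())
            return (loss, outputs) if return_outputs else loss

    # 8 epochs were used for Task 4; the remaining hyperparameters were left
    # at their default values.
    args = TrainingArguments(output_dir="beto-haha-task4", num_train_epochs=8)
    # trainer = MultilabelTrainer(model=model, args=args,
    #                             train_dataset=train_ds, eval_dataset=val_ds)
    # trainer.train()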
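The threshold selection for Task 4 can be sketched as follows, assuming the logits are first passed through a sigmoid and that macro-F1 (the Task 4 metric) is the score being maximized; the function and variable names are illustrative:

    import numpy as np
    from sklearn.metrics import f1_score

    def pick_threshold(val_logits, val_labels, thresholds=(0.2, 0.3, 0.4)):
        """Binarize sigmoid probabilities at each candidate threshold and keep
        the one with the best macro-F1 on the validation set."""
        probs = 1.0 / (1.0 + np.exp(-np.asarray(val_logits)))
        best_t, best_f1 = None, -1.0
        for t in thresholds:
            preds = (probs >= t).astype(int)
            score = f1_score(val_labels, preds, average="macro", zero_division=0)
            if score > best_f1:
                best_t, best_f1 = t, score
        return best_t, best_f1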
5 Experiments and Results

5.1 Experimental Setup

We trained all the models on an NVIDIA Tesla P100-PCIE-16GB GPU and an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz with 500 GB of RAM. The software we used was Python 3.8, transformers 4.5.1 [17], pytorch 1.8.1 [12], shap 0.39.0 [10] and scikit-learn 0.24.1 [13].

5.2 Results

In Table 3, under Task 1, we can see the results of our models on the test set of the first task, where we obtained second place. Under Task 2, we can see the results on the test set of the second task. For the third task, the results dropped in terms of F1 because of the difficulty of the multiclass problem; Table 3, under Task 3, shows them in more depth. Finally, for Task 4 the results varied a lot: the baseline was very poor and the winner was far ahead of the other competitors. Our model did well compared to the baseline, but there is still a long way to go to reach the winner (Table 3, under Task 4).

Model      Task 1   Task 2   Task 3   Task 4
BETO       0.87     0.68     0.25     0.31
baseline   0.66     0.66     0.1      0.05
Winner     0.885    0.62     0.33     0.42

Table 3. Results for each task (Task 1: F1 of the humorous class; Task 2: RMSE; Tasks 3 and 4: macro-F1)

For the explainability SHAP model we visualized the sentences, highlighting the parts that led the model to a "humor" prediction. We found three main cases:

– Case 1: some jokes have a very specific format, such as dialogues between characters. For example, in Fig. 1 the "–" characters that mark a dialogue are very important for a humor prediction. As mentioned in the Related Work section, Castro et al. [2] also use features that detect dialogue in a tweet, so it is definitely an important signal for detecting humor. It is, however, a weak heuristic: a text being in conversation form is not inherently funny, it is simply a typical form people use to express humor on Twitter. It therefore does not mean the models really understand how humor is constructed in general; this aspect has nothing to do with the language used or the humor techniques employed (irony, sarcasm, etc.), only with the structure of the text.
– Case 2: some jokes are not exactly in the training set, but very similar ones are, so the model "overfits" (it gives high weight to words just for appearing in training examples, not because they are humorous) and produces high prediction values. Fig. 2 looks like such a case of overfitting. We searched the training set and found this tweet: "—Mi amor ¿me compras un teléfono? —¿Y el otro? —El otro me va a comprar un iPad —¡ME REFERÍA AL OTRO TELÉFONO! — :decepcionado:". Both texts are very similar, so the model overfits.
– Case 3: finally, there is the case that we think represents the best kind of prediction, where the model understands the relations between words and recognizes the context as humorous. An example can be seen in Fig. 3.

Fig. 1. Case for format joke
Fig. 2. Case for repeated joke
Fig. 3. Case for real joke
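The token-level attributions shown in Figures 1-3 can be reproduced, for instance, by wrapping the fine-tuned classifier in a Transformers pipeline and handing it to a shap Explainer. The model path and example sentence below are placeholders, and this is only one possible way of computing the values with the library versions listed above:

    import shap
    from transformers import pipeline

    # "./beto-haha-task1" is a placeholder path for the fine-tuned Task 1 model.
    classifier = pipeline("text-classification",
                          model="./beto-haha-task1",
                          return_all_scores=True)

    explainer = shap.Explainer(classifier)   # shap builds a text masker from the pipeline tokenizer
    shap_values = explainer(["–¿Tiene pastillas para la diarrea? –No. –Ok, deme un rollo de papel higiénico :("])
    shap.plots.text(shap_values)             # renders the per-token highlighting (e.g. in a notebook)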
6 Conclusions and Future Work

Through this shared task, we have seen that NLP can be of great help in detecting and classifying humorous and non-humorous texts, and also that there is still a long way to go. As explained above, some of the parts of the text that the models use to decide whether a text is humorous are based on heuristics, such as whether or not the tweet represents a conversation. This shows that there is still much work to do before language models understand the semantics of a text well enough to capture, independently of the text form, the aspects that cause laughter. Moreover, humor is an expression of high-level intelligence conveyed through sophisticated communication techniques, so understanding only the literal meaning of the text is probably not enough in many cases.

The results obtained by our systems are very promising given their performance and their simplicity. Furthermore, the use of explainability models can really help to gain insight into model behaviour on this kind of data. All of this is significant and could lead to much better results when combined with other improvements from the state of the art. We believe that our results could improve considerably by using language models trained on social-network corpora, such as TWilBERT [8] for Spanish tweets. Finally, we have seen that good hyperparameters are also key to a good neural network, so a better search method such as Population Based Training [9], though computationally expensive, would likely further improve the model.

Acknowledgments

This work has been partially funded by the Instituto de Ingeniería del Conocimiento (IIC), which also provided the hardware used.

References

1. Annamoradnejad, I., Zoghi, G.: ColBERT: Using BERT sentence embedding for humor detection (2021)
2. Castro, S., Cubero, M., Garat, D., Moncecchi, G.: Is this a joke? Detecting humor in Spanish tweets (11 2016). https://doi.org/10.1007/978-3-319-47955-2_12
3. Cañete, J.: Compilation of large Spanish unannotated corpora (May 2019). https://doi.org/10.5281/zenodo.3247731
4. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: to appear in PML4DC at ICLR 2020 (2020)
5. Chiruzzo, L., Castro, S., Etcheverry, M., Garat, D., Prada, J.J., Rosá, A.: Overview of HAHA at IberLEF 2019: Humor analysis based on human annotation. In: IberLEF@SEPLN. pp. 132–144 (2019)
6. Chiruzzo, L., Castro, S., Góngora, S., Rosá, A., Meaney, J.A., Mihalcea, R.: Overview of HAHA at IberLEF 2021: Detecting, Rating and Analyzing Humor in Spanish. Procesamiento del Lenguaje Natural 67(0) (2021)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
8. Ángel González, J., Hurtado, L.F., Pla, F.: TWilBERT: Pre-trained deep bidirectional transformers for Spanish Twitter. Neurocomputing (2020). https://doi.org/10.1016/j.neucom.2020.09.078, http://www.sciencedirect.com/science/article/pii/S0925231220316180
9. Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W.M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., Kavukcuoglu, K.: Population based training of neural networks (2017)
10. Lundberg, S.M., Lee, S.I.: SHAP: A unified approach to interpreting model predictions. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
11. Mihalcea, R., Strapparava, C.: Learning to laugh (automatically): Computational models for humor recognition. Computational Intelligence 22, 126–142 (05 2006). https://doi.org/10.1111/j.1467-8640.2006.00278.x
12. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
13. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
14. Sjöbergh, J., Araki, K.: Recognizing humor without recognizing meaning (07 2007). https://doi.org/10.1007/978-3-540-73400-0_59
15. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? (2020)
16. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need (2017)
17. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. Association for Computational Linguistics, Online (Oct 2020), https://www.aclweb.org/anthology/2020.emnlp-demos.6