=Paper=
{{Paper
|id=Vol-3180/paper-142
|storemode=property
|title=Assessing Wordplay-Pun classification from JOKER dataset with pretrained BERT humorous models
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-142.pdf
|volume=Vol-3180
|authors=Victor Manuel Palma Preciado,Grigori Sidorov,Carolina Palma Preciado
|dblpUrl=https://dblp.org/rec/conf/clef/Palma-PreciadoS22
}}
==Assessing Wordplay-Pun classification from JOKER dataset with pretrained BERT humorous models==
Assessing Wordplay-Pun classification from JOKER dataset with pretrained BERT humorous models

Victor Manuel Palma Preciado, Grigori Sidorov and Carolina Palma Preciado

Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Av. Juan de Dios Batiz, s/n, 07320, Mexico City, Mexico

CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
EMAIL: victorpapre@gmail.com (A. 1); sidorov@cic.ipn.mx (A. 2); c.palma.p0@gmail.com (A. 3)
ORCID: 0000-0003-3901-3522 (A. 2)

Abstract
Humor is one of the most subjective aspects of human behavior, since it involves a wide range of variables: sentiment, wordplay, and double meanings, whether structural or phonetic, all within the construction of written humor. It is important to assess humor from different points of view, since this variability tends to provide insight into the true structure, the main core, of the humoristic dilemma; as we know, the range of humor is so diverse that it poses a demanding problem even in the simplest tasks. Pre-trained base BERT and DistilBERT models, trained on a humorous one-liners dataset, were used; these models were tested on a dataset merged from the data of JOKER Tasks 1 and 3. The collected data was trimmed of duplicate records and special characters to create a final dataset of 3,601 humorous sentences. With this experiment we try to see whether our models are able to detect a type of humor different from the one they were trained on, and it was noted that both models are able to successfully classify another type of humor. While it was expected that the pre-trained models would be able to classify at least a portion of the humor in the dataset, the results obtained were much better than anticipated, reaching 95.64% for BERT and 92.59% for DistilBERT; the models really were able to identify humor. An analysis of the best and worst cases was also carried out.

Keywords
Humor identification, Transformers, Humourism, Classifiers

1. Introduction

As we know, humor has a high written complexity, in addition to its different formats and interpretations, which poses quite a big challenge in the field of NLP, in classification as much as in interpretation and translation tasks. In previous work on the classification of written humor, the results were quite good for a set of one-liners that contained three types of short jokes: the riddle type, the "differences" type, and a short sentence with a single delivery. These one-liners, generally considered humorous, were used to train the BERT-like models and, in turn, embeddings such as ELMo and USE with simple networks, although the latter were surpassed by the BERT-like models. Furthermore, we do not know whether these models were really able to identify humor as a general concept or only the structure of the humor contained in that dataset. This leads to the question of whether the capacity of these models extends to other humorous data, in the style of [1][2][3], that presents a high level of typification, as the dataset for the JOKER [4] tasks does. In this case, we are interested in knowing whether the previously trained models have the ability to recognize the humor found in the dataset provided for the JOKER [4] tasks and, therefore, in checking whether this type of model is capable of recognizing another kind of humor, in addition to examining what was not classified as humor despite being so. This work uses the datasets of Tasks 1 and 3 of JOKER [4] with a preprocessing step, joining them to obtain a larger dataset and, with it, a better representation of humor in its different forms.
All this with the intention of checking whether our methods have somewhat broader validity with respect to this type of humor, or in certain cases the same humor; such a result would corroborate, to a certain extent, this type of classification.

2. Implementation

As one of the first steps in the development of the classification, we opted to perform a preprocessing step in which we removed links, special characters, and duplicates, and a brief superficial manual review was carried out to ensure a certain homogeneity in the union of both sets of data, belonging to Tasks 1 and 3. A portion of the data with the tags [het, hom, pun] was chosen, covering both the test and train splits. Of the 12,540 humorous texts we initially had for classification, after trimming we were left with 3,639 texts before the manual review, and finally ended up with a set of 3,601 records.

Subsequently, the models saved in their TensorFlow format were recovered to be used in the classification, using a wrapper called Ktrain [5] to easily load our BERT-like models, which in turn give us the probability the model assigns to each of the two classes (humor and non-humor); a minimal sketch of this step is given just before Table 1. The aim is also to discover some structure within the new dataset that enriches our knowledge of the capabilities and limitations of our two models, and to see whether these coincide with the weaknesses present in the original model. Once a model was loaded, and as part of our classification, the confidence of belonging to one of the two categories (humor / non-humor) was obtained; the texts were passed through the BERT-humor [6] and DistilBERT-humor [6] models, with which we classified the entire dataset. As is generally known, Transformer models, and specifically the BERT-like ones, tend to behave favorably given a sufficient amount of data in most classification, question-answering, ranking, and explainability tasks, among others. In addition, the ELI5 library was used to obtain weighted attention over the items to be classified, since it allows us to better visualize the structure we want to study.

3. Results

After performing the prediction process with the two pre-trained models proposed for the task of identifying humorous text (BERT and DistilBERT), it was found that the BERT model obtained better results, since it identified 95.64% of the dataset as humor, which means that of the 3,601 sentences in the dataset obtained from JOKER it correctly detected 3,444 records. For its part, DistilBERT, although it did not achieve the same performance, obtained good results, since 3,334 texts were correctly identified, thus reaching 92.59%. Table 1 shows the performance of the two models; although both managed to recognize most of the humorous texts, there is a small difference of 110 incorrectly classified records between the two.
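Before turning to Table 1, the following is a minimal sketch of the cleaning, loading, and prediction steps described in Section 2. It assumes ktrain's standard load_predictor API; the file paths, column names, cleaning rules, and class index are hypothetical stand-ins, not necessarily the exact ones used in this work.

```python
import re
import pandas as pd
import ktrain

# Merged JOKER Task 1 / Task 3 texts (hypothetical file and column names).
df = pd.read_csv("joker_tasks_1_and_3.csv")

def clean(text: str) -> str:
    """Preprocessing as in Section 2: drop links and special characters."""
    text = re.sub(r"https?://\S+", "", text)            # remove links
    text = re.sub(r"[^A-Za-z0-9'?!.,:; ]", " ", text)   # remove special characters
    return re.sub(r"\s+", " ", text).strip()

df["text"] = df["text"].astype(str).map(clean)
df = df.drop_duplicates(subset="text").reset_index(drop=True)

# Load a previously saved ktrain predictor (e.g. the BERT-humor model).
predictor = ktrain.load_predictor("models/bert_humor")

# Per-text confidence for the two classes (non-humor / humor).
probs = predictor.predict_proba(df["text"].tolist())
humor_confidence = probs[:, 1]   # assumption: class index 1 is "humor"

detected = int((humor_confidence > 0.5).sum())
print(f"Detected as humor: {detected} of {len(df)} "
      f"({100 * detected / len(df):.2f}%)")
```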
Table 1
BERT-like models performance

Models | Performance | Correctly detected | Incorrectly detected
BERT | 95.64% | 3,444 | 157
DistilBERT | 92.59% | 3,334 | 267

To evaluate the performance of the models employed, which identify whether a text is humorous or not, a function that predicts the class to which each text belongs was applied, and the confidence probability was calculated for each record to assess how sure the model is that a text really belongs to the humor class. Even though the functions used to calculate the confidence yield values with six decimal places, it was decided to group them into five ranges of width 0.2 to present a more visual representation of the obtained data (a small sketch of this binning is given after Table 3); these results are shown in Figure 1. The confidence extracted from the predictions made by BERT shows that the model classified most of the texts as humor with a certainty greater than 0.8, although 105 texts received a score below 0.2, which implies the model considers them non-humorous; these cases were the minority. The three ranges between 0.2 and 0.8 contained fewer records, which indicates that the model mostly predicts with high confidence.

Figure 1: BERT humorous probability classification

In the case of the DistilBERT model, the results obtained are similar to those found with BERT, since most of the data again had high confidence: 3,134 texts reached a value greater than 0.8. On the other hand, the columns with ranges from 0.2 to 0.8 held more data than with BERT, but even so they represent a smaller group; overall, both models present similar results (Figure 2). It is worth mentioning that the models recognize as humor those texts that reach a score above 0.5.

Figure 2: DistilBERT humorous probability classification

Once the analysis of the probability values obtained from each model was carried out, the humorous texts with the best and worst scores were identified in order to visualize which types of writing the models manage to distinguish better than others. Table 2 presents the five best humorous texts detected by BERT; puns (PUN) are the ones that obtained the best probabilities. It can also be observed that the confidence scores are high, since they almost reach 1.

Table 2
Top BERT positive cases

Tag | Humorous text | Probability
PUN | Why did the pig leave the party early? Because everyone thought he was a boar! | 0.9999985
PUN | Why are ghosts bad liars? Because you can see right through them! | 0.9999982
HOM | Why don't people like to talk to garbage men? They mostly talk trash. | 0.9999981
PUN | Why don't sharks eat clowns? Because they taste funny. | 0.9999980
PUN | Why did the student eat his homework? Because the teacher told him it was a piece of cake! | 0.9999980

Likewise, the results in Table 3, which shows the DistilBERT model, are similar to those described above, since they are also mostly puns and texts with the tag HOM, with the difference that DistilBERT has among its best-scored texts one with the tag HET. For both models, the best scores go to riddle texts, which have a question-and-answer format.

Table 3
Top DistilBERT positive cases

Tag | Humorous text | Probability
PUN | What's the best fruit for avoiding scurvy? Naval oranges, of course. | 0.9996407
PUN | What does an angry pepper do? It gets jalapenos face. | 0.9996404
PUN | What do you call a duck that gets all A's? A wise quacker. | 0.9996402
HOM | There was an eye doctor who wanted to re-locate but couldn't find a job because he didn't have enough contacts. | 0.9996402
HET | What is the best store to be in during an earthquake? A stationery store. | 0.9996401
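Returning to the histograms of Figures 1 and 2, the following is a small sketch of the five-bin grouping of confidence values, assuming the humor_confidence array from the earlier sketch:

```python
import numpy as np

# Five ranges of width 0.2 over the [0, 1] confidence scale, as in Figures 1 and 2.
edges = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
counts, _ = np.histogram(humor_confidence, bins=edges)

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f}-{hi:.1f}: {n} texts")
```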
On the other hand, in the case of the humorous texts with the worst probabilities, presented in Table 4, the BERT model mostly scored HET texts lowest, with very few HOM and wordplay (PUN) texts. As can be seen, the scores obtained are low, since they tend to 0; this may be due to the structure of this type of example, since, in contrast with the best results, no riddles are found among the worst rated.

Table 4
Top BERT negative cases

Tag | Humorous text | Probability
HET | Opportunities take ''now'' for an answer | 0.0000654
HET | Podiatrist malpractice: Callous neglect | 0.0001483
HOM | Bill Gates took advantage of his Windows of opportunity | 0.0001519
HET | Exposure to the son may prevent burning | 0.0002049
PUN | Can honeybee abuse lead to a sting operation? | 0.0002278

In the same way, this phenomenon occurs for the DistilBERT model, since among the humorous texts with the worst scores there are two that also appear in the results achieved by BERT: "Opportunities take ''now'' for an answer" and "Exposure to the son may prevent burning". This indicates that both models perceive the same texts as not humorous, which points to a similar detection structure.

Table 5
Top DistilBERT negative cases

Tag | Humorous text | Probability
HET | Exposure to the son may prevent burning | 0.0007237
HET | Opportunities take ''now'' for an answer | 0.0007271
HET | Exposure to the son prevents burning | 0.0007380
HOM | Could modern submarines be the wave of the future? | 0.0007441
HET | A budget helps us live below our yearnings | 0.0007591

3.1. ELI5 prospection

The ELI5 library was used to get a sense of where our BERT-like models place their point of attention when classifying humor. Given that it handles a joint probability, we can observe that, in general, the humorous sentences that fared best were those of the riddle type, which consist of a question followed by an answer that carries the humorous delivery.

Bert - humorous texts with the best scores:
Why did the pig leave the party early? because everyone thought he was a boar!
Why are ghosts bad liars? because you can see right through them!

DistilBERT - humorous texts with the best scores:
What's the best fruit for avoiding scurvy? Naval oranges, of course.
What does an angry pepper do? It gets jalapenos face.

As we can see above, humorous question-and-answer statements were rated rather highly, since in general this type of text was successfully evaluated. Moreover, the probability colorations that ELI5 marks make sense, since the texts start with WH-questions, one of the main ways of gathering information; and since a riddle deliberately reveals only a small amount of information up front, the attention pattern fits the delivery that follows the question, taking a darker coloration at the end of the sentence for both the question and the answer.
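Colorations like those described above can be produced with ELI5's LIME-based TextExplainer, which fits a local surrogate model around any black-box predict_proba function. Pairing it with the ktrain predictor from the earlier sketch is an assumption for illustration, not necessarily the exact setup used in this work:

```python
from eli5.lime import TextExplainer

# Fit a local surrogate model around the black-box predictor for one text.
te = TextExplainer(random_state=42)
te.fit("Why did the pig leave the party early? "
       "Because everyone thought he was a boar!",
       predictor.predict_proba)

# Render the per-token weights (the "coloration" discussed above);
# in a notebook this displays the highlighted text.
te.show_prediction(target_names=["no-humor", "humor"])
```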
Bert - humorous texts with the worst scores:
Opportunities take ''now'' for an answer
Podiatrist malpractice: Callous neglect

DistilBERT - humorous texts with the worst scores:
Exposure to the son may prevent burning
Opportunities take ''now'' for an answer

On the other hand, the sentences that did worst turned out to share a certain pattern: looking closely, these sentences could very well be the title of a review or article, and their content, somewhere between serious and humorous, seemed to confuse the model's decision when it came to classification. It seems that the model, judging by the coloration, focuses on nouns such as "Opportunities", "Malpractice" and "Exposure" to classify the text as something that does not contain humor.

4. Conclusion

The pre-trained models used in this work on humorous text identification managed to detect the majority of the dataset obtained from JOKER as humor, with a good outcome for both models. It should be noted that the negative part of the dataset previously used to train the models, that is, its non-humorous portion, seems to strongly affect the result. Transformer-based models tend to classify humorous texts well; in this case, BERT and DistilBERT manage to classify humorous texts with high probabilities. A tendency was also found for the models to give better scores to the riddle-style texts in the dataset (mostly puns), as opposed to the other types of examples. Therefore, we conclude that the structure of a humorous text strongly influences its identification. It is not surprising that the models behaved favorably with the JOKER dataset, since the humor contained in it is, in certain respects, very similar to the humor on which these models were trained. It should be noted that certain curiosities were found among the best and worst classified elements: patterns that have a lot to do with the counterpart of the humor on which the models were trained, giving a view of some weaknesses and strengths of the models, as discussed in Section 3.1.

5. Acknowledgements

The work was done with partial support from the Mexican Government through grant A1-S-47854 of CONACYT, Mexico, and grants 20220852 and 20220859 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources provided through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America PhD Award.

6. References

[1] R. Mihalcea and C. Strapparava, "Making computers laugh: investigations in automatic humor recognition," in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005, pp. 531–538.
[2] K. E. L. Miller, "The Unuttered Punch Line: Pragmatic Incongruity and the Parsing of 'What's the Difference' Jokes," 3 December 2009.
[3] O. Weller and K. Seppi, "The rJokes Dataset: a Large Scale Humor Collection," in Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 6136–6141. European Language Resources Association.
[4] L. Ermakova, T. Miller, O. Puchalski, F. Regattin, É. Mathurin, S. Araújo, A.-G. Bosser, C. Borg, M. Bokiniec, G. L. Corre, B. Jeanjean, R. Hannachi, Ġ. Mallia, G. Matas, and M. Saki, "CLEF Workshop JOKER: Automatic Wordplay and Humour Translation," in M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, and V. Setty (Eds.), Advances in Information Retrieval, Vol. 13186, Springer International Publishing, 2022, pp. 355–363.
[5] A. S. Maiya, "Ktrain: A Low-Code Library for Augmented Machine Learning," 2020. arXiv:2004.10703.
[6] V. M. Palma, Automatic Detection of Jokes in Texts, Master's thesis, Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Ciudad de México, México, 2021.