CLEF 2024 JOKER Tasks 1-3: Humour Identification and Classification
Notebook for the JOKER Lab at CLEF 2024

Rowan Mann¹, Tomislav Mikulandric²
¹ Christian-Albrechts-Universität zu Kiel (CAU), Christian-Albrechts-Platz 4, 24118 Kiel, Germany
² The University of Split, Ul. Ruđera Boškovića 31, 21000 Split, Croatia

Abstract
The CLEF 2024 JOKER track focuses on the automatic processing of wordplay through three tasks: humour-aware information retrieval, humour classification by genre and technique, and translation of puns from English to French. Recent advancements in Large Language Models (LLMs) have enhanced their conversational abilities, yet they still struggle with humour detection and generation. Addressing this gap can significantly improve human-computer interactions. For Task 1, we implemented a TF-IDF vectorizer and logistic regression model to identify and rank humorous sentences. The model achieved an F1 score of 0.93 for non-puns and 0.73 for puns, indicating robust performance in humour detection. In Task 2, we classified jokes into six categories using logistic regression, naive Bayes, and support vector machines. The SVM model performed best, with F1 scores ranging from 0.14 to 0.61, showing particular efficacy in classifying wit. Task 3 involved translating puns using the MarianMT model from the Hugging Face library. Although successful, the process was time-intensive, suggesting the need for more efficient methods. Overall, our approaches demonstrated effective humour identification and translation capabilities but faced challenges in genre-specific classification. This research underscores the importance of improving LLMs' humour processing abilities for better human-computer interaction.

Keywords
Large Language Models, Humour identification, Humour classification, TF-IDF

1. Introduction

In recent years, Large Language Models (LLMs) have grown enormously in capability. Since the release of GPT-4 in 2023, the world has quickly become accustomed to interacting with LLMs through chat interfaces that make users feel as though they are truly conversing with the machine. A 2024 study applied the Turing test to participants interacting with GPT-4, finding that humans incorrectly judged it to be human 54 percent of the time, showing just how convincingly LLMs can now emulate human conversation [1]. As impressive as this is, LLMs are still found lacking in several areas, humour being one of them. Large language models struggle to reliably detect and explain humour [2, 3, 4] and to generate novel jokes [5]. For humans, humour plays a central role in forming relationships and can enhance performance and motivation [6]. Giving LLMs a good grasp of humour therefore has the potential to massively improve the success of human-computer interactions.

Puns are a form of humour based on wordplay. Puns usually exploit the double meanings of words, or the similarity of sounds between different words, to create a humorous or witty effect, with frequent use of double entendres, homophones, and similar-sounding words. Word pairs that lend themselves to puns include "profit" and "prophet", or "check" and "Czech".
Perhaps what makes humour analysis such a challenge for LLMs is the many subtly different forms it can take. Humour can be on-the-nose, physical, awkward, subtle, obvious, visual, childish, or intelligent; it is notoriously hard to define, which presents problems for LLMs. Wordplay, by its very nature, exploits the intrinsic structure of the source language and characteristics for which it may be difficult or impossible to find replacements or analogues in the target language [7]. Translating jokes between languages therefore requires more than simply replacing each word with its counterpart in the target language. The deeper understanding of context this requires is an area where LLMs could excel, if suitable methods are developed.

The JOKER track of CLEF 2024 aims to develop interdisciplinary approaches to the automatic processing of wordplay [8]. This year, the JOKER track is split into three tasks:

• Task 1: Humour-aware information retrieval.
• Task 2: Humour classification according to genre and technique.
• Task 3: Translation of puns from English to French.

Can LLMs succeed in identifying, classifying, and translating humour? In this paper we explore this question, detailing our workflow and results for the three tasks of the JOKER track.

2. Task 1: Experimental Setup

2.1. Data Description

The data provided consisted of four JSON files: a "corpus" file containing a list of 61,268 pun and non-pun sentences, "queries test" and "queries train" files listing the keyword linked to each sentence, and a "qrels train" file indicating whether each sentence was a pun ("1") or not ("0").

2.2. Method

Our first step was to merge the data, creating a table with five columns: "qid", "docid", "qrel", "text", and "query". We then used a TF-IDF vectorizer to train the model on all of our text and the corresponding "qrel" values.

# Combine query text and joke text into a single column for the TF-IDF vectorizer
data_merged['text_all'] = data_merged['query'] + " " + data_merged['text']

# Fit and transform the combined text
tfidf_matrix = tfidf_vectorizer.fit_transform(data_merged['text_all'])

We then created a logistic regression model from our training data and used it to make predictions on our "queries test" data.

from sklearn.linear_model import LogisticRegression

# Logistic regression model
model = LogisticRegression()

# Trained model
trained_model = model.fit(X_train, y_train)

The model returned a relevance score for each joke in the corpus for each query (Appendix A). Based on these scores we produced a JSON file listing the "best", or rather most relevant, jokes (Appendix B).

3. Task 1: Experimental Results

The results obtained from our model were very positive. Calculating F1 scores for our model produced 0.93 for the "0" (no pun) class and 0.73 for the "1" (pun) class. This indicates that it performed extremely well, with high precision and recall, at identifying sentences without puns, and moderately well at identifying those with puns.
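For reference, per-class F1 scores of this kind can be computed with scikit-learn's classification_report on a held-out slice of the labelled training data. The following is a minimal sketch rather than our exact notebook code; the names texts and qrels stand in for the merged text column and the 0/1 labels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hold out part of the labelled training data for evaluation
# (texts and qrels are illustrative names for the merged text and 0/1 labels)
X_tr, X_val, y_tr, y_val = train_test_split(texts, qrels, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
X_tr_tfidf = vectorizer.fit_transform(X_tr)
X_val_tfidf = vectorizer.transform(X_val)

model = LogisticRegression()
model.fit(X_tr_tfidf, y_tr)

# Per-class precision, recall and F1 for the "no pun" (0) and "pun" (1) classes
print(classification_report(y_val, model.predict(X_val_tfidf), target_names=['no pun', 'pun']))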
4. Task 2: Experimental Setup

4.1. Data Description

The data provided consisted of three JSON files. "classification test data" and "classification train input data" each contained a "text" column holding thousands of jokes. The labels for our training data were provided in the file "classification train qrels data", which categorised each joke into one of six classes: IR (irony), SC (sarcasm), EX (exaggeration), AID (incongruity), SD (self-deprecating), and WS (wit).

4.2. Method

We merged the training data JSONs to create a dataframe with the text and classes side by side. We preprocessed the text by expanding contractions, lowercasing all letters, and removing special characters and URLs, then replaced the original text with the cleaned text in the data frame containing the training labels (Appendix C).

We encoded our class labels as numbers and used TF-IDF to vectorise the text. The data was then used to train a logistic regression model, a naive Bayes model, and a support vector machine model.

# Train logistic regression model
logistic_regression_model = LogisticRegression(max_iter=1000)
logistic_regression_model.fit(X_train_tfidf, y_train)

# Train naive Bayes model
naive_bayes_model = MultinomialNB()
naive_bayes_model.fit(X_train_tfidf, y_train)

# Train SVM model
svm_model = SVC(kernel='linear')  # other kernels such as 'poly' or 'rbf' can also be specified
svm_model.fit(X_train_tfidf, y_train)

With our models trained, we made predictions on our test data, which we preprocessed in the same way as the training data. We then converted the predicted classes back from numbers to their original names (Appendix D).

5. Task 2: Experimental Results

The results of our F1 testing reveal large differences in performance between the models. The logistic regression model achieved F1 scores of between 0.05 and 0.54 across the six categories. The SVM model achieved better results, with F1 scores between 0.14 and 0.61. Interestingly, for both models the "WS" (wit) class achieved the highest F1 score, indicating it was the easiest to classify. In both models, "IR" (irony) and "EX" (exaggeration) achieved the lowest scores, indicating difficulty in classifying those types of joke.
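Per-class F1 scores such as these can be obtained with f1_score(average=None), which returns one score per label. A minimal sketch, assuming a fitted label encoder and SVM model as in Section 4.2 (the validation-set variable names are illustrative):

from sklearn.metrics import f1_score

# Predict on a held-out validation split (X_val_tfidf and y_val are illustrative names)
y_pred = svm_model.predict(X_val_tfidf)

# One F1 score per encoded class, in the order used by the label encoder
per_class_f1 = f1_score(y_val, y_pred, average=None,
                        labels=list(range(len(label_encoder.classes_))))

for name, score in zip(label_encoder.classes_, per_class_f1):
    print(f"{name}: {score:.2f}")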
6. Task 3: Experimental Setup

6.1. Data Preparation

The data provided consisted of three JSON files: "translation EN FR train input", consisting of 1,406 jokes in English; "translation EN FR train qrels", containing 5,839 jokes in French; and "task3 2024 test", consisting of 4,502 rows of jokes in English. There were also two .tsv files, "joker translation EN-FR train input.tsv" and "joker translation EN-FR train qrels.tsv". In the parent folder, there was also a JSON file, "joker translation test", and "joker translation test.tsv".

6.2. Method

First we merged our data to create a unified data frame containing the English and French jokes, joined on the "id_en" variable. We loaded the Hugging Face "transformers" library and used the MarianMTModel and MarianTokenizer classes with the pre-trained "Helsinki-NLP/opus-mt-en-fr" model, after first trying EasyNMT and finding better results with the Helsinki model (Appendix E). We iterated through the whole list of English jokes, translating each to French, then decoded the output tokens back to text before saving it in a new column alongside the originals (Appendix F).

7. Task 3: Experimental Results

The translations appeared to be successful; however, one drawback of our method was that the process was highly time-intensive, suggesting that more practical methods may be available.
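One likely source of this overhead is that the loop in Appendix F calls model.generate once per joke. A way to reduce it is to tokenize and translate in batches; the sketch below illustrates this under the assumption that the jokes fit in memory, and the batch size and variable names are illustrative rather than the setup we actually ran.

# Batched translation sketch: translate several jokes per generate() call
batch_size = 16
english_jokes = df_test_data['text_en'].tolist()
translations = []

for i in range(0, len(english_jokes), batch_size):
    batch = english_jokes[i:i + batch_size]
    # Pad the batch to a common length so it can be processed as one tensor
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Batching amortises the fixed per-call overhead across sentences; moving the model and inputs to a GPU, where available, should reduce the runtime further.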
8. Conclusion

We can conclude that our techniques are effective at identifying the presence of a joke, but rather less effective at classifying jokes into one of our six classes. Thanks to the existence of pre-trained models tailored to the task, we can also conclude that the translation of our jokes from English to French was successful.

Acknowledgments

We would like to extend our gratitude to the University of Brest for organising the Blended Intensive Programme (BIP) AI For Humanities. We would also like to thank Liana Ermakova for her teaching of the course and Caroline L'haridon for her support during our stay in Brest.

References

[1] C. Jones, B. K. Bergen, People cannot distinguish GPT-4 from a human in a Turing test, arXiv, 2024. URL: https://arxiv.org/abs/2405.08007.
[2] A. Baranov, V. Kniazhevsky, P. Braslavski, You told me that joke twice: A systematic investigation of transferability and robustness of humor detection models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13701–13715.
[3] F. Góes, P. Sawicki, M. Grześ, D. Brown, M. Volpe, Is GPT-4 good enough to evaluate jokes?, 2024.
[4] J. Hessel, A. Marasović, J. D. Hwang, L. Lee, J. Da, R. Zellers, R. Mankoff, Y. Choi, Do androids laugh at electric sheep? Humor "understanding" benchmarks from the New Yorker caption contest, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 688–714.
[5] S. Jentzsch, K. Kersting, ChatGPT is fun, but it is not funny! Humor is still challenging large language models, 2023.
[6] B. M. Savage, H. L. Lujan, R. R. Thipparthi, S. E. DiCarlo, Humour, laughter, learning, and health! A brief review, Advances in Physiology Education (2017).
[7] D. Delabastita, Wordplay as a Translation Problem: A Linguistic Perspective, De Gruyter Mouton, 2004.
[8] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of CLEF 2024 JOKER track on automatic humor analysis, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.

Appendix A

results = []

# Iterate over each test query
for index, test_query in data_test_queries.iterrows():
    query_id = test_query['qid']
    query_text = test_query['query']

    # Calculate relevance for each joke in the corpus with this query
    scores = []
    for _, joke in data_corpus.iterrows():
        if joke['text'] is None:
            continue
        else:
            text_all = query_text + " " + joke['text']
            vectorized_text = tfidf_vectorizer.transform([text_all])
            relevance_score = model.predict_proba(vectorized_text)[0, 1]
            scores.append({'docid': joke['docid'], 'score': relevance_score})

Appendix B

# Sort jokes by relevance score in descending order
scores.sort(key=lambda x: x['score'], reverse=True)

# Prepare output JSON format
for rank, score_info in enumerate(scores, start=1):
    results.append({
        'run_id': "Tomislav&Rowan_task_1_TFIDF",
        'manual': 0,
        'rank': rank,
        'score': score_info['score'],
        'docid': score_info['docid'],
        'qid': query_id
    })

with open('result_joker_task_1.json', 'w') as outfile:
    json.dump(results, outfile, indent=4)

Appendix C

# Preprocessing function
from nltk.stem import WordNetLemmatizer
import contractions
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords

lem = WordNetLemmatizer()

def preprocess_text(text):
    sms = contractions.fix(str(text))  # expanding shortened words (e.g. "I'm" to "I am")
    sms = sms.lower()  # lowercasing the message
    sms = re.sub(r'https?://\S+|www\.\S+', "", sms).strip()  # removing URLs
    sms = re.sub("[^a-z]", " ", sms)  # removing symbols and numbers (keeping only characters a-z)
    sms = sms.split()  # splitting into words
    # lemmatization and stopword removal
    sms = [lem.lemmatize(word) for word in sms if word not in set(stopwords.words("english"))]
    sms = " ".join(sms)
    return sms

X = df_train["text"].apply(preprocess_text)

Appendix D

df_bayes_test = pd.DataFrame(test_data)

# Apply text preprocessing
df_bayes_test['clean_text'] = df_bayes_test['text'].apply(preprocess_text)

# TF-IDF vectorization for test data
X_test_tfidf = tfidf_vectorizer.transform(df_bayes_test['clean_text'])

# Predict
bayes_predictions = naive_bayes_model.predict(X_test_tfidf)

# Convert back to original names
bayes_predicted_classes = label_encoder.inverse_transform(bayes_predictions)
Appendix E

from transformers import MarianMTModel, MarianTokenizer

# Load pre-trained MarianMT model and tokenizer for English to French translation
model_name = "Helsinki-NLP/opus-mt-en-fr"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Define input text
input_text = "Translate this text to French."

# Tokenize input text
inputs = tokenizer(input_text, return_tensors="pt")

# Perform translation
outputs = model.generate(**inputs)

# Decode translated output
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print translated text
print("Translated text:", translated_text)

Appendix F

# Assuming the test data has already been loaded into a DataFrame df_test_data
results = []

# Translate jokes
for _, row in df_test_data.iterrows():
    # Translate each row's English text to French
    translation = model.generate(**tokenizer(row['text_en'], return_tensors="pt", padding=True))
    translated_text = tokenizer.decode(translation[0], skip_special_tokens=True)

    # Append the translation result to the results list
    results.append({
        'run_id': "Tomislav&Rowan_task_3_MarianMTModel",
        'manual': 0,
        'id_en': row['id_en'],
        'text_fr': translated_text
    })

# Convert results list to DataFrame
translated_df = pd.DataFrame(results)

# Print or use the translated DataFrame as needed
print(translated_df)
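For completeness, the Task 3 results list can be serialised to a JSON run file in the same way as the Task 1 run in Appendix B; the file name below is illustrative.

import json

# Write the translation run to a JSON file (file name is illustrative);
# ensure_ascii=False preserves accented French characters in the output
with open('result_joker_task_3.json', 'w') as outfile:
    json.dump(results, outfile, indent=4, ensure_ascii=False)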