CLEF 2024 JOKER Tasks 1-3: Humour Identification and Classification
Notebook for the JOKER Lab at CLEF 2024

Rowan Mann¹, Tomislav Mikulandric²
¹ Christian-Albrechts-Universität zu Kiel (CAU), Christian-Albrechts-Platz 4, 24118 Kiel, Germany
² The University of Split, Ul. Ruđera Boškovića 31, 21000 Split, Croatia

Abstract
The CLEF 2024 JOKER track focuses on the automatic processing of wordplay through three tasks: humour-aware information retrieval, humour classification by genre and technique, and translation of puns from English to French. Recent advancements in Large Language Models (LLMs) have enhanced their conversational abilities, yet they still struggle with humour detection and generation. Addressing this gap can significantly improve human-computer interactions. For Task 1, we implemented a TF-IDF vectorizer and logistic regression model to identify and rank humorous sentences. The model achieved an F1 score of 0.93 for non-puns and 0.73 for puns, indicating robust performance in humour detection. In Task 2, we classified jokes into six categories using logistic regression, naive Bayes, and support vector machines. The SVM model performed best, with F1 scores ranging from 0.14 to 0.61, showing particular efficacy in classifying wit. Task 3 involved translating puns using the MarianMT model from the Hugging Face library. Although successful, the process was time-intensive, suggesting the need for more efficient methods. Overall, our approaches demonstrated effective humour identification and translation capabilities but faced challenges in genre-specific classification. This research underscores the importance of improving LLMs' humour processing abilities for better human-computer interaction.

Keywords
Large Language Models, Humour identification, Humour classification, TF-IDF

1. Introduction

In recent years, Large Language Models (LLMs) have grown enormously in capability. Since the release of GPT-4 in 2023, the world has quickly become accustomed to interacting with LLMs through chat interfaces that make users feel as though they are truly conversing with the machine. A 2024 study applied the Turing test to participants interacting with GPT-4, finding that humans incorrectly judged it to be human 54 percent of the time, showing just how convincingly LLMs can now emulate human conversation [1]. As impressive as this is, LLMs are still found lacking in several areas, humour being one of them. Large language models struggle to reliably detect and explain humour [2, 3, 4] and to generate novel jokes [5]. For humans, humour plays a central role in forming relationships and can enhance performance and motivation [6]. Giving LLMs a good grasp of humour therefore has the potential to massively improve the success of human-computer interactions.

Puns are a form of humour based on wordplay. Puns usually exploit the double meanings of words, or the similarity of sounds between different words, to create a humorous or witty effect, with frequent use of double entendres, homophones, and similar-sounding words. Word pairs that lend themselves to puns include "profit" and "prophet", or "check" and "Czech".
Perhaps what makes humour analysis such a challenge for LLMs is the many subtly different forms it can take. Humour can be on-the-nose, physical, awkward, subtle, obvious, visual, childish, or intelligent; it is notoriously hard to define, which presents problems for LLMs. Wordplay, by its very nature, exploits the intrinsic structure of the source language and characteristics for which it may be difficult or impossible to find replacements or analogues in the target language [7]. Translating jokes between languages therefore requires more than simply replacing each word with its counterpart in the target language. The deeper understanding of context this requires is an area where LLMs could excel, if suitable methods are developed.

The JOKER track of CLEF 2024 aims to develop interdisciplinary approaches to the automatic processing of wordplay [8]. This year, the JOKER track is split into three tasks:

• Task 1: Humour-aware information retrieval.
• Task 2: Humour classification according to genre and technique.
• Task 3: Translation of puns from English to French.

Can LLMs succeed in identifying, classifying, and translating humour? In this paper we explore this question, detailing our workflow and results for the three tasks of the JOKER track.

2. Task 1: Experimental Setup

2.1. Data Description

The data provided consisted of four JSON files: a "corpus" file containing a list of 61,268 pun and non-pun sentences, "queries test" and "queries train" files listing the keyword linked to each sentence, and a "qrels train" file indicating whether each sentence was a pun ("1") or not ("0").

2.2. Method

Our first step was to merge the data, creating a table with five columns: "qid", "docid", "qrel", "text", and "query". We then used a TF-IDF vectorizer to train the model on all of our text and the corresponding "qrel" values.

# Combine query text and joke text into a single column for the TF-IDF vectorizer
data_merged['text_all'] = data_merged['query'] + " " + data_merged['text']

# Fit and transform the combined text
tfidf_matrix = tfidf_vectorizer.fit_transform(data_merged['text_all'])

We then created a logistic regression model from our training data and used it to make predictions on our "queries test" data.

from sklearn.linear_model import LogisticRegression

# Logistic regression model
model = LogisticRegression()

# Trained model
trained_model = model.fit(X_train, y_train)

The model returned a relevance score for each joke in the corpus for each query (Appendix A). Based on these scores we produced a JSON file listing the "best", or rather most relevant, jokes (Appendix B).

3. Task 1: Experimental Results

The results obtained from our model were very positive. Calculating F1 scores for our model produced 0.93 for the "0" (no pun) class and 0.73 for the "1" (pun) class. This indicates that it performed extremely well, with high precision and recall, at identifying sentences without puns, and moderately well at identifying those with puns.
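For reference, per-class F1 scores of this kind can be computed with scikit-learn's classification_report on a held-out slice of the labelled training data. The following is a minimal sketch rather than our exact notebook code; the names texts and qrels stand in for the merged text column and the 0/1 labels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hold out part of the labelled training data for evaluation
# (texts and qrels are illustrative names for the merged text and 0/1 labels)
X_tr, X_val, y_tr, y_val = train_test_split(texts, qrels, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
X_tr_tfidf = vectorizer.fit_transform(X_tr)
X_val_tfidf = vectorizer.transform(X_val)

model = LogisticRegression()
model.fit(X_tr_tfidf, y_tr)

# Per-class precision, recall and F1 for the "no pun" (0) and "pun" (1) classes
print(classification_report(y_val, model.predict(X_val_tfidf), target_names=['no pun', 'pun']))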
4. Task 2: Experimental Setup

4.1. Data Description

The data provided consisted of three JSON files. "classification test data" and "classification train input data" each contained a "text" column holding thousands of jokes. The labels for our training data were provided in the file "classification train qrels data", which categorised each joke into one of six classes: IR (irony), SC (sarcasm), EX (exaggeration), AID (incongruity), SD (self-deprecating), and WS (wit).

4.2. Method

We merged the training data JSONs to create a dataframe with the text and classes side by side. We preprocessed the text by expanding contractions, lowercasing all letters, and removing special characters and URLs, then replaced the original text with the cleaned text in the data frame containing the training labels (Appendix C).

We encoded our class labels as numbers and used TF-IDF to vectorise the text. The data was then used to train a logistic regression model, a naive Bayes model, and a support vector machine model.

# Train logistic regression model
logistic_regression_model = LogisticRegression(max_iter=1000)
logistic_regression_model.fit(X_train_tfidf, y_train)

# Train naive Bayes model
naive_bayes_model = MultinomialNB()
naive_bayes_model.fit(X_train_tfidf, y_train)

# Train SVM model
svm_model = SVC(kernel='linear')  # other kernels such as 'poly' or 'rbf' can also be specified
svm_model.fit(X_train_tfidf, y_train)

With our models trained, we made predictions on our test data, which we preprocessed in the same way as the training data. We then converted the predicted classes back from numbers to their original names (Appendix D).

5. Task 2: Experimental Results

The results of our F1 testing reveal large differences in performance between the models. The logistic regression model achieved F1 scores of between 0.05 and 0.54 across the six categories. The SVM model achieved better results, with F1 scores between 0.14 and 0.61. Interestingly, for both models the "WS" (wit) class achieved the highest F1 score, indicating it was the easiest to classify. In both models, "IR" (irony) and "EX" (exaggeration) achieved the lowest scores, indicating difficulty in classifying those types of joke.
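Per-class F1 scores such as these can be obtained with f1_score(average=None), which returns one score per label. A minimal sketch, assuming a fitted label encoder and SVM model as in Section 4.2 (the validation-set variable names are illustrative):

from sklearn.metrics import f1_score

# Predict on a held-out validation split (X_val_tfidf and y_val are illustrative names)
y_pred = svm_model.predict(X_val_tfidf)

# One F1 score per encoded class, in the order used by the label encoder
per_class_f1 = f1_score(y_val, y_pred, average=None,
                        labels=list(range(len(label_encoder.classes_))))

for name, score in zip(label_encoder.classes_, per_class_f1):
    print(f"{name}: {score:.2f}")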
6. Task 3: Experimental Setup

6.1. Data Preparation

The data provided consisted of three JSON files: "translation EN FR train input", consisting of 1,406 jokes in English; "translation EN FR train qrels", containing 5,839 jokes in French; and "task3 2024 test", consisting of 4,502 rows of jokes in English. There were also two .tsv files, "joker translation EN-FR train input.tsv" and "joker translation EN-FR train qrels.tsv". In the parent folder, there was also a JSON file, "joker translation test", and "joker translation test.tsv".

6.2. Method

First we merged our data to create a unified data frame containing the English and French jokes, joined on the "id_en" variable. We loaded the Hugging Face "transformers" library and used the MarianMTModel and MarianTokenizer classes with the pre-trained "Helsinki-NLP/opus-mt-en-fr" model, after first trying EasyNMT and finding better results with the Helsinki model (Appendix E). We iterated through the whole list of English jokes, translating each to French, then decoded the output tokens back to text before saving it in a new column alongside the originals (Appendix F).

7. Task 3: Experimental Results

The translations appeared to be successful; however, one drawback of our method was that the process was highly time-intensive, suggesting that more practical methods may be available.
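One likely source of this overhead is that the loop in Appendix F calls model.generate once per joke. A way to reduce it is to tokenize and translate in batches; the sketch below illustrates this under the assumption that the jokes fit in memory, and the batch size and variable names are illustrative rather than the setup we actually ran.

# Batched translation sketch: translate several jokes per generate() call
batch_size = 16
english_jokes = df_test_data['text_en'].tolist()
translations = []

for i in range(0, len(english_jokes), batch_size):
    batch = english_jokes[i:i + batch_size]
    # Pad the batch to a common length so it can be processed as one tensor
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Batching amortises the fixed per-call overhead across sentences; moving the model and inputs to a GPU, where available, should reduce the runtime further.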
8. Conclusion

We can conclude that our techniques are effective at identifying the presence of a joke, but rather less effective at classifying jokes into one of our six classes. Thanks to the existence of pre-trained models tailored to the task, we can also conclude that the translation of our jokes from English to French was successful.

Acknowledgments

We would like to extend our gratitude to the University of Brest for organising the Blended Intensive Programme (BIP) AI For Humanities. We would also like to thank Liana Ermakova for her teaching of the course and Caroline L'haridon for her support during our stay in Brest.

References

[1] C. Jones, B. K. Bergen, People cannot distinguish GPT-4 from a human in a Turing test, arXiv, 2024. URL: https://arxiv.org/abs/2405.08007.
[2] A. Baranov, V. Kniazhevsky, P. Braslavski, You told me that joke twice: A systematic investigation of transferability and robustness of humor detection models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 13701–13715.
[3] F. Góes, P. Sawicki, M. Grześ, D. Brown, M. Volpe, Is GPT-4 good enough to evaluate jokes?, 2024.
[4] J. Hessel, A. Marasović, J. D. Hwang, L. Lee, J. Da, R. Zellers, R. Mankoff, Y. Choi, Do androids laugh at electric sheep? Humor "understanding" benchmarks from the New Yorker caption contest, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 688–714.
[5] S. Jentzsch, K. Kersting, ChatGPT is fun, but it is not funny! Humor is still challenging large language models, 2023.
[6] B. M. Savage, H. L. Lujan, R. R. Thipparthi, S. E. DiCarlo, Humour, laughter, learning, and health! A brief review, Advances in Physiology Education (2017).
[7] D. Delabastita, Wordplay as a Translation Problem: A Linguistic Perspective, De Gruyter Mouton, 2004.
[8] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of CLEF 2024 JOKER track on automatic humor analysis, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.

Appendix A

results = []

# Iterate over each test query
for index, test_query in data_test_queries.iterrows():
    query_id = test_query['qid']
    query_text = test_query['query']

    # Calculate relevance for each joke in the corpus with this query
    scores = []
    for _, joke in data_corpus.iterrows():
        if joke['text'] is None:
            continue
        else:
            text_all = query_text + " " + joke['text']
            vectorized_text = tfidf_vectorizer.transform([text_all])
            relevance_score = model.predict_proba(vectorized_text)[0, 1]
            scores.append({'docid': joke['docid'], 'score': relevance_score})

Appendix B

# Sort jokes by relevance score in descending order
scores.sort(key=lambda x: x['score'], reverse=True)

# Prepare output JSON format
for rank, score_info in enumerate(scores, start=1):
    results.append({
        'run_id': "Tomislav&Rowan_task_1_TFIDF",
        'manual': 0,
        'rank': rank,
        'score': score_info['score'],
        'docid': score_info['docid'],
        'qid': query_id
    })

with open('result_joker_task_1.json', 'w') as outfile:
    json.dump(results, outfile, indent=4)

Appendix C

# Preprocessing function
from nltk.stem import WordNetLemmatizer
import contractions
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords

lem = WordNetLemmatizer()

def preprocess_text(text):
    sms = contractions.fix(str(text))  # expanding shortened words (e.g. "I'm" to "I am")
    sms = sms.lower()  # lowercasing the message
    sms = re.sub(r'https?://\S+|www\.\S+', "", sms).strip()  # removing URLs
    sms = re.sub("[^a-z]", " ", sms)  # removing symbols and numbers (keeping only characters a-z)
    sms = sms.split()  # splitting into words
    # lemmatization and stopword removal
    sms = [lem.lemmatize(word) for word in sms if word not in set(stopwords.words("english"))]
    sms = " ".join(sms)
    return sms

X = df_train["text"].apply(preprocess_text)

Appendix D

df_bayes_test = pd.DataFrame(test_data)

# Apply text preprocessing
df_bayes_test['clean_text'] = df_bayes_test['text'].apply(preprocess_text)

# TF-IDF vectorization for test data
X_test_tfidf = tfidf_vectorizer.transform(df_bayes_test['clean_text'])

# Predict
bayes_predictions = naive_bayes_model.predict(X_test_tfidf)

# Convert back to original names
bayes_predicted_classes = label_encoder.inverse_transform(bayes_predictions)
Appendix E

from transformers import MarianMTModel, MarianTokenizer

# Load pre-trained MarianMT model and tokenizer for English to French translation
model_name = "Helsinki-NLP/opus-mt-en-fr"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Define input text
input_text = "Translate this text to French."

# Tokenize input text
inputs = tokenizer(input_text, return_tensors="pt")

# Perform translation
outputs = model.generate(**inputs)

# Decode translated output
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print translated text
print("Translated text:", translated_text)

Appendix F

# Assuming the test data has already been loaded into a DataFrame df_test_data
results = []

# Translate jokes
for _, row in df_test_data.iterrows():
    # Translate each row's English text to French
    translation = model.generate(**tokenizer(row['text_en'], return_tensors="pt", padding=True))
    translated_text = tokenizer.decode(translation[0], skip_special_tokens=True)

    # Append the translation result to the results list
    results.append({
        'run_id': "Tomislav&Rowan_task_3_MarianMTModel",
        'manual': 0,
        'id_en': row['id_en'],
        'text_fr': translated_text
    })

# Convert results list to DataFrame
translated_df = pd.DataFrame(results)

# Print or use the translated DataFrame as needed
print(translated_df)
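For completeness, the Task 3 results list can be serialised to a JSON run file in the same way as the Task 1 run in Appendix B; the file name below is illustrative.

import json

# Write the translation run to a JSON file (file name is illustrative);
# ensure_ascii=False preserves accented French characters in the output
with open('result_joker_task_3.json', 'w') as outfile:
    json.dump(results, outfile, indent=4, ensure_ascii=False)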