<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Emma Schuurman, Mick Cazemier, Luc Buijs and Jaap Kamps</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>This paper reports on the University of Amsterdam's participation in the CLEF 2024 Joker track. Our overall goal is to investigate non-literal use of language, such as humor and wordplay, which still challenges current information retrieval and natural language processing technology. Our specific focus is to investigate how an effective wordplay detector can be used to select humorous search results or candidate translations, within the context of the track's humor retrieval, classification, and translation tasks. Our main findings are the following. First, standard ranking approaches are effective for retrieving relevant sentences given a query, but a pun classification filter is effective for selecting humorous results. Second, a BERT encoder based classifier obtains reasonable performance in classifying different aspects of humor, with some distinctions being hard for both models and humans. Third, sequence-to-sequence machine translation models provide high-quality descriptive translations, yet preserving the wordplay across languages remains challenging. More generally, we revisited the CLEF 2023 Joker Track's Pun Detection task, and were able to build effective neural pun classifiers. The value of these classifiers was demonstrated as a filter on the results of a standard ranker for the Humor-aware IR task of the CLEF 2024 Joker Track.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Storage and Retrieval</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Wordplay translation</kwd>
        <kwd>Humor retrieval</kwd>
        <kwd>Humor classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The CLEF 2024 Joker track investigates possible solutions to the challenges of automated analysis and
processing of humor. The Joker track series aims to advance the development of interpretation,
generation and translation of wordplay, by bringing together computer scientists, linguists and translators.
The CLEF 2024 Joker Track builds upon the findings from last year’s edition. The CLEF 2023 Joker track
results have shown that wordplay detection, localization and translation remain a challenge for state-of-the-art
systems. The CLEF 2024 Joker track reuses the corpus previously used for pun detection, and
creates a new task on humor-aware information retrieval. The CLEF 2024 Joker track also introduces an
entirely new task on humor classification, and continues the important pun translation task.</p>
      <p>
        Our main approach also builds on the CLEF 2023 Joker Track, as we revisit last year’s task on pun
detection. We conduct an extensive analysis of the three tasks of the track: Task 1 on Humor-aware
Information Retrieval; Task 2 on Humor Classification; and Task 3 on Pun Translation. For details on the
exact track setup, we refer to the Track Overview paper in the CLEF 2024 LNCS proceedings [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as well as
the detailed task overviews in the CEUR proceedings [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>Our main aim is to investigate how an effective wordplay detector can be used to select humorous
search results or candidate translations, within the context of the track’s humor retrieval, classification,
and translation tasks. Specifically, our idea is to build an effective pun detector, and use it to filter out
those results that are wordplay or puns. For example, as discussed in detail below, we can use a pun
detector to filter wordplay from the results of a standard search engine focusing on topical relevance
only. In the same way, it could be used to filter out the humorous text from the negatives in the
Humor Classification corpus of Task 2. And we can have standard machine translation systems generate
sets of different translations, and detect which of these preserve the wordplay.</p>
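      <p>This filter-based idea amounts to simple post-processing of a ranked list. In the sketch below, the is_pun callable is a stand-in for any trained wordplay classifier, and the toy run data is illustrative only:</p>

```python
def filter_ranked_run(ranked_docs, is_pun, depth=100):
    """Keep only documents the pun classifier accepts, preserving rank order.

    ranked_docs: list of (doc_id, text) tuples from a topical-relevance ranker.
    is_pun: any callable mapping a text to True/False -- a stand-in for a
            trained wordplay detector, not the exact model used in the paper.
    """
    return [(doc_id, text) for doc_id, text in ranked_docs[:depth] if is_pun(text)]

# Toy usage with a keyword stub in place of a real classifier.
run = [("d1", "a plain fact about whales"), ("d2", "a whale-y good pun")]
filtered = filter_ranked_run(run, is_pun=lambda t: "pun" in t)
```

      <p>The same selection logic applies unchanged whether the candidates come from a retrieval run, a classification corpus, or a set of generated translations.</p>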
      <p>The rest of this paper is structured as follows. Next, in Section 2 we discuss our experimental setup
and the specific runs submitted. Section 3 discusses the results of our runs. Section 4 provides a detailed
analysis of pun detection and topical relevance versus humor-aware IR for each task. We end in Section 5
by discussing our results and outlining the lessons learned.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Setup</title>
      <p>In this section, we will detail our approach for the three CLEF 2024 Joker track tasks, as well as for the
CLEF 2023 Joker track pun localization task.</p>
      <p>
        For details of the exact task setup and results we refer the reader to the detailed overview of the track
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The basic ingredients of the track are:
      </p>
      <p>Corpus For Task 1, there is a large corpus of 61,268 documents (usually a single sentence each) for the
retrieval task.</p>
      <p>Train Data For Task 1, there are 12 train queries with relevance judgments (between 5 and 452
judgments per query, and between 4 and 281 relevant per query).</p>
      <p>For Task 2, there are 1,742 sentences in the training set all labeled as either ’SC’, ’EX’, ’WS’, ’SD’,
’AID’, ’IR’, or ’WT’. These labels represent the type of humor that the sentence contains.
For Task 3, there are 1,405 English wordplays, with a total of 5,838 professional human French
translations.</p>
      <p>Test Data For Task 1, there are 57 test queries. These include the train queries, so there are a total of
45 unseen queries on which the test evaluation is based. For these unseen queries there is a total
of 1,168 relevant documents, or an average of 26 per query.</p>
      <p>For Task 2, there are 6,642 unlabeled sentences that contain one of the earlier described types of
humor. In the final test evaluation set, there are 722 sentences with one of the labels on humor
genre and technique.</p>
      <p>For Task 3, there are 4,501 English wordplays. In the final test evaluation set there are 376 source
sentences, and 834 human reference translations into French by professional translators.</p>
      <p>We created runs for all three tasks of the 2024 track and the localization task of the 2023 track,
which we will discuss in order.</p>
      <sec id="sec-2-1">
        <title>2.1. Task 1: Humor-aware Information Retrieval</title>
        <p>This task asks to retrieve short humorous texts for a query. We submitted eight runs in total, shown in
Table 1.</p>
        <p>
          Baseline Rankers We first submitted four baseline runs focusing on regular information retrieval
effectiveness. Two are vanilla baseline runs on an Anserini index built with Pyserini (https://github.com/castorini/pyserini), using either BM25 or BM25+RM3
with default settings [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The other two runs are neural cross-encoder rerankings of these runs, based
on zero-shot application of an MS MARCO trained ranker (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2), reranking the top 100 of either the BM25 or
the BM25+RM3 baseline run. We submitted four runs aiming to take the pun detection of the results
into account.
        </p>
        <p>SimpleT5 SimpleT5, built on top of PyTorch Lightning and Transformers, streamlines the training of
T5 models for different NLP tasks [7]. T5 stands for Text-To-Text Transfer Transformer, which means
that this model takes text as input and produces new text as output. This text-to-text structure makes
it possible to apply the same model, decoding process and training procedure to different tasks like
summarization, classification and translation [8].</p>
        <p>For the detection task, we loaded a SimpleT5 model using its built-in ‘from_pretrained’ method with
the model name “t5-small”, specifying its size. To train this model, the data was preprocessed by first
merging the train and qrels files on ‘id’, to create a single file with the columns ‘text’ and ‘wordplay’.
These two columns were then renamed to ‘source_text’ and ‘target_text’. Additionally, we added the
prefix ‘Detect pun:’ to each line, since T5 models expect a task-related prefix.</p>
        <p>After preprocessing, we used train_test_split to split the data into a training and a test dataset with
90% of the data allocated for training and 10% for testing. We trained two versions of the SimpleT5
model: version 1 with a batch size of 6 and version 2 with a batch size of 8.</p>
        <p>To test the model for inference, we loaded the best-trained model by selecting the one with the lowest
validation loss. To evaluate this model’s performance, we applied ’model.predict’ on each sentence of
the test dataset, and then compared the output to its actual label.</p>
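        <p>The preprocessing and split described above can be sketched as follows. The toy data frames are illustrative rather than the exact files used, and a pandas-based shuffle stands in for train_test_split:</p>

```python
import pandas as pd

# Toy stand-ins for the train and qrels files (the real files are far larger).
train = pd.DataFrame({"id": [1, 2], "text": ["A pun here.", "No joke."]})
qrels = pd.DataFrame({"id": [1, 2], "wordplay": ["yes", "no"]})

# Merge on 'id' into a single frame with 'text' and 'wordplay' columns,
# then rename them to the column names SimpleT5 expects.
df = train.merge(qrels, on="id")
df = df.rename(columns={"text": "source_text", "wordplay": "target_text"})

# T5 models expect a task-related prefix on the input side.
df["source_text"] = "Detect pun: " + df["source_text"]

# 90/10 train/test split: shuffle, then slice (stand-in for train_test_split).
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
cut = int(0.9 * len(df))
train_df, test_df = df.iloc[:cut], df.iloc[cut:]
```

        <p>The resulting train_df and test_df frames can be passed directly to SimpleT5-style training and evaluation loops.</p>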
        <p>BERT The Bidirectional Encoder Representations from Transformers (BERT) model [9] is another
NLP model based on the transformer architecture. This model has been widely used and over 150
studies have been done on the model [10] (as of 2020). One major advantage of the BERT model is that
a pretrained model can be finetuned with just one additional output layer to create models for a variety
of tasks.</p>
        <p>For the detection task, “AutoModelForSequenceClassification” was used to load a
pretrained BERT model for sequence classification. We did this using the “bert-base-uncased” model.
The data was preprocessed similarly to the T5 model, except that the prefix wasn’t added, since that
is unnecessary for the BERT model. We split the data using the same 90% of the data as training data.
Additionally, the “AutoTokenizer” function was used to tokenize the data using the tokenizer associated
with the BERT model, and the data was batched using “DataCollatorWithPadding.”</p>
        <p>To train the model for this specific dataset, we used Low-Rank Adaptation (LoRA)
[11]. This approach greatly reduced the number of trainable parameters and the time needed to retrain
the model, by freezing the pre-trained model weights and injecting trainable rank decomposition matrices
into selected layers of the Transformer architecture. Furthermore, the training parameters were refined
through iterative adjustments and empirical evaluation, wherein different values were tested to assess
their impact on performance.</p>
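        <p>The parameter saving behind LoRA can be illustrated numerically: the frozen weight W of shape d×k is adapted only through a rank-r product BA, cutting the trainable count from d·k to r·(d+k). A small NumPy sketch (the dimensions are illustrative, not our exact configuration):</p>

```python
import numpy as np

d, k, r = 768, 768, 8          # a typical hidden size, with a small LoRA rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))      # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01
B = np.zeros((d, r))                 # B starts at zero, so the adapted layer
                                     # initially behaves exactly like W

def effective_weight(W, B, A):
    # The adapted layer uses W + B @ A; only A and B receive gradients.
    return W + B @ A

full_params = d * k                  # trainable parameters without LoRA
lora_params = r * (d + k)            # trainable parameters with LoRA
```

        <p>Here full fine-tuning would update 589,824 values per layer, while the rank-8 adaptation trains only 12,288, which is the source of the reduced training time noted above.</p>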
        <p>To evaluate the performance, the model was evaluated on the test split of the dataset. The evaluation
mostly focused on improving the F1 score, using accuracy as a secondary performance metric.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 2: Humor Classification</title>
        <p>This task asks to classify text according to genre and technique. We submitted a single run, also shown
in Table 1.</p>
        <p>BERT The same BERT model was retrained for Task 2 using the same data preprocessing as was used
for Task 1. The only difference was that the number of classes was increased from two (“yes” and “no”
for whether the sentence contained a pun) to seven, for the different types of humor that a sentence
could contain.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Task 3: Pun Translation</title>
        <p>This task asks to translate puns from English to French. We submitted two runs, shown in Table 1.</p>
        <p>MarianMT MarianMT is a sequence-to-sequence (Seq2Seq) model based on the Marian framework.
Marian, first introduced in 2017, is written entirely in C++, which supports faster training and
translation [12, 13]. MarianMT provides pre-trained models that are smaller than most other translation
models, about 298 MB on disk, compared to other transformer-based translation models that exceed
1 GB [14, 15]. The size of MarianMT makes the model useful for fine-tuning on custom datasets for
specific tasks.</p>
        <p>For the translation task, the MarianMTModel and MarianTokenizer were loaded from the transformers
library, using the model checkpoint “Helsinki-NLP/opus-mt-en-fr”. Before training, the data was
preprocessed by merging the input and qrels files on “id_en”, to create a single csv file. The columns
were renamed to “English” and “French”; no prefix was needed. The preprocessed data was divided
with train_test_split into a 90/10 split of training and test data, after which the training
data was further divided into training and validation sets using an 80/20 split. Additionally, the data was
batched by setting the ’per_device_train_batch_size’ parameter in the Seq2SeqTrainingArguments
to 8. The evaluation is performed on the validation set at the end of each epoch, computing
the BLEU score as the evaluation metric.</p>
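        <p>The nested 90/10 and 80/20 splits amount to a 72/18/10 division overall. A minimal sketch of the size arithmetic, with plain Python slicing standing in for train_test_split and a toy list of 100 items:</p>

```python
def nested_split(items, test_frac=0.10, val_frac=0.20):
    """First hold out test_frac of the data for testing, then carve val_frac
    of the remainder off as a validation set (mirrors the 90/10 + 80/20 scheme)."""
    n_test = int(len(items) * test_frac)
    rest, test = items[n_test:], items[:n_test]
    n_val = int(len(rest) * val_frac)
    train, val = rest[n_val:], rest[:n_val]
    return train, val, test

train, val, test = nested_split(list(range(100)))
# Overall fractions: 72% train, 18% validation, 10% test.
```

        <p>In practice the shuffling and stratification would be handled by train_test_split itself; the sketch only makes the resulting proportions explicit.</p>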
        <p>T5-base As stated in previous paragraphs, a T5 model can be used for different NLP tasks. It is suitable
for machine translation due to its ability to understand natural language and generate contextually
relevant information [16]. The ‘T5ForConditionalGeneration’ class and the model name ‘t5-base’ were used
to load the T5-base model for English to French translation.</p>
        <p>The preprocessing of the data was done similarly to the preprocessing for the MarianMT model. The
split of the test, validation and train set was also done in the same manner. The ‘T5Tokenizer’ was used
to tokenize the data before training. Training was done with the same number of epochs, batch size
and evaluation metric as used for MarianMT.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <p>In this section, we will present the results of our experiments, in three self-contained subsections
following the CLEF 2024 Joker Track tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Task 1: Humor-aware Information Retrieval</title>
        <p>We discuss our results for Task 1, asking to retrieve short humorous texts for a query.</p>
        <p>Table 2 shows the performance of the Task 1 submissions on the train data. We submitted four runs
focusing purely on standard retrieval effectiveness. First, the two Anserini baselines using BM25 with
or without RM3 query expansion also perform reasonably well for the pun retrieval task. The RM3 model
outperforms vanilla BM25 on almost all measures, with higher initial precision. This highlights the fact
that the pun retrieval task still requires puns to be “relevant” to the topic, and hence that focusing purely
on topical relevance provides a reasonable baseline approach. Second, the zero-shot reranking with a
cross-encoder does not lead to an improvement of retrieval effectiveness. However, some of these runs
have high fractions of unjudged documents in the top of the ranking, up to 50% of the top 10 results.
This is a call for caution in interpreting these scores as reliable performance estimates. However, we
expect that the puns relevant for this task are judged, and hence that the neural reranker is particularly
attracting topically relevant non-pun passages. We will analyze this in more detail in Section 4.3 below.</p>
        <p>We also submitted four runs post-processing the relevance-only rankings with different pun classifiers.
First, the BERT pun classifier applied to the BM25 baseline does lead to slightly better results when
compared to the base BM25 model. The model labels 76% of the passages as puns, and thus 76% is kept.
This is the suspected reason for the small increase in performance. When applied to the RM3 baseline,
the BERT filter causes a much larger increase in performance, likely because for this model only 46%
of the passages are labeled as puns. Second, we applied the two different versions of the SimpleT5 pun
classifier on the RM3 baseline. Version 1 kept 53% of the passages and version 2 kept 43%. Comparing
the results of the RM3 baseline without the pun classifier to those with the classifier shows a significant
improvement in performance. The difference in performance between the BERT model applied on the
RM3 baseline and the SimpleT5 model is marginal.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 2: Humor Classification</title>
        <p>We continue with Task 2, asking to classify text according to genre and technique. We submitted one
run using the BERT model. This model performed reasonably well on the 10% hold-out part of the train
dataset. However, we observed that this model still struggles with recognizing certain minority classes,
which leads it to predict majority classes much more often than would be desired.</p>
        <p>We only submitted a single run, based on a simple BERT classifier trained on 90% of the released train
data. Table 4 shows the performance of the Task 2 submission on the train data (top half) and the test
data (bottom half). First, it is reassuring to observe slightly lower but similar performance on the test
data compared to the train data. Second, the performance is not very high, but still reasonable given that this
is a multi-class text classification problem with 7 (or 6) possible labels. Third, manual inspection of the
data also suggests the classification task is non-trivial, also for humans.</p>
        <p>The aim is to automatically classify text according to the following classes: incongruity-absurdity
(AID), exaggeration (EX), irony (IR), sarcasm (SC), self-deprecating (SD), and wit-surprise (WS).
Inspection of the confusion matrix (not shown) reveals a reasonable diagonal, in particular for the classes
with the largest support in the train data (in particular “WS”). The distribution of the test data differs,
with “AID” being the largest class, explaining the small drop in performance.</p>
        <p>Our model systematically misclassifies sentences labeled as “irony” as “sarcasm” and
“exaggeration.” Several of these examples seem to contain elements of irony (typically about a situation and an
opposite expectation) and of sarcasm (a form of expression, assuming the utterance appeared in some
conversational context), or elements of exaggeration in some sense.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task 3: Pun Translation</title>
        <p>We continue with Task 3, asking to translate puns from English to French. Our experiments are based
on the MarianMT and T5-base models, both focusing on general machine translation quality for English to
French.</p>
        <p>Table 5 shows the results of the CLEF 2024 Joker track’s Task 3, both on the train data (top) and
the test data (bottom). We make a number of observations. First, the general translation quality is
high, with BLEU scores ranging from 44% (test) to 69% (train) and BERTScore F1 ranging from 83%
(train) to 87% (test). Second, the larger T5-base MT model performs better than the MarianMT model,
in particular on the test set. Both models were fine-tuned on some of the train data, with precautions
against overfitting such as hold-out test and validation subsets, but the performance of the T5-base
model generalizes better. Third, the performance on the test data is lower than on the train data. This
may signal some degree of overfitting in training, but this is not clearly evident in manual inspection of
the output. One important factor affecting the absolute scores is the number of reference translations,
which is significantly higher for the train data (4.2 per pun) than for the test data (2.2 per pun).</p>
        <p>These automatic evaluation measures reflect the whole translated sentence, and are a necessary but
not sufficient condition for correctly translating the wordplay. The ground truth consists of professional
translations preserving the wordplay across languages, making the results indicative and encouraging
for pun translation, but also suggesting the value of further qualitative analysis of the output.</p>
        <p>Table 6 shows an example from the train data set. The top half of the table shows the English pun, and
the six French translations made by professional translators. We make a number of observations. First,
there is notable variation among the different translations, highlighting the complexity and the creative element
required. This also highlights the value of obtaining multiple translations from different professional
translators. Second, many of the common non-pun words are shared between the different translations,
which can lead to overemphasizing these in MT evaluation measures that run on singular references,
like BERTScore. Measures that can naturally deal with multiple references may be preferable, and
motivate the use of classic BLEU.</p>
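        <p>The advantage of multiple references can be made concrete with BLEU's clipped unigram precision, in which a candidate token is credited if any reference supports it. The following is a simplified sketch (unigram precision only, not the full BLEU with higher-order n-grams and brevity penalty), and the French strings are invented toy data:</p>

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Fraction of candidate tokens supported by at least one reference.
    Each token's count is clipped at its maximum count over the references,
    which is how BLEU naturally exploits several reference translations."""
    cand = Counter(candidate.split())
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref.split()).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand.items())
    return clipped / sum(cand.values())

refs = ["sauvez les baleines", "sauvons les baleines en tout évent"]
p = clipped_unigram_precision("sauvez les baleines en tout cas", refs)
```

        <p>A token missed by one reference but present in another still counts, so adding references can only raise the score; single-reference measures lack this property.</p>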
        <p>The bottom half of the table shows the generated translations. We again make a number of observations.
First, the overall quality of the machine translations is very impressive, both in this example
and throughout the entire output. There are no fluency or other issues, and the output captures the literal
content of the English pun adequately for understanding the topic and meaning. Second, some of the
generated translations capture the literal content of the source, but do not preserve the wordplay. For
example, the T5-base output in this case is “Save the whales,” said Tom, which is factually correct but not
a wordplay. Third, some of the generated translations preserve both the content and the wordplay.
For example, the MarianMT output in this case exactly matches one of the human professional
translations, which creatively uses a similar wordplay with the meaning of évent referring both to a
whale’s blowhole, and to “in any case.”</p>
        <p>Our analysis revealed both the quality of current machine translation, as well as the complexity of
preserving the wordplay in a literally correct translation. We also observed that the models are able
to generate creative translations preserving the wordplay, but that the most likely translation, or the
first one generated by the model, may not be a pun. This observation supports our general idea to
generate multiple translations with the model, and use an effective pun detector to choose the
translation that is most likely to preserve the wordplay. We are currently running experiments on using
beam search to generate a diverse set of translations, and using a French pun detector to select the most
promising candidate. Preliminary results demonstrate the viability of this approach.</p>
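        <p>Once candidates and a detector are available, the generate-then-filter idea reduces to simple selection logic. In this sketch, both the beam-search output and the looks_like_pun detector are stand-ins; the real setup would use the fine-tuned translation model and a trained French pun classifier:</p>

```python
def select_translation(candidates, looks_like_pun):
    """Return the highest-ranked candidate the pun detector accepts,
    falling back to the top candidate when none is flagged as wordplay.

    candidates: beam-search translations, ordered best-first.
    looks_like_pun: any callable text -> bool (a stand-in detector).
    """
    for cand in candidates:
        if looks_like_pun(cand):
            return cand
    return candidates[0]

# Toy usage: the second beam preserves the wordplay, so it is selected.
beams = ['"Sauvez les baleines", dit Tom.',
         '"Sauvons les baleines", dit Tom, en tout évent.']
chosen = select_translation(beams, looks_like_pun=lambda t: "évent" in t)
```

        <p>The fallback keeps behavior safe: when no candidate is detected as a pun, the system still emits the model's most likely translation.</p>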
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Analysis</title>
      <p>In this section, we will present further analysis, including a direct evaluation of the humor classifier
used for pun detection.</p>
      <sec id="sec-4-1">
        <title>4.1. CLEF Joker 2023 Task 1: Pun Detection Revisited</title>
        <p>We revisit the CLEF 2023 Joker Track, and in particular its Task 1 on pun detection [17]. As detailed
above, our overall approach to the Joker Track tasks is based on exploiting a pun detector to select
wordplay among candidate results. For example, in the humor retrieval setting, this allows us to
avoid topically relevant non-humorous content. Similarly, in the translation setting, this allows
us to select one of the possible translation candidates preserving the wordplay. In this section, we will
evaluate the quality of the pun detector directly, rather than in an end-to-end evaluation of the other
Joker tasks.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Approach</title>
          <p>This is essentially a classification task, with a large set of sentences of which some are wordplay and
others are linguistically similar sentences without humorous content. The dataset consisted of 5,293
English sentences, with 58% being positive examples of sentences containing a pun and 42% being
negative examples. The goal was to develop a model that could tell whether an English sentence contained a
pun or not. We used a SimpleT5 and a BERT model, as described in the experimental setup in Section 2.</p>
          <p>
            For more details on this task, we refer to the CLEF 2023 Joker Track overview paper [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], and the detailed overview of this particular task [17].
          </p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Results</title>
          <p>The performance of our pun detection models is shown in Table 7. Both models perform well in detecting
puns in English sentences. Based on these metrics, the SimpleT5 model seems to perform slightly better
in detecting puns. The results achieved are a significant improvement over those achieved in 2023; we
suspect that this has to do with the fact that we took steps to avoid overfitting. We found that our
models tended to achieve results similar to those achieved in 2023 when we did not enact such safeguards.</p>
          <p>The performance of the pun classifier is noteworthy, as it allowed us to address several of the Joker
track tasks: we used the classifier “as is” as a filter on the relevance-based retrieval results for Joker
2024 Task 1, and we are planning to filter the most promising of multiple generated translations for Joker
2024 Task 3.</p>
          <p>There is an important difference between our evaluation (on a hold-out validation set) and the
official results on a (not released) test set in the CLEF 2023 Joker Track’s overview paper [17]. The
best performing system in the track in 2023 scored an F1 of 53.61%, which is far lower than a majority
class prediction on the test data. While not tested on the exact same sentences, the performance of our pun
classifier is encouraging, and its quality has been validated by the use of these pun classifiers to address
the humorous information retrieval task of Joker 2024.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. CLEF 2023 Joker Task 2: Pun Localization Revisited</title>
        <p>We continue our quest to revisit the CLEF 2023 Joker Track, and also focus on its Task 2 on pun
localization [18]. This task asks to localize the pun word within a sentence. We are interested in this
task, as detecting the ambiguous pun word is essential for developing more effective pun detectors
(discussed in Section 4.1). Moreover, detecting the pun location allows for giving special attention to
this word in the pun translation models and approaches (discussed in Section 3.3). It also allows
focusing the pun translation evaluation specifically on the matching source and reference pun words.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Approach</title>
          <p>For the Pun Localization task, there are 2,315 English sentences, each with the pun word labeled in the
sentence. The corpus used to train the French model consists of 2,001 French sentences labeled in the
same way. We randomly sampled these datasets, with 80% used as the train data and the remaining
20% used as the test set. This meant that 463 English sentences and 420 French sentences were
used to evaluate the models.</p>
          <p>We use two primary models: RoBERTa-large for the English data and CamemBERT for the French
data.</p>
          <p>RoBERTa-large RoBERTa-large is a state-of-the-art transformer model based on the earlier-described
BERT architecture. It is an extension of the RoBERTa-base model, featuring 24 layers and 335 million
parameters [19]. The extensive pre-training of this model makes it suitable for pun localization, because
it allows the model to understand complicated linguistic patterns, similarly to BERT.</p>
          <p>For the localization task, the data was preprocessed by using the RoBERTa tokenizer from the
transformers library to tokenize the text data. This ensured consistent input representations. Padding
and truncation were applied to standardize the input length across all sequences. Token-level labels
were assigned to identify the positions of pun words within the sentences. In cases where standard
tokenization did not capture pun variations (such as capitalizations or apostrophes), alternative methods
were applied to ensure coverage.</p>
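          <p>As an illustration of the token-level labeling step, the following sketch uses a simple whitespace tokenizer in place of the RoBERTa tokenizer; the normalization mirrors the handling of capitalization and surrounding punctuation described above, and all names are illustrative:</p>

```python
import string

def pun_token_labels(sentence, pun_word):
    """Label each token 1 if it matches the annotated pun word, else 0.
    Comparison is case-insensitive and strips surrounding punctuation,
    so capitalized or quoted variants of the pun word are still matched."""
    def norm(tok):
        return tok.strip(string.punctuation).lower()
    target = norm(pun_word)
    return [1 if norm(tok) == target else 0 for tok in sentence.split()]

labels = pun_token_labels("Save the whales, cried Tom.", "whales")
```

          <p>With subword tokenizers, the same word-level labels would additionally be propagated to each subword piece of the matched word.</p>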
          <p>CamemBERT CamemBERT is a specialized transformer model designed specifically for French
natural language understanding tasks [20]. Built upon the RoBERTa architecture, CamemBERT features
110 million parameters in its base configuration and 335 million parameters in its large configuration.
The model was pre-trained on a large French corpus, which makes it particularly effective for the pun
localization task in French sentences.</p>
          <p>The preprocessing for the localization task was identical to the preprocessing used for RoBERTa.
However, since a part of the dataset could not be tokenized successfully, multiple runs were done using
the CamemBERT model. For the filtered runs, the sentences that could not be tokenized were excluded
(110 sentences). Another run was done using data that only contained a single instance of wordplay per
sentence (165 sentences excluded), similarly to the English dataset.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Results</title>
          <p>
            This concludes our efforts to revisit the Joker 2023 tasks, and to directly relate these tasks to the Joker
2024 tasks. The pun translation task of Joker 2023 was continued into Joker 2024 [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], as discussed
already above in Section 3.3.
          </p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Task 1: Topical Relevance versus Wordplay Retrieval</title>
        <p>In this section, we will investigate the humor-aware retrieval models of Section 3.1 in terms of their
ability to retrieve topically relevant information.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. On-Topic versus Humorous</title>
          <p>In our results on Task 1 (Humor-aware Information Retrieval) above, we observed a low performance
for neural rerankers based on effective zero-shot cross-encoders. These models have shown highly
effective zero-shot performance for passage retrieval in numerous domains, and we speculated that the
loss of performance is due to the models attracting many topically relevant but non-humorous results.</p>
          <p>The Task 1 corpus is constructed in a particular way, with only known relevant puns treated as relevant.
In order to make the task challenging, a high fraction of topically relevant but non-humorous text
is added to the corpus, plus additional non-relevant content in the larger corpus. The provided
train qrels allow us to reconstruct this for the 12 train topics. Specifically, there is a total of 562 relevant
and humorous results (on average 46.8 per query) and a total of 1,827 other topically relevant results
(on average 152.3 per query, and 199.1 on average combined). So within all combined topically relevant
content, only 23.5% is humorous text and a majority of 76.5% is non-humorous.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Results</title>
          <p>Table 9 (top half) shows the retrieval effectiveness of zero-shot cross-encoder rerankers for both
the lexical baseline models BM25 and BM25+RM3. We again observe that the performance drops
considerably, and decreases further when reranking a larger set of results. This can be explained by our analysis
above on the distribution of humorous texts within the set of topically relevant documents.</p>
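          <p>The 23.5% figure follows directly from the per-query counts in the train qrels; a quick arithmetic check:</p>

```python
n_queries = 12
humorous_relevant = 562      # relevant and humorous judgments in the train qrels
other_relevant = 1827        # topically relevant but non-humorous judgments

per_query_humorous = humorous_relevant / n_queries          # about 46.8 per query
per_query_other = other_relevant / n_queries                # about 152.3 per query
humorous_share = humorous_relevant / (humorous_relevant + other_relevant)
# humorous_share is roughly 0.235: under a quarter of the topically
# relevant pool is humorous, which is what penalizes topical-only rerankers.
```
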
          <p>We changed the relevance assessments into a set of graded judgments, where we reward both aspects.
Specifically, we treat topically relevant documents as relevance level 1, and relevant humorous content
as relevance level 2. Graded measures like NDCG will still prioritize humorous texts, but Boolean
measures will treat all topically relevant results in the same way. Table 9 (bottom half) shows the
matching results when looking at all topically relevant content. We observe that the performance drop
disappears, and that scores can even increase when reranking a larger set of results. As the topic set
is relatively small with 12 queries, there is no clear pattern, and the main gain seems to be in early
precision. This analysis confirms that zero-shot rerankers are effective in terms of topical relevance,
but that dedicated models for humor-aware IR are needed in order to effectively retrieve humorous text.
This also highlights the value of the pun detector based approach we proposed in this paper, which
led to clear improvements in retrieval effectiveness for humor-aware IR.</p>
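          <p>To make the effect of the graded remapping concrete, a minimal NDCG sketch with linear gains and the levels above (2 = relevant pun, 1 = topically relevant only): demoting a pun below a topical-only result lowers the graded score, while a Boolean measure would see no difference between the two rankings.</p>
          <preformat>
```python
import math

def dcg(gains):
    # Discounted cumulative gain with the standard log2 rank discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Gains follow the remapping above: 2 = relevant pun, 1 = topical only.
print(ndcg([2, 1, 0]))  # ideal order, NDCG = 1.0
print(ndcg([1, 2, 0]))  # pun demoted below a topical-only result, NDCG < 1.0
```
          </preformat>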
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Task 3: Multiple Translation Candidates</title>
        <p>In this section, we will investigate the pun translation models of Section 3.3 in terms of their ability to
generate multiple candidate translations.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Filtering for Wordplay</title>
        <p>In this section, we discuss further experiments based on 1) MarianMT to generate multiple, and different,
candidate translations, 2) building an effective pun detector for French, and 3) using this wordplay
detector to select the most likely pun translation.</p>
        <p>We construct a pun detector for French, following the CLEF 2023 Joker Pun Detection Task [17]
discussed above in Section 4.1. We split the train data into 90% training and 10% hold-out test data.
Table 10 (top) shows the performance on the small hold-out test set. First, as this is a single Boolean
prediction with 50/50 balanced classes, a dummy classifier from scikit-learn gives a random
prediction with uniform probabilities, and indeed scores around 0.50. Second, DistilBERT-base scores
notably better, and fine-tuning by hyperparameter optimization leads to further improvement in F1.
We decided to continue with the “FT1” version optimizing precision, rather than the “FT2” version
optimizing recall, as our ultimate French pun detection model. Table 10 (bottom) shows the performance
of this pun detector over the entire train data and the Joker 2023 test data. While the train performance
is an overestimation, as 90% of this data is seen in training, the model did not seem to suffer from
significant over-fitting in manual inspection. This is confirmed by the performance on the official test
set, where we observe impressive performance on this complex task. To put this score in perspective,
the highest score at the CLEF 2023 Joker Pun Detection Task was 0.6645 [17].</p>
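        <p>The ~0.50 score of the random baseline on balanced classes is easy to reproduce. The stand-in below mimics scikit-learn's DummyClassifier with the uniform strategy using only the standard library; the label layout is invented for illustration.</p>
        <preformat>
```python
import random

random.seed(0)  # fixed seed for reproducibility

# Balanced Boolean labels, mirroring the 50/50 pun / non-pun hold-out split.
labels = [i % 2 for i in range(2000)]

# Uniform random predictions, mimicking what scikit-learn's
# DummyClassifier(strategy="uniform") does at prediction time.
preds = [random.randint(0, 1) for _ in labels]

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"random-baseline accuracy: {accuracy:.3f}")  # close to 0.50
```
        </preformat>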
        <p>Our French pun classifier is trained to provide "pun" and "non-pun" class probabilities, and
above we treated the predicted class with the highest probability as the Boolean pun prediction asked
for in the CLEF 2023 Joker Pun Detection Task. We made runs in which we have our translation model
generate five candidate translations using beam search, and select the candidate translation with the
highest pun classification probability directly.</p>
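        <p>The selection step reduces to an argmax over candidate scores. In the sketch below, the candidate strings and the scorer are hypothetical stubs standing in for the MarianMT beam-search output and the fine-tuned French pun classifier.</p>
        <preformat>
```python
def select_best(candidates, pun_probability):
    # Rank the beam-search candidates by the classifier's "pun" probability
    # and return the translation most likely to preserve the wordplay.
    return max(candidates, key=pun_probability)

# Stub probabilities standing in for the fine-tuned French pun classifier;
# the candidate strings and scores are invented for illustration.
scores = {
    "Traduction littérale sans jeu de mots.": 0.12,
    "Traduction qui garde le jeu de mots.": 0.81,
    "Autre paraphrase descriptive.": 0.33,
}
best = select_best(list(scores), scores.get)
print(best)
```
        </preformat>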
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Results</title>
        <p>Table 11 shows the evaluation over the entire output. We make the following observations. First,
MarianMT optimized to generate multiple translations using beam search over the generated output
performs better than the base MarianMT fine-tuned on the train data (shown in Table 5 before). We
initially observed very similar candidates, with either all or none containing wordplay. We took special
effort to generate sufficiently diverse candidates, increasing the likelihood that one of them satisfies
our pun detector. Beam search seems favorable for the setting of the task, as entertaining multiple
candidates ultimately leads to a better highest-ranked candidate by the model itself. Second, the filtered run
selecting the candidate translation with the highest probability of being a wordplay also outperforms the
earlier standard fine-tuned MarianMT model. Third, when evaluating over the entire generated sequence,
we see that the model without the explicit filter (returning the most likely translation according to
the model) scores higher than the filtered output (returning the most likely wordplay according to the
pun detector). This may be due, in part, to the evaluation over the entire prediction containing many
non-pun words. As can be expected, our model indeed increases the number of estimated wordplays
according to the pun detector for French.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusions</title>
      <p>This paper detailed the University of Amsterdam’s participation in the CLEF 2024 Joker track. We
conducted a range of experiments for each of the three tasks of the track. For Task 1 on Humor-aware
Information Retrieval, we observed that standard ranking approaches are effective for retrieving relevant
sentences given a query, but a pun classification filter is effective for selecting humorous results. For Task 2
on Humor Classification, we submitted preliminary approaches based on a BERT encoder based classifier
that obtains reasonable performance in classifying different aspects of humor, with some distinctions being
hard for both models and humans. For Task 3 on Pun Translation, we experimented with sequence-to-sequence
machine translation models that provide high-quality descriptive translations, yet preserving
the wordplay across languages remains challenging. For the task on Pun Localization, we observed that
while RoBERTa-large and CamemBERT-base models improved on the current state-of-the-art models in
locating instances of wordplay within sentences, achieving robust localization across different
languages remains a persistent challenge.</p>
      <p>Our specific focus was to investigate how an effective wordplay detector can be used to select humorous
search results or candidate translations, within the context of the track’s humor retrieval, classification,
and translation tasks. We revisited the CLEF 2023 Joker Track’s Pun Detection task, and were able to
build effective neural pun classifiers. The value of these classifiers was demonstrated as a filter on the
results of a standard ranker for the Humor-aware IR task of the CLEF 2024 Joker Track.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was conducted as part of the final research projects of the Bachelor in Artificial Intelligence at the
University of Amsterdam. We thank the coordinator Dr. Sander van Splunter for his support and flexibility to
work around the CLEF deadlines. We also thank the track and task organizers for their amazing service and
effort in making realistic benchmarks for analyzing and processing humorous text available. Jaap Kamps is
partly funded by the Netherlands Organization for Scientific Research (NWO CI # CISC.CC.016, NWO NWA
# 1518.22.105), the University of Amsterdam (AI4FinTech program), and ICAI (AI for Open Government Lab).
Views expressed in this paper are not necessarily shared or endorsed by those funding the research.</p>
      <p>[7] S. Roy, SimpleT5 — train T5 models in just 3 lines of code, 2021. URL: https://github.com/Shivanandroy/simpleT5.
[8] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning
research 21 (2020) 1–67.
[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, 2019. arXiv:1810.04805.
[10] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT
works, Transactions of the Association for Computational Linguistics 8 (2020) 842–866. URL:
https://aclanthology.org/2020.tacl-1.54. doi:10.1162/tacl_a_00349.
[11] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank
adaptation of large language models, 2021. arXiv:2106.09685.
[12] M. Junczys-Dowmunt, R. Grundkiewicz, T. Dwojak, H. Hoang, K. Heafield, T. Neckermann,
F. Seide, U. Germann, A. F. Aji, N. Bogoychev, A. F. T. Martins, A. Birch, Marian: Fast neural
machine translation in C++, in: F. Liu, T. Solorio (Eds.), Proceedings of ACL 2018, System
Demonstrations, Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 116–
121. URL: https://aclanthology.org/P18-4020. doi:10.18653/v1/P18-4020.
[13] Neptune.ai, Hugging face pre-trained models: Find the best, https://neptune.ai/blog/
hugging-face-pre-trained-models-find-the-best, 2024. Accessed: 2024-06-04.
[14] HuggingFace, MarianMT documentation, https://huggingface.co/transformers/v3.5.1/model_doc/
marian.html, 2024. Accessed: 2024-06-04.
[15] K. S. Kalyan, Pretrained language models for neural machine translation, https://medium.com/
@kalyanks/pretrained-language-models-for-neural-machine-translation-b2cdd2b22e78, 2023.
Accessed: 2024-06-04.
[16] H. Bartsch, Unlocking the power of T5: The versatile language model for text-to-text tasks,
https://aggregata.de/en/blog/pretrained-transformer/t5/, 2024. Accessed: 2024-06-04.
[17] L. Ermakova, T. Miller, A. Bosser, V. M. Palma-Preciado, G. Sidorov, A. Jatowt, Overview of JOKER
2023 automatic wordplay analysis task 1 - pun detection, in: M. Aliannejadi, G. Faggioli, N. Ferro,
M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023),
Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings,
CEUR-WS.org, 2023, pp. 1785–1803. URL: https://ceur-ws.org/Vol-3497/paper-149.pdf.
[18] L. Ermakova, T. Miller, A. Bosser, V. M. Palma-Preciado, G. Sidorov, A. Jatowt, Overview of JOKER
2023 automatic wordplay analysis task 2 - pun location and interpretation, in: M. Aliannejadi,
G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation
Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR
Workshop Proceedings, CEUR-WS.org, 2023, pp. 1804–1817. URL: https://ceur-ws.org/Vol-3497/
paper-150.pdf.
[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, 2019. URL: https://arxiv.org/abs/1907.
11692. arXiv:1907.11692.
[20] L. Martin, B. Muller, P. J. Ortiz Suárez, Y. Dupont, L. Romary, E. de la Clergerie, D. Seddah, B. Sagot,
Camembert: a tasty french language model, in: Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, Association for Computational Linguistics, 2020. URL:
http://dx.doi.org/10.18653/v1/2020.acl-main.645. doi:10.18653/v1/2020.acl-main.645.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Palma-Preciado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 JOKER track: Automatic humour analysis</article-title>
          , in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2024 JOKER task 1: Humour-aware information retrieval</article-title>
          , in: G.
          <string-name>
            <surname>Faggioli</surname>
          </string-name>
          , et al. (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Palma-Preciado</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2024 JOKER task 2: Humour classification according to genre and technique</article-title>
          , in: G.
          <string-name>
            <surname>Faggioli</surname>
          </string-name>
          , et al. (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF 2024 JOKER task 3: Translate puns from english to french</article-title>
          , in: G.
          <string-name>
            <surname>Faggioli</surname>
          </string-name>
          , et al. (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Palma-Preciado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <article-title>Overview of JOKER - CLEF-2023 track on automatic wordplay analysis</article-title>
          , in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International Conference of the CLEF Association, CLEF</source>
          <year>2023</year>
          , Thessaloniki, Greece,
          <source>September 18-21</source>
          ,
          <year>2023</year>
          , Proceedings, volume
          <volume>14163</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          , pp.
          <fpage>397</fpage>
          -
          <lpage>415</lpage>
          . URL: https://doi.org/10.1007/978-3-031-42448-9_26. doi:10.1007/978-3-031-42448-9_26.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <article-title>Pyserini: A python toolkit for reproducible information retrieval research with sparse and dense representations</article-title>
          , in: F. Diaz,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          , T. Sakai (Eds.),
          <source>SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada,
          <source>July 11-15</source>
          ,
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>2356</fpage>
          -
          <lpage>2362</lpage>
          . URL: https://doi.org/10.1145/3404835.3463238. doi:10.1145/3404835.3463238.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>