<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2025 JOKER Task 1: Humour-aware Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Campos</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anne-Gwenn Bosser</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tristan Miller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Austrian Research Institute for Artificial Intelligence (OFAI)</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bretagne INP - ENIB, Lab-STICC CNRS UMR 6285</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, University of Manitoba</institution>
          ,
          <addr-line>Winnipeg</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>INESC TEC</institution>
          ,
          <addr-line>Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>HCTI</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Beira Interior</institution>
          ,
          <addr-line>Covilhã</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents the details of Task 1 of the JOKER-2025 Track, an information retrieval task where the goal is to find relevant humorous texts in a collection of documents. The intended use case is retrieving jokes on a specific topic, something that may benefit humanities research, second-language learning, and the writing or translation of comedic texts. We provide two document collections: one in English and another in European Portuguese. The English collection consists of 77,658 documents, of which 5,198 are annotated as humorous, and 219 queries with relevance judgments. The Portuguese collection contains 45,126 texts, including 1,199 humorous documents, along with 98 queries. Together, these collections support cross-linguistic studies in humour detection and contribute to the development of more inclusive and language-aware retrieval systems. Nine teams submitted a total of 62 runs for this task.</p>
      </abstract>
      <kwd-group>
        <kwd>Wordplay</kwd>
        <kwd>Puns</kwd>
        <kwd>Humour-aware Information Retrieval</kwd>
        <kwd>Computational Humour</kwd>
        <kwd>Wordplay detection</kwd>
        <kwd>Test collection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Documents retrieved in this task should fulfill two criteria: they must be relevant to the query, which encodes a topic, and they must be
humorous, which for our purposes means being instances of wordplay. For example, a search query of
“math” would mean that the goal is to find math jokes (e.g., “Why don’t mathematicians argue? Because
they always try to find common denominators!”), while the query “Tom” would mean that the goal is
to find jokes about some person or entity named Tom (e.g., “Why did Tom bring a ladder to the bar?
Because he heard the drinks were on the house!”).</p>
      <p>
        The data for our task builds upon the English corpora constructed in previous editions of JOKER [
        <xref ref-type="bibr" rid="ref2">2, 3</xref>
        ],
and has been expanded with a substantial set of humorous texts in Portuguese.
      </p>
      <p>This year, nine of the thirteen active JOKER teams submitted 62 runs for Task 1, out of the 136 runs
submitted to the track. (See run statistics in Table 1.) The English subtask attracted nearly twice as many
participants as the Portuguese one, with nine teams submitting 41 runs for English, compared to five
teams submitting 21 runs for Portuguese.</p>
      <p>This year saw a significant change to the task infrastructure, which is now hosted at Codabench [10], a
Free Software web-based platform for organising AI benchmarks. We provided separate Codabench
benchmarks for English (https://www.codabench.org/competitions/8686/; see Fig. 1) and Portuguese
(https://www.codabench.org/competitions/8736/). Codabench facilitated the organisation of the 2025
track and attracted many new participants, who registered on the platform and gained full access to the
competition, including submission and leaderboard pages. We continue to receive new registrations
and post-competition submissions; however, this paper presents only those runs submitted prior to the
official release of results to participants.</p>
      <p>The remainder of this paper is structured as follows: Section 2 describes the test and train data
in English and Portuguese as well as its format, Section 3 presents the evaluation metrics, Section 4
describes the participants’ runs, and Section 5 presents an analysis of their results on the training and
test data. Finally, Section 6 provides some concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset construction and characterisation</title>
      <p>The data for this task consist of documents in both English and Portuguese, allowing for cross-lingual
research and evaluation. The following two sections describe the procedures carried out to construct
and prepare the dataset.</p>
      <sec id="sec-2-1">
        <title>2.1. English data</title>
        <p>
          In the 2025 edition, the English data is an extension of that used in Task 1: Humour-aware Information
Retrieval from JOKER 2024 [
          <xref ref-type="bibr" rid="ref2">2, 5</xref>
          ], which was constructed based on an English wordplay detection
corpus [17, 18] and valid translations [
          <xref ref-type="bibr" rid="ref2">2, 19</xref>
          ]. We grouped the humorous texts into clusters of related
topics and created queries based on these clusters. We added a significant number of topically relevant
but non-humorous texts by extracting relevant passages from Wikipedia and by generating passages
using Meta’s Llama-2 (7B) models. Due to the number of queries, the corpus contains a large fraction of
non-relevant content. Both positive and negative examples included a mix of generated and
human-written texts to prevent the task from being reduced to simply detecting generated content.
        </p>
        <p>In 2024, the total number of documents in the corpus was 61,268, with 4,492 humorous texts and
56,776 non-humorous ones. For 57 queries, 11,831 documents were considered topically relevant.</p>
        <p>For the 2025 edition, we expanded this data with new manually created jokes and texts generated
by the LLMs Bard, Claude, ChatGPT, and Phi-3 Mini. The resulting corpus contains 77,658 texts in
total, of which 5,198 are humorous. Detailed statistics on the English-language data sources for Task 1:
Humour-aware Information Retrieval are given in Table 2.</p>
        <p>For creating the set of queries, we harnessed data from CLEF 2023 JOKER Task 2: Pun Location and
Interpretation [20, 3, 21], and in particular, the locations of wordplay in texts – i.e. words or phrases
carrying multiple meanings. In CLEF 2023 JOKER Task 2, puns were either homographic (identical
spelling as in “I used to be a banker but I lost interest”) or heterographic (i.e., exploiting paronymy as in
propane/profane in “When the church bought gas for their annual barbecue, proceeds went from the
sacred to the propane.”) To expand the queries, we used the semantic annotations of pun locations (pun
interpretation) – i.e., pairs of lemmatised word sets, containing the synonyms (or, if absent, hypernyms)
of the two words involved in the pun, excluding any that share the same spelling as the pun. The lists
of query expansions were manually checked. A document was deemed humorous and relevant to the
query if it came from the positive examples of the JOKER corpus and included the query term or its
expansions.</p>
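        <p>To make the labelling rule above concrete, here is a minimal Python sketch of it; the whitespace tokenisation and the pre-lemmatised expansion set are simplifying assumptions, not the exact annotation pipeline:
# A document is deemed humorous and relevant to a query if it is a positive
# (humorous) JOKER example and contains the query term or one of its expansions.
def is_relevant_humorous(doc_text, is_humorous, query_term, expansions):
    if not is_humorous:  # only positive examples qualify
        return False
    tokens = {t.strip('.,!?;:"').lower() for t in doc_text.split()}
    candidates = {query_term.lower()} | {e.lower() for e in expansions}
    return not tokens.isdisjoint(candidates)

# Example with the "math" query and illustrative expansions:
print(is_relevant_humorous(
    "Why don't mathematicians argue? They always find common denominators!",
    True, "math", {"mathematicians", "arithmetic"}))  # True
        </p>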
        <p>In this edition, 6,655 documents were judged humorous and topically relevant across the 219 queries. As in
2024, we used 11 queries for training and the rest for testing. Detailed statistics on the number of
relevant humorous documents per query for the English dataset are given in Figure 2.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Portuguese data</title>
        <p>To extend the multilingual scope of the task, we introduced a substantial new dataset in European
Portuguese (PT-PT). This collection consists of 45,126 documents, of which 1,199 are humorous and
43,927 are non-humorous. The humorous texts were compiled through a three-stage process. First,
660 humorous instances from the English dataset were automatically translated into Portuguese using
DeepL. Second, 421 texts were manually curated from various Portuguese-language websites. Finally,
118 humorous texts were generated using ChatGPT (4o-mini model). All texts underwent manual
validation to ensure quality and conformity to the PT-PT variant. Queries for this collection were
derived through a systematic topic-grouping procedure. Using GPT-3.5-turbo, the puns were clustered
by theme, e.g., “grapes” and “oranges” were grouped under the broader category “fruit”. Puns without a
clear thematic link were marked as irrelevant. A manual curation process refined these groupings into
98 distinct queries associated with the 1,199 humorous texts.</p>
        <p>To compile the 43,927 non-humorous documents, we employed a two-step process. First, 41,028
sentences were retrieved from Wikipedia using the same API-based approach as in the English dataset.
Then, 2,899 additional non-humorous texts were generated using GPT-3.5-turbo. To ensure consistency
with the European Portuguese variant, all texts were passed through the PtVId model [22] to detect
Brazilian Portuguese (PT-BR) entries. Any PT-BR texts were automatically translated into PT-PT using
ChatGPT-4o-mini, followed by manual validation.</p>
        <p>Twenty-nine queries with their judgments (qrels) were created for training or validating participants’
systems. Then, another 69 queries were created as a test set; note that we also included all the training-set
queries in the test input file, although they are excluded from the resulting scores. For all 98 queries
(combined test and training), 21,636 documents were considered topically relevant (i.e., they matched
the query or its expansions). Of these, 1,334 were humorous; this number is higher than the 1,199
humorous documents collected, as a document may be associated with more than one query.</p>
        <p>The descriptive statistics of the Portuguese data sources are provided in Table 3, while Figure 3 shows
the distribution of relevant humorous texts per query. For Portuguese, the average is 14, with a median
of 8, reflecting a more compact distribution aligned with the smaller dataset size.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Input formats</title>
        <p>As described in the following subsections, the input formats for the document collection, queries, and
training/validation data (qrels) generally follow those used for the 2024 edition of the task.
2.3.1. Document collection
We provide the training and test data in JSON format with the following fields:
• docid: a unique document identifier
• text: the text of the instance, which may or may not contain wordplay</p>
        <p>Input example:
{"docid": "1", "text": "Good laws have sprung from bad customs."},
{"docid": "2", "text": "The musical score to Topsyturveydom does not survive, but amateur productions in recent decades have used newly composed scores or performed the work as a non-musical play."},
{"docid": "3", "text": "The organic compound primarily responsible for the characteristic odor of musk is muscone."}
2.3.2. Queries
The train and test queries are also JSON files, this time with the following fields:
• qid: a unique query identifier from the input file
• query: the search query; e.g., "math" means that the goal is to find math jokes, while "Tom" means that the goal is to find jokes about Tom</p>
        <p>Input example:
{"qid": "qid_train_1", "query": "steps"},
{"qid": "qid_train_3", "query": "math"},
{"qid": "qid_train_4", "query": "Tom"}
2.3.3. Qrels
Finally, we provide training/validation data in the format of JSON qrels files with the following fields:
• qid: a unique query identifier from the query input file
• docid: a unique document identifier from the corpus
• qrel: an indication of whether the document docid is relevant to the query qid and is a wordplay instance
Example of a qrels file:
{"qid": "qid_train_0", "docid": "27260", "qrel": 0},
{"qid": "qid_train_0", "docid": "591", "qrel": 1},
{"qid": "qid_train_0", "docid": "51135", "qrel": 1}</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Output format</title>
        <p>As with the input formats, the output format is identical to that used in CLEF 2024 JOKER Task 1. That
is, we required results to be provided in a JSON format with the following fields:
• run_id: run ID starting with &lt;team_id&gt;_&lt;task_id&gt;_&lt;method_used&gt;, e.g. UBO_task_1_TFIDF
• manual: flag indicating whether the run is manual {0,1}
• qid: a unique query identifier from the input file
• docid: an identifier of the document retrieved from the corpus for the query qid
• rank: retrieved document rank
• score: normalised document relevance score (on a [0, 1] scale)
        </p>
        <p>For each query, the maximum allowed number of distinct documents (docid field) is 1000. A sample
output file is as follows:
{"run_id": "team1_task_1_TFIDF", "manual": 0, "qid": "qid_train_0", "docid": "591", "rank": 2, "score": 0.8},
{"run_id": "team1_task_1_TFIDF", "manual": 0, "qid": "qid_train_1", "docid": "27261", "rank": 1, "score": 0.7}</p>
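        <p>As an illustration, here is a minimal Python sketch that turns per-query document scores into this output format; the scoring itself is left abstract, and the run ID is a placeholder:
import json

def write_run(scored, run_id="team1_task_1_TFIDF", path="run.json", k=1000):
    # scored: dict mapping qid to a dict mapping docid to a score in [0, 1].
    # Keeps the top-k documents per query, as required by the task (k = 1000).
    rows = []
    for qid, doc_scores in scored.items():
        ranked = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)[:k]
        for rank, (docid, score) in enumerate(ranked, start=1):
            rows.append({"run_id": run_id, "manual": 0, "qid": qid,
                         "docid": docid, "rank": rank, "score": round(score, 4)})
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False)

write_run({"qid_train_0": {"591": 0.8, "27260": 0.1}})
        </p>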
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation measures</title>
      <p>Performance was measured with standard information retrieval metrics as implemented in TrecTools,
a Free Software Python library for information retrieval [23]. For each run we report the number of
documents retrieved (#ret), the number of relevant documents retrieved (#rel), mean average precision
(MAP; the mean of average precision scores across queries), geometric mean average precision (GMAP),
precision at the number of relevant documents (P@R), mean reciprocal rank (MRR; the average of the
reciprocal rank of the first relevant item across queries), precision at cutoff k (P@k; the proportion of relevant
items retrieved at the top k = 5, 10, 100, 1000 positions), normalised discounted cumulative gain
(NDCG; accounting for the relevance and position of documents in the ranking, normalised against the
ideal ranking), and (for Portuguese only) the binary preference score (bpref).</p>
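      <p>The official scores were computed with TrecTools; the following self-contained Python sketch merely illustrates how MAP and MRR are defined, with a run given as a ranked list of docids per query and qrels as a set of relevant docids per query:
def average_precision(ranked, relevant):
    # Mean of the precision values at the rank of each relevant hit.
    hits, precisions = 0, []
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant document; 0 if none is retrieved.
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0

def map_and_mrr(run, qrels):
    aps = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    rrs = [reciprocal_rank(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps), sum(rrs) / len(rrs)
      </p>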
    </sec>
    <sec id="sec-4">
      <title>4. Participants’ approaches</title>
      <p>In total, nine teams submitted 62 runs (see Table 1), with five of these teams contributing 21 runs to
the Portuguese subtask. Every team participating in the Portuguese subtask also submitted runs for
English. The approaches used by the participating teams are as follows:</p>
      <p>rasion [15] This team proposed a dual-screening architecture that separates humour-aware
information retrieval into two distinct stages. The first stage employs a semantic similarity model that uses the
paraphrase-multilingual-mpnet-base-v2 model to encode queries and documents into dense vector
representations, and distance-based metrics and cosine similarity to quantify semantic alignment and filter
query-relevant documents. This step is followed by a transformer-based classifier (xlm-roberta-base)
that identifies humorous texts containing puns. The method, applied to both the English and Portuguese
datasets, aims to reduce task complexity through modularisation. Their system achieved strong
performance in Portuguese, highlighting the effectiveness of separating the relevance and humour detection
subtasks.</p>
      <p>cryptix and sarath_kumar [11] These participants employed a fine-tuned Sentence-BERT (SBERT)
model to generate semantic embeddings of queries and documents. They trained the model using a
cosine similarity loss on humour-labelled query–document pairs, aiming to capture implicit humour
such as irony or exaggeration. The resulting vectors were indexed using Facebook AI Similarity
Search (FAISS) for efficient retrieval, and results were re-ranked using human-annotated humour
intensity scores.</p>
      <p>igoranchik [13] This team implemented a hybrid retrieval pipeline combining dense and lexical
retrieval, followed by cross-encoder reranking. They fine-tuned the intfloat/multilingual-e5-small
model using contrastive objectives – Multiple Negative Ranking Loss (MNRL) and an Adaptive Margin
Loss – on humour-annotated data, including synthetic queries generated with GPT-4o-mini. BM25
was used for lexical retrieval via Anserini, while dense vectors were stored in Qdrant. The top 1000
documents from both retrieval methods were merged using reciprocal rank fusion (sketched below) and re-ranked using
the cross-encoder/ms-marco-MiniLM-L12-v2.</p>
      <p>pjmathematician [14] This team implemented a two-stage pipeline using the Qwen family of large
language models (LLMs). First, they applied large Qwen models (Qwen3-14B and Qwen3-32B) to analyse
the entire document corpus, generating humour-related metadata such as a binary ‘isJoke’ flag and
textual explanations for each document. These enriched representations were then used in a dense
retrieval step, where smaller Qwen embedding models (Qwen3-4B and Qwen3-8B) indexed either the
original text or the explanation-augmented versions. Retrieval was performed using both generic and
humour-specific query prompts.</p>
      <p>tanishc228 [16] This participant proposed a multi-stage ensemble retrieval system combining
traditional IR methods with neural rerankers (ColBERT and a BERT-based cross-encoder), complemented by
handcrafted wordplay features. Their pipeline retrieves documents using both lexical and semantic
methods, followed by contextual reranking and score fusion. The system aims to capture humorous
content by incorporating features such as punctuation, repetition, and alliteration.</p>
      <p>kamps and fhelms [12] These teams submitted baseline runs using Anserini BM25 or BM25+RM3
and zero-shot MSMARCO-trained neural cross-encoder rerankings of the top 100 results.</p>
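      <p>For reference, here is a minimal Python sketch of the reciprocal rank fusion step mentioned above; the constant k = 60 is the conventional default from the original RRF formulation, not necessarily the team’s setting:
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, depth=1000):
    # Each document scores sum(1 / (k + rank)) over the rankings in which it
    # appears; documents with higher fused scores are ranked first.
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking[:depth], start=1):
            fused[docid] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# e.g., merging a BM25 ranking with a dense-retrieval ranking:
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d3", "d1", "d4"]]))
      </p>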
      <p>All participants who submitted runs also submitted system description papers to the Working Notes
volume [24]. Two teams from the same university (alecs and kamps) submitted a single joint report, as
did teams cryptix and sarath_kumar, resulting in a total of seven Working Notes from the participants
of Task 1. Despite the requirement to include the team ID in the run name, participants’ submissions
often differed in their run names, registration details, and Codabench IDs. We manually matched the
Working Notes with the submitted runs and report the results using the team names provided in those
submissions.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Test data</title>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Training data</title>
        <p>As in previous years, runs were submitted for both the training and test datasets in order to analyse
potential overfitting and related efects. Tables 6 and 7 report the Task 1 results on the training data for
English and Portuguese, respectively.</p>
        <p>For the English subtask, we observe that the top four runs according to test-set MAP (0.33 to 0.35) and
NDCG@5 (0.56 to 0.61), submitted by pjmathematician [14], remain at the top of the table, with similarly
close values of MAP (0.44 to 0.48) and NDCG@5 (0.53 to 0.68) on the training data. Both Rasion_SenTransF+Roberta runs [15]
have better scores on the training data than the University of Amsterdam’s runs, with the best-scoring runs
achieving MAP = 0.59 and NDCG@5 = 0.61. The run UAms_RM3RoBERTa_drop60 [12] shows similar
performance in terms of MAP on the training and test sets, but with an improvement in NDCG@5
on the training data. However, many runs achieved higher scores on the training set, which lowered
the ranking of the University of Amsterdam’s run. The two Rasion_SenTransF+Roberta runs [15] and
the Cryptix_SBERT run achieved more than double the MAP and at least a 50% improvement in terms
of NDCG@5 on the training data compared to the test data, which might be a result of overfitting.
The RM3 and BM25 runs from the University of Amsterdam [12] remain mid-ranked, showing stable
scores without signs of overfitting, suggesting similar properties between the training and test data and
confirming their strength as baselines. This also suggests that the improvement of other approaches
may be attributed to their quality rather than differences between this year’s data and the 2024 test
collection. Cross-encoders performed poorly on both the training and test sets, likely because they are
not designed to detect humour.</p>
        <p>Teams rasion [15] and pjmathematician [14] submitted the highest-scoring runs on the English
collections and also achieved the best results on the Portuguese training and test collections. Note that
they achieved better results on the training data than on the official test data. They are followed by the
BM25 run from the University of Amsterdam [12], which ranked fifth on both the Portuguese training
and test collections, showing a 2–3 times drop in MAP and a 4–10 times drop in NDCG@5. However,
the high ranking of BM25 may be partly due to the fact that the Portuguese subtask had roughly half as
many runs as the English subtask. Note that on the test sets, the drop in terms of MAP and NDCG@5 is
even higher. Cross-encoder scores remain low and stable across the test and training collections, as for English.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>
        In this paper, we have presented an overview and discussion of the results of Task 1 of the JOKER-2025
challenge on the retrieval of humorous texts relevant to a search query. Based on the data for wordplay
detection and interpretation previously constructed within the CLEF JOKER track [
        <xref ref-type="bibr" rid="ref1">1, 26, 20, 3, 21</xref>
        ], we
constructed a unique reusable test collection for wordplay retrieval in English. We manually created new
jokes to avoid potential LLM data contamination. To prevent the task from being reduced to generated
text detection, both positive and negative examples comprised a combination of human-written and
machine-generated texts. The English collection consists of 77,658 documents, of which 5,198 are
annotated as humorous, and 219 queries with relevance judgments.
      </p>
      <p>In addition to this, this year we also expanded the dataset with Portuguese data collected from
Portuguese-language websites, translated from the English corpus, and generated by ChatGPT (4o-mini
model). The Portuguese collection contains 98 queries and 45,126 texts, including 1,199 humorous
documents.</p>
      <p>This year, the track setup was updated, with submissions managed through Codabench. Nine teams
submitted 41 runs for the English subtask, of which five also submitted 21 runs for the Portuguese
subtask, resulting in 62 valid distinct runs in total.</p>
      <p>The teams applied diverse methods, ranging from traditional rankers such as TF–IDF,
BM25, and RM3, and cross-encoders with and without filtering, to more modern approaches, including fine-tuned
transformers and LLMs. The best results for both English and Portuguese were achieved by the team
pjmathematician [14], which applied Qwen models for retrieval and filtering, and the team rasion [15],
which applied dense retrieval and transformer-based detection of humorous texts. These results might
testify to AI progress in pun detection. Further analysis is needed to assess the impact of potential LLM
data contamination on this performance.</p>
      <p>This year’s English task showed remarkable progress, with the best run by team pjmathematician
achieving a MAP of 0.3501 – nearly triple last year’s top score – and outperforming all competitors by
a wide margin across the various metrics. In contrast, the University of Amsterdam’s cross-encoder
approaches performed substantially worse than their RM3 and BM25 baselines, confirming the
effectiveness of simpler retrieval strategies for this dataset. For the Portuguese subtask, results were more
balanced, with pjmathematician and rasion achieving similar MAP and NDCG@5 scores around 0.4
to 0.5, far ahead of the BM25 baseline. Interestingly, while the Portuguese runs achieved higher MAP
scores, they trailed the English runs in precision and NDCG@5, likely due to the smaller pool of relevant
humorous documents per query. Overall, these findings suggest that while the dataset’s core properties
have remained stable, combining retrieval and filtering remains key to advancing performance.</p>
      <p>The University of Amsterdam’s RM3 and BM25 runs remained stable and reliable baselines, showing
no overfitting and similar performance across test/training and English/Portuguese datasets.
Improvements by other methods likely reflect their quality rather than dataset differences. Cross-encoders
performed poorly, likely due to their unsuitability for humour detection.</p>
      <p>In general, our results confirm that retrieval models are humour-agnostic and that humour detection is
still a challenge for machine learning models and LLMs, despite improvements over last year’s edition.</p>
      <p>
        For more information about the JOKER lab this year, please refer to the overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and the
Working Notes papers for Task 2: Pun Translation [6] and Task 3: Onomastic Wordplay Translation [7].
Visit the JOKER website at https://joker-project.com for any other information related to the track.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has received a government grant managed by the National Research Agency under the
program Investissements d’avenir integrated into France 2030, with the Reference ANR-19-GURE-0001.
It was also financed by National Funds through the Portuguese funding agency FCT through the project
LA/P/0063/2020 (DOI 10.54499/LA/P/0063/2020). Ricardo Campos would also like to acknowledge
project StorySense, with reference 2022.09312.PTDC (DOI 10.54499/2022.09312.PTDC). We thank all
other colleagues and students who participated in data construction, the translation contests, and the
CLEF JOKER track.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and
spelling checking and for paraphrasing and rewording. Further, the authors used Gemini to generate
images. After using these tools/services, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , A.-G. Bosser,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <article-title>Overview of JOKER: Humour in the machine</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , A.-G. Bosser,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Palma Preciado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 JOKER track: Automatic humour analysis</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), volume
          <volume>14959</volume>
          of Lecture Notes in Computer Science, Springer, Cham,
          <year>2024</year>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>182</lpage>
          . doi:10.1007/978-3-031-71908-0_8.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of JOKER – CLEF-2023 track on automatic wordplay analysis, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), volume 14163 of Lecture Notes in Computer Science, Springer, Cham, 2023, pp. 397–415. doi:10.1007/978-3-031-42448-9_26.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Ermakova, T. Miller, F. Regattin, A.-G. Bosser, C. Borg, É. Mathurin, G. L. Corre, S. Araújo, R. Hannachi, J. Boccou, A. Digue, A. Damoy, B. Jeanjean, Overview of JOKER@CLEF 2022: Automatic wordplay and humour translation workshop, in: A. Barrón-Cedeño, G. D. S. Martino, M. D. Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture Notes in Computer Science, Springer, Cham, 2022, pp. 447–469. doi:10.1007/978-3-031-13643-6_27.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] L. Ermakova, A.-G. Bosser, T. Miller, A. Jatowt, Overview of the CLEF 2024 JOKER Task 1: Humour-aware information retrieval, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. Seco de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), volume 3740 of CEUR Workshop Proceedings, 2024, pp. 1775–1785.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Ermakova, A.-G. Bosser, T. Miller, R. Campos, Overview of the CLEF 2025 JOKER Task 2: Wordplay Translation from English into French, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] L. Ermakova, T. Miller, Y. Naud, A.-G. Bosser, R. Campos, Overview of the CLEF 2025 JOKER Task 3: Onomastic Wordplay Translation, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] D. Gupta, M. Digiovanni, H. Narita, K. Goldberg, Jester 2.0 (demonstration abstract): Collaborative filtering to retrieve jokes, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, Association for Computing Machinery, New York, NY, USA, 1999, p. 333. doi:10.1145/312624.312770.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. Friedland, J. Allan, Joke retrieval: Recognizing the same joke told differently, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 883–892. doi:10.1145/1458082.1458199.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Z. Xu, S. Escalera, A. Pavão, M. Richard, W.-W. Tu, Q. Yao, H. Zhao, I. Guyon, Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform, Patterns 3 (2022). doi:10.1016/j.patter.2022.100543.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. K. P, B. A, S. M, T. S, REC_Cryptix at JOKER CLEF 2025: Teaching Machines to Laugh: Multilingual Humor Detection and Translation, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Kreeft-Libiu, F. Helms, C. Selçuk, J. Bakker, J. Kamps, University of Amsterdam at the CLEF 2025 JOKER Track, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] I. Kuzmin, CLEF 2025 JOKER track: No pun left behind, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. Vachharajani, pjmathematician at the CLEF 2025 JOKER Lab Tasks 1, 2 &amp; 3: A Unified Approach to Humour Retrieval and Translation using the Qwen LLM Family, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] B. Chen, C. Zhong, L. Kong, CLEF 2025 JOKER track: Enhancing humor-aware information retrieval with relevance-aware classification, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] T. Chaudhari, A. Vora, S. Hotha, S. Sonawane, PICT at CLEF 2025 JOKER Task 1: BERT-Enhanced Ensemble Methods, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of JOKER – CLEF-2023 track on automatic wordplay analysis, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, volume 14163, Springer Nature Switzerland, Cham, 2023, pp. 397–415. doi:10.1007/978-3-031-42448-9_26.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] L. Ermakova, A.-G. Bosser, A. Jatowt, T. Miller, The JOKER Corpus: English–French parallel data for multilingual wordplay recognition, in: SIGIR ’23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, 2023, pp. 2796–2806. doi:10.1145/3539618.3591885.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. Ermakova, A.-G. Bosser, T. Miller, A. Jatowt, Overview of the CLEF 2024 JOKER Task 3: Translate puns from English to French, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. Seco de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 1800–1810.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of JOKER 2023 Automatic Wordplay Analysis Task 2 – pun location and interpretation, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, volume 3497 of CEUR Workshop Proceedings, 2023, pp. 1804–1817.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] L. Ermakova, A.-G. Bosser, A. Jatowt, T. Miller, The JOKER Corpus: English–French parallel data for multilingual wordplay recognition, in: SIGIR ’23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, 2023, pp. 2796–2806. doi:10.1145/3539618.3591885.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] H. Sousa, R. Almeida, P. Silvano, I. Cantante, R. Campos, A. Jorge, Enhancing Portuguese variety identification with cross-domain approaches, in: Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence (AAAI’25), volume 39, 2025, pp. 25192–25200.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] J. Palotti, H. Scells, G. Zuccon, TrecTools: An open-source Python library for information retrieval practitioners involved in TREC-like campaigns, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 2019, pp. 1325–1328. doi:10.1145/3331184.3331399.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025: Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] E. Schuurman, M. Cazemier, L. Buijs, J. Kamps, University of Amsterdam at the CLEF 2024 JOKER track, in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, 2024, pp. 1909–1922.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of JOKER 2023 Automatic Wordplay Analysis Task 1 – pun detection, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, volume 3497 of CEUR Workshop Proceedings, 2023, pp. 1785–1803.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>