<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2025 JOKER Task 1: Humour-aware Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liana Ermakova</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Campos</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anne-Gwenn Bosser</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tristan Miller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Austrian Research Institute for Artificial Intelligence (OFAI)</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bretagne INP - ENIB, Lab-STICC CNRS UMR 6285</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, University of Manitoba</institution>
          ,
          <addr-line>Winnipeg</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>INESC TEC</institution>
          ,
          <addr-line>Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Université de Bretagne Occidentale</institution>
          ,
          <addr-line>HCTI</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Beira Interior</institution>
          ,
          <addr-line>Covilhã</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents the details of Task 1 of the JOKER-2025 Track, an information retrieval task where the goal is to find relevant humorous texts in a collection of documents. The intended use case is retrieving jokes on a specific topic, something that may benefit humanities research, second-language learning, and the writing or translation of comedic texts. We provide two document collections: one in English and another in European Portuguese. The English collection consists of 77,658 documents, of which 5,198 are annotated as humorous, and 219 queries with relevance judgments. The Portuguese collection contains 45,126 texts, including 1,199 humorous documents, along with 98 queries. Together, these collections support cross-linguistic studies in humour detection and contribute to the development of more inclusive and language-aware retrieval systems. Nine teams submitted a total of 62 runs for this task.</p>
      </abstract>
      <kwd-group>
        <kwd>Wordplay</kwd>
        <kwd>Puns</kwd>
        <kwd>Humour-aware Information Retrieval</kwd>
        <kwd>Computational Humour</kwd>
        <kwd>Wordplay detection</kwd>
        <kwd>Test collection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Documents retrieved in this task should fulfill two criteria: they must be relevant to the query, which encodes a topic, and they must be
humorous, which for our purposes means being instances of wordplay. For example, a search query of
“math” would mean that the goal is to find math jokes (e.g., “Why don’t mathematicians argue? Because
they always try to find common denominators!”), while the query “Tom” would mean that the goal is
to find jokes about some person or entity named Tom (e.g., “Why did Tom bring a ladder to the bar?
Because he heard the drinks were on the house!”).</p>
      <p>
        The data for our task builds upon the English corpora constructed in previous editions of JOKER [
        <xref ref-type="bibr" rid="ref2">2, 3</xref>
        ],
and has been expanded with a substantial set of humorous texts in Portuguese.
      </p>
      <p>This year, nine of the thirteen active JOKER teams submitted 62 runs for Task 1, out of the 136 runs
submitted to the track. (See run statistics in Table 1.) The English subtask attracted nearly twice as many
participants as the Portuguese one, with nine teams submitting 41 runs for English, compared to five
teams submitting 21 runs for Portuguese.</p>
      <p>This year saw a significant change to the task infrastructure, which is now hosted at Codabench [10], a
Free Software web-based platform for organising AI benchmarks. We provided separate Codabench
benchmarks for English (https://www.codabench.org/competitions/8686/; see Fig. 1) and Portuguese
(https://www.codabench.org/competitions/8736/). Codabench facilitated the organisation of the 2025
track and attracted many new participants, who registered on the platform and gained full access to the
competition, including submission and leaderboard pages. We continue to receive new registrations
and post-competition submissions; however, this paper presents only those runs submitted prior to the
official release of results to participants.</p>
      <p>The remainder of this paper is structured as follows: Section 2 describes the test and train data
in English and Portuguese as well as its format, Section 3 presents the evaluation metrics, Section 4
describes the participants’ runs, and Section 5 presents an analysis of their results on the training and
test data. Finally, Section 6 provides some concluding remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset construction and characterisation</title>
      <p>The data for this task consist of documents in both English and Portuguese, allowing for cross-lingual
research and evaluation. The following two sections describe the procedures carried out to construct
and prepare the dataset.</p>
      <sec id="sec-2-1">
        <title>2.1. English data</title>
        <p>
          In the 2025 edition, the English data is an extension of that used in Task 1: Humour-aware Information
Retrieval from JOKER 2024 [
          <xref ref-type="bibr" rid="ref2">2, 5</xref>
          ], which was constructed based on an English wordplay detection
corpus [17, 18] and valid translations [
          <xref ref-type="bibr" rid="ref2">2, 19</xref>
          ]. We grouped the humorous texts into clusters of related
topics and created queries based on these clusters. We added a significant number of topically relevant
but non-humorous texts by extracting relevant passages from Wikipedia and by generating passages
using Meta’s Llama-2 (7B) models. Due to the number of queries, the corpus contains a large fraction of
non-relevant content. Both positive and negative examples included a mix of generated and
human-written texts to prevent the task from being reduced to simply detecting generated content.
        </p>
        <p>In 2024, the total number of documents in the corpus was 61,268, with 4,492 humorous texts and
56,776 non-humorous ones. For 57 queries, 11,831 documents were considered topically relevant.</p>
        <p>For the 2025 edition, we expanded this data with new manually created jokes and texts generated
by the LLMs Bard, Claude, ChatGPT, and Phi-3 Mini. The resulting corpus contains 77,658 texts in
total, of which 5,198 are humorous. Detailed statistics on the English-language data sources for Task 1:
Humour-aware Information Retrieval are given in Table 2.</p>
        <p>For creating the set of queries, we harnessed data from CLEF 2023 JOKER Task 2: Pun Location and
Interpretation [20, 3, 21], and in particular, the locations of wordplay in texts – i.e. words or phrases
carrying multiple meanings. In CLEF 2023 JOKER Task 2, puns were either homographic (identical
spelling as in “I used to be a banker but I lost interest”) or heterographic (i.e., exploiting paronymy as in
propane/profane in “When the church bought gas for their annual barbecue, proceeds went from the
sacred to the propane.”) To expand the queries, we used the semantic annotations of pun locations (pun
interpretation) – i.e., pairs of lemmatised word sets, containing the synonyms (or, if absent, hypernyms)
of the two words involved in the pun, excluding any that share the same spelling as the pun. The lists
of query expansions were manually checked. A document was deemed humorous and relevant to the
query if it came from the positive examples of the JOKER corpus and included the query term or its
expansions.</p>
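        <p>To make the labelling rule above concrete, here is a minimal Python sketch of it; the whitespace tokenisation and the pre-lemmatised expansion set are simplifying assumptions, not the exact annotation pipeline:
# A document is deemed humorous and relevant to a query if it is a positive
# (humorous) JOKER example and contains the query term or one of its expansions.
def is_relevant_humorous(doc_text, is_humorous, query_term, expansions):
    if not is_humorous:  # only positive examples qualify
        return False
    tokens = {t.strip('.,!?;:"').lower() for t in doc_text.split()}
    candidates = {query_term.lower()} | {e.lower() for e in expansions}
    return not tokens.isdisjoint(candidates)

# Example with the "math" query and illustrative expansions:
print(is_relevant_humorous(
    "Why don't mathematicians argue? They always find common denominators!",
    True, "math", {"mathematicians", "arithmetic"}))  # True
        </p>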
        <p>In this edition, 6,655 documents were judged humorous and topically relevant across the 219 queries. As in
2024, we used 11 queries for training and the rest for testing. Detailed statistics on the number of
relevant humorous documents per query for the English dataset are given in Figure 2.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Portuguese data</title>
        <p>To extend the multilingual scope of the task, we introduced a substantial new dataset in European
Portuguese (PT-PT). This collection consists of 45,126 documents, of which 1,199 are humorous and
43,927 are non-humorous. The humorous texts were compiled through a three-stage process. First,
660 humorous instances from the English dataset were automatically translated into Portuguese using
DeepL. Second, 421 texts were manually curated from various Portuguese-language websites. Finally,
118 humorous texts were generated using ChatGPT (4o-mini model). All texts underwent manual
validation to ensure quality and conformity to the PT-PT variant. Queries for this collection were
derived through a systematic topic-grouping procedure. Using GPT-3.5-turbo, the puns were clustered
by theme, e.g., “grapes” and “oranges” were grouped under the broader category “fruit”. Puns without a
clear thematic link were marked as irrelevant. A manual curation process refined these groupings into
98 distinct queries associated with the 1,199 humorous texts.</p>
        <p>To compile the 43,927 non-humorous documents, we employed a two-step process. First, 41,028
sentences were retrieved from Wikipedia using the same API-based approach as in the English dataset.
Then, 2,899 additional non-humorous texts were generated using GPT-3.5-turbo. To ensure consistency
with the European Portuguese variant, all texts were passed through the PtVId model [22] to detect
Brazilian Portuguese (PT-BR) entries. Any PT-BR texts were automatically translated into PT-PT using
ChatGPT-4o-mini, followed by manual validation.</p>
        <p>Twenty-nine queries with their judgments (qrels) were created for training or validating participants’
systems. Then, another 69 queries were created as a test set; note that we also included all the training-set
queries in the test input file, although they are excluded from the resulting scores. For all 98 queries
(combined test and training), 21,636 documents were considered topically relevant (i.e., they matched
the query or its expansions). Of these, 1,334 were humorous; this number is higher than the 1,199
humorous documents collected, as a document may be associated with more than one query.</p>
        <p>The descriptive statistics of the Portuguese data sources are provided in Table 3, while Figure 3 shows
the distribution of relevant humorous texts per query. For Portuguese, the average is 14, with a median
of 8, reflecting a more compact distribution aligned with the smaller dataset size.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Input formats</title>
        <p>As described in the following subsections, the input formats for the document collection, queries, and
training/validation data (qrels) generally follow those used for the 2024 edition of the task.
2.3.1. Document collection
We provide the training and test data in JSON format with the following fields:
• docid: a unique document identifier
• text: the text of the instance, which may or may not contain wordplay</p>
        <p>Input example:
{"docid": "1", "text": "Good laws have sprung from bad customs."},
{"docid": "2", "text": "The musical score to Topsyturveydom does not survive, but amateur productions in recent decades have used newly composed scores or performed the work as a non-musical play."},
{"docid": "3", "text": "The organic compound primarily responsible for the characteristic odor of musk is muscone."}
2.3.2. Queries
The train and test queries are also JSON files, this time with the following fields:
• qid: a unique query identifier from the input file
• query: the search query; e.g., "math" means that the goal is to find math jokes, while "Tom" means that the goal is to find jokes about Tom</p>
        <p>Input example:
{"qid": "qid_train_1", "query": "steps"},
{"qid": "qid_train_3", "query": "math"},
{"qid": "qid_train_4", "query": "Tom"}
2.3.3. Qrels
Finally, we provide training/validation data in the format of JSON qrels files with the following fields:
• qid: a unique query identifier from the query input file
• docid: a unique document identifier from the corpus
• qrel: an indication of whether the document docid is relevant to the query qid and is a wordplay instance
Example of a qrels file:
{"qid": "qid_train_0", "docid": "27260", "qrel": 0},
{"qid": "qid_train_0", "docid": "591", "qrel": 1},
{"qid": "qid_train_0", "docid": "51135", "qrel": 1}</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Output format</title>
        <p>As with the input formats, the output format is identical to that used in CLEF 2024 JOKER Task 1. That
is, we required results to be provided in a JSON format with the following fields:
• run_id: run ID starting with &lt;team_id&gt;_&lt;task_id&gt;_&lt;method_used&gt;, e.g. UBO_task_1_TFIDF
• manual: flag indicating whether the run is manual {0,1}
• qid: a unique query identifier from the input file
• docid: an identifier of the document retrieved from the corpus for the query qid
• rank: retrieved document rank
• score: normalised document relevance score (on a [0, 1] scale)
        </p>
        <p>For each query, the maximum allowed number of distinct documents (docid field) is 1000. A sample
output file is as follows:
{"run_id": "team1_task_1_TFIDF", "manual": 0, "qid": "qid_train_0", "docid": "591", "rank": 2, "score": 0.8},
{"run_id": "team1_task_1_TFIDF", "manual": 0, "qid": "qid_train_1", "docid": "27261", "rank": 1, "score": 0.7}</p>
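        <p>As an illustration, here is a minimal Python sketch that turns per-query document scores into this output format; the scoring itself is left abstract, and the run ID is a placeholder:
import json

def write_run(scored, run_id="team1_task_1_TFIDF", path="run.json", k=1000):
    # scored: dict mapping qid to a dict mapping docid to a score in [0, 1].
    # Keeps the top-k documents per query, as required by the task (k = 1000).
    rows = []
    for qid, doc_scores in scored.items():
        ranked = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)[:k]
        for rank, (docid, score) in enumerate(ranked, start=1):
            rows.append({"run_id": run_id, "manual": 0, "qid": qid,
                         "docid": docid, "rank": rank, "score": round(score, 4)})
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False)

write_run({"qid_train_0": {"591": 0.8, "27260": 0.1}})
        </p>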
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation measures</title>
      <p>Performance was measured with standard information retrieval metrics as implemented in TrecTools,
a Free Software Python library for information retrieval [23]. For each run we report the number of
documents retrieved (#ret), the number of relevant documents retrieved (#rel), mean average precision
(MAP; the mean of average precision scores across queries), geometric mean average precision (GMAP),
precision at the number of relevant documents (P@R), mean reciprocal rank (MRR; the average of the
reciprocal rank of the first relevant item across queries), precision at cutoff k (P@k; the proportion of relevant
items retrieved at the top k = 5, 10, 100, 1000 positions), normalised discounted cumulative gain
(NDCG; accounting for the relevance and position of documents in the ranking, normalised against the
ideal ranking), and (for Portuguese only) the binary preference score (bpref).</p>
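      <p>The official scores were computed with TrecTools; the following self-contained Python sketch merely illustrates how MAP and MRR are defined, with a run given as a ranked list of docids per query and qrels as a set of relevant docids per query:
def average_precision(ranked, relevant):
    # Mean of the precision values at the rank of each relevant hit.
    hits, precisions = 0, []
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant document; 0 if none is retrieved.
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            return 1.0 / rank
    return 0.0

def map_and_mrr(run, qrels):
    aps = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    rrs = [reciprocal_rank(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps), sum(rrs) / len(rrs)
      </p>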
    </sec>
    <sec id="sec-4">
      <title>4. Participants’ approaches</title>
      <p>In total, nine teams submitted 62 runs (see Table 1), with five of these teams contributing 21 runs to
the Portuguese subtask. Every team participating in the Portuguese subtask also submitted runs for
English. The approaches used by the participating teams are as follows:</p>
      <p>rasion [15] This team proposed a dual-screening architecture that separates humour-aware
information retrieval into two distinct stages. The first stage employs a semantic similarity model that uses the
paraphrase-multilingual-mpnet-base-v2 model to encode queries and documents into dense vector
representations, and distance-based metrics and cosine similarity to quantify semantic alignment and filter
query-relevant documents. This step is followed by a transformer-based classifier (xlm-roberta-base)
that identifies humorous texts containing puns. The method, applied to both the English and Portuguese
datasets, aims to reduce task complexity through modularisation. Their system achieved strong
performance in Portuguese, highlighting the effectiveness of separating the relevance and humour detection
subtasks.</p>
      <p>cryptix and sarath_kumar [11] These participants employed a fine-tuned Sentence-BERT (SBERT)
model to generate semantic embeddings of queries and documents. They trained the model using a
cosine similarity loss on humour-labelled query–document pairs, aiming to capture implicit humour
such as irony or exaggeration. The resulting vectors were indexed using Facebook AI Similarity
Search (FAISS) for efficient retrieval, and results were re-ranked using human-annotated humour
intensity scores.</p>
      <p>igoranchik [13] This team implemented a hybrid retrieval pipeline combining dense and lexical
retrieval, followed by cross-encoder reranking. They fine-tuned the intfloat/multilingual-e5-small
model using contrastive objectives – Multiple Negative Ranking Loss (MNRL) and an Adaptive Margin
Loss – on humour-annotated data, including synthetic queries generated with GPT-4o-mini. BM25
was used for lexical retrieval via Anserini, while dense vectors were stored in Qdrant. The top 1000
documents from both retrieval methods were merged using reciprocal rank fusion (sketched below) and re-ranked using
the cross-encoder/ms-marco-MiniLM-L12-v2.</p>
      <p>pjmathematician [14] This team implemented a two-stage pipeline using the Qwen family of large
language models (LLMs). First, they applied large Qwen models (Qwen3-14B and Qwen3-32B) to analyse
the entire document corpus, generating humour-related metadata such as a binary ‘isJoke’ flag and
textual explanations for each document. These enriched representations were then used in a dense
retrieval step, where smaller Qwen embedding models (Qwen3-4B and Qwen3-8B) indexed either the
original text or the explanation-augmented versions. Retrieval was performed using both generic and
humour-specific query prompts.</p>
      <p>tanishc228 [16] This participant proposed a multi-stage ensemble retrieval system combining
traditional IR methods with neural rerankers (ColBERT and a BERT-based cross-encoder), complemented by
handcrafted wordplay features. Their pipeline retrieves documents using both lexical and semantic
methods, followed by contextual reranking and score fusion. The system aims to capture humorous
content by incorporating features such as punctuation, repetition, and alliteration.</p>
      <p>kamps and fhelms [12] These teams submitted baseline runs using Anserini BM25 or BM25+RM3
and zero-shot MSMARCO-trained neural cross-encoder rerankings of the top 100 results.</p>
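      <p>For reference, here is a minimal Python sketch of the reciprocal rank fusion step mentioned above; the constant k = 60 is the conventional default from the original RRF formulation, not necessarily the team’s setting:
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, depth=1000):
    # Each document scores sum(1 / (k + rank)) over the rankings in which it
    # appears; documents with higher fused scores are ranked first.
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking[:depth], start=1):
            fused[docid] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# e.g., merging a BM25 ranking with a dense-retrieval ranking:
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d3", "d1", "d4"]]))
      </p>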
      <p>All participants who submitted runs also submitted system description papers to the Working Notes
volume [24]. Two teams from the same university (alecs and kamps) submitted a single joint report, as
did teams cryptix and sarath_kumar, resulting in a total of seven Working Notes from the participants
of Task 1. Despite the requirement to include the team ID in the run name, participants’ submissions
often differed in their run names, registration details, and Codabench IDs. We manually matched the
Working Notes with the submitted runs and report the results using the team names provided in those
submissions.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Test data</title>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Training data</title>
        <p>As in previous years, runs were submitted for both the training and test datasets in order to analyse
potential overfitting and related efects. Tables 6 and 7 report the Task 1 results on the training data for
English and Portuguese, respectively.</p>
        <p>For the English subtask, we observe that the top four runs according to test-set MAP (0.33 to 0.35) and
NDCG@5 (0.56 to 0.61), submitted by pjmathematician [14], remain at the top of the table, with similarly
close values of MAP (0.44 to 0.48) and NDCG@5 (0.53 to 0.68) on the training data. Both Rasion_SenTransF+Roberta runs [15]
have better scores on the training data than the University of Amsterdam’s runs, with the best-scoring runs
achieving MAP = 0.59 and NDCG@5 = 0.61. The run UAms_RM3RoBERTa_drop60 [12] shows similar
performance in terms of MAP on the training and test sets, but with an improvement in NDCG@5
on the training data. However, many runs achieved higher scores on the training set, which lowered
the ranking of the University of Amsterdam’s run. The two Rasion_SenTransF+Roberta runs [15] and
the Cryptix_SBERT run achieved more than double the MAP and at least a 50% improvement in terms
of NDCG@5 on the training data compared to the test data, which might be a result of overfitting.
The RM3 and BM25 runs from the University of Amsterdam [12] remain mid-ranked, showing stable
scores without signs of overfitting, suggesting similar properties between the training and test data and
confirming their strength as baselines. This also suggests that the improvement of other approaches
may be attributed to their quality rather than differences between this year’s data and the 2024 test
collection. Cross-encoders performed poorly on both the training and test sets, likely because they are
not designed to detect humour.</p>
        <p>Teams rasion [15] and pjmathematician [14] submitted the highest-scoring runs on the English
collections and also achieved the best results on the Portuguese training and test collections. Note that
they achieved better results on the training data than on the official test data. They are followed by the
BM25 run from the University of Amsterdam [12], which ranked fifth on both the Portuguese training
and test collections, showing a 2–3 times drop in MAP and a 4–10 times drop in NDCG@5. However,
the high ranking of BM25 may be partly due to the fact that the Portuguese subtask had roughly half as
many runs as the English subtask. Note that on the test sets, the drop in terms of MAP and NDCG@5 is
even higher. Cross-encoder scores remain low and stable across the test and training collections, as for English.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>
        In this paper, we have presented an overview and discussion of the results of Task 1 of the JOKER-2025
challenge on the retrieval of humorous texts relevant to a search query. Based on the data for wordplay
detection and interpretation previously constructed within the CLEF JOKER track [
        <xref ref-type="bibr" rid="ref1">1, 26, 20, 3, 21</xref>
        ], we
constructed a unique reusable test collection for wordplay retrieval in English. We manually created new
jokes to avoid potential LLM data contamination. To prevent the task from being reduced to generated
text detection, both positive and negative examples comprised a combination of human-written and
machine-generated texts. The English collection consists of 77,658 documents, of which 5,198 are
annotated as humorous, and 219 queries with relevance judgments.
      </p>
      <p>In addition to this, this year we also expanded the dataset with Portuguese data collected from
Portuguese-language websites, translated from the English corpus, and generated by ChatGPT (4o-mini
model). The Portuguese collection contains 98 queries and 45,126 texts, including 1,199 humorous
documents.</p>
      <p>This year, the track setup was updated, with submissions managed through Codabench. Nine teams
submitted 41 runs for the English subtask, of which five also submitted 21 runs for the Portuguese
subtask, resulting in 62 valid distinct runs in total.</p>
      <p>The teams applied diverse methods, ranging from traditional rankers such as TF–IDF,
BM25, and RM3, and cross-encoders with and without filtering, to more modern approaches, including fine-tuned
transformers and LLMs. The best results for both English and Portuguese were achieved by the team
pjmathematician [14], which applied Qwen models for retrieval and filtering, and the team rasion [15],
which applied dense retrieval and transformer-based detection of humorous texts. These results might
testify to AI progress in pun detection. Further analysis is needed to assess the impact of potential LLM
data contamination on this performance.</p>
      <p>This year’s English task showed remarkable progress, with the best run by team pjmathematician
achieving a MAP of 0.3501 – nearly triple last year’s top score – and outperforming all competitors by
a wide margin across the various metrics. In contrast, the University of Amsterdam’s cross-encoder
approaches performed substantially worse than their RM3 and BM25 baselines, confirming the
effectiveness of simpler retrieval strategies for this dataset. For the Portuguese subtask, results were more
balanced, with pjmathematician and rasion achieving similar MAP and NDCG@5 scores around 0.4
to 0.5, far ahead of the BM25 baseline. Interestingly, while the Portuguese runs achieved higher MAP
scores, they trailed the English runs in precision and NDCG@5, likely due to the smaller pool of relevant
humorous documents per query. Overall, these findings suggest that while the dataset’s core properties
have remained stable, combining retrieval and filtering remains key to advancing performance.</p>
      <p>The University of Amsterdam’s RM3 and BM25 runs remained stable and reliable baselines, showing
no overfitting and similar performance across test/training and English/Portuguese datasets.
Improvements by other methods likely reflect their quality rather than dataset differences. Cross-encoders
performed poorly, likely due to their unsuitability for humour detection.</p>
      <p>In general, our results confirm that retrieval models are humour-agnostic and that humour detection is
still a challenge for machine learning models and LLMs, despite improvements over last year’s edition.</p>
      <p>
        For more information about the JOKER lab this year, please refer to the overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and the
Working Notes papers for Task 2: Pun Translation [6] and Task 3: Onomastic Wordplay Translation [7].
Visit the JOKER website at https://joker-project.com for any other information related to the track.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has received a government grant managed by the National Research Agency under the
program Investissements d’avenir integrated into France 2030, with the Reference ANR-19-GURE-0001.
It was also financed by National Funds through the Portuguese funding agency FCT through the project
LA/P/0063/2020 (DOI 10.54499/LA/P/0063/2020). Ricardo Campos would also like to acknowledge
project StorySense, with reference 2022.09312.PTDC (DOI 10.54499/2022.09312.PTDC). We thank all
other colleagues and students who participated in data construction, the translation contests, and the
CLEF JOKER track.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and
spelling checking and for paraphrasing and rewording. Further, the authors used Gemini to generate
images. After using these tools/services, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , A.-G. Bosser,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <article-title>Overview of JOKER: Humour in the machine</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          , A.-G. Bosser,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Palma Preciado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 JOKER track: Automatic humour analysis</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), volume
          <volume>14959</volume>
          of Lecture Notes in Computer Science, Springer, Cham,
          <year>2024</year>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>182</lpage>
          . doi:10.1007/978-3-031-71908-0_8.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of JOKER – CLEF-2023 track on automatic wordplay analysis, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), volume 14163 of Lecture Notes in Computer Science, Springer, Cham, 2023, pp. 397–415. doi:10.1007/978-3-031-42448-9_26.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Ermakova, T. Miller, F. Regattin, A.-G. Bosser, C. Borg, É. Mathurin, G. L. Corre, S. Araújo, R. Hannachi, J. Boccou, A. Digue, A. Damoy, B. Jeanjean, Overview of JOKER@CLEF 2022: Automatic wordplay and humour translation workshop, in: A. Barrón-Cedeño, G. D. S. Martino, M. D. Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture Notes in Computer Science, Springer, Cham, 2022, pp. 447–469. doi:10.1007/978-3-031-13643-6_27.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] L. Ermakova, A.-G. Bosser, T. Miller, A. Jatowt, Overview of the CLEF 2024 JOKER Task 1: Humour-aware information retrieval, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. Seco de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), volume 3740 of CEUR Workshop Proceedings, 2024, pp. 1775–1785.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Ermakova, A.-G. Bosser, T. Miller, R. Campos, Overview of the CLEF 2025 JOKER Task 2: Wordplay Translation from English into French, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] L. Ermakova, T. Miller, Y. Naud, A.-G. Bosser, R. Campos, Overview of the CLEF 2025 JOKER Task 3: Onomastic Wordplay Translation, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] D. Gupta, M. Digiovanni, H. Narita, K. Goldberg, Jester 2.0 (demonstration abstract): Collaborative filtering to retrieve jokes, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, Association for Computing Machinery, New York, NY, USA, 1999, p. 333. doi:10.1145/312624.312770.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. Friedland, J. Allan, Joke retrieval: Recognizing the same joke told differently, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 883–892. doi:10.1145/1458082.1458199.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Z. Xu, S. Escalera, A. Pavão, M. Richard, W.-W. Tu, Q. Yao, H. Zhao, I. Guyon, Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform, Patterns 3 (2022). doi:10.1016/j.patter.2022.100543.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. K. P, B. A, S. M, T. S, REC_Cryptix at JOKER CLEF 2025: Teaching Machines to Laugh: Multilingual Humor Detection and Translation, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Kreeft-Libiu, F. Helms, C. Selçuk, J. Bakker, J. Kamps, University of Amsterdam at the CLEF 2025 JOKER Track, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] I. Kuzmin, CLEF 2025 JOKER track: No pun left behind, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. Vachharajani, pjmathematician at the CLEF 2025 JOKER Lab Tasks 1, 2 &amp; 3: A Unified Approach to Humour Retrieval and Translation using the Qwen LLM Family, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] B. Chen, C. Zhong, L. Kong, CLEF 2025 JOKER track: Enhancing humor-aware information retrieval with relevance-aware classification, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] T. Chaudhari, A. Vora, S. Hotha, S. Sonawane, PICT at CLEF 2025 JOKER Task 1: BERT-Enhanced Ensemble Methods, in: [24], 2025.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of JOKER – CLEF-2023 track on automatic wordplay analysis, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, volume 14163, Springer Nature Switzerland, Cham, 2023, pp. 397–415. doi:10.1007/978-3-031-42448-9_26.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] L. Ermakova, A.-G. Bosser, A. Jatowt, T. Miller, The JOKER Corpus: English–French parallel data for multilingual wordplay recognition, in: SIGIR ’23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, 2023, pp. 2796–2806. doi:10.1145/3539618.3591885.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. Ermakova, A.-G. Bosser, T. Miller, A. Jatowt, Overview of the CLEF 2024 JOKER Task 3: Translate puns from English to French, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. Seco de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 1800–1810.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of JOKER 2023 Automatic Wordplay Analysis Task 2 – pun location and interpretation, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, volume 3497 of CEUR Workshop Proceedings, 2023, pp. 1804–1817.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] L. Ermakova, A.-G. Bosser, A. Jatowt, T. Miller, The JOKER Corpus: English–French parallel data for multilingual wordplay recognition, in: SIGIR ’23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, 2023, pp. 2796–2806. doi:10.1145/3539618.3591885.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] H. Sousa, R. Almeida, P. Silvano, I. Cantante, R. Campos, A. Jorge, Enhancing Portuguese variety identification with cross-domain approaches, in: Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence (AAAI’25), volume 39, 2025, pp. 25192–25200.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] J. Palotti, H. Scells, G. Zuccon, TrecTools: An open-source Python library for information retrieval practitioners involved in TREC-like campaigns, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 2019, pp. 1325–1328. doi:10.1145/3331184.3331399.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025: Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] E. Schuurman, M. Cazemier, L. Buijs, J. Kamps, University of Amsterdam at the CLEF 2024 JOKER track, in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR Workshop Proceedings, 2024, pp. 1909–1922.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] L. Ermakova, T. Miller, A.-G. Bosser, V. M. Palma Preciado, G. Sidorov, A. Jatowt, Overview of JOKER 2023 Automatic Wordplay Analysis Task 1 – pun detection, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of CLEF 2023 – Conference and Labs of the Evaluation Forum, volume 3497 of CEUR Workshop Proceedings, 2023, pp. 1785–1803.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>