<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PICT at CLEF 2025 JOKER Track: Humour-Aware Information Retrieval using BERT-Enhanced Ensemble Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tanish Chaudhari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ansh Vora</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanjeev Hotha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sheetal Sonawane</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pune Institute of Computer Technology (PICT)</institution>
          ,
          <addr-line>Pune</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The CLEF 2025 JOKER Task 1 (humour-aware information retrieval in English) focuses on the automatic and efficient retrieval of humorous texts that are relevant to a given query. The retrieval process is made nuanced by the detection of wordplay and the literary devices used for humour, enabling the identification of jokes revolving around the query. For this task, we employed an ensemble pipeline architecture combining traditional text-analytics methods, composite indices, and BERT-based reranking, with configurable weighted scores at each stage, to create a sophisticated information retrieval system.</p>
      </abstract>
      <kwd-group>
        <kwd>Pipeline</kwd>
        <kwd>Ensemble</kwd>
        <kwd>Tokenization</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>BM25</kwd>
        <kwd>RM3</kwd>
        <kwd>ColBERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Information Retrieval (IR) systems have seen considerable progress; we no longer rely on straightforward keyword matching [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In recent times, systems have come to include contextual awareness, understand semantics, and decipher user intent [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ,
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, one domain of information retrieval that has not yet been explored in depth is humour, which forms a significant part of human communication in forms such as sarcasm, irony, wit, and wordplay. These instances are deeply embedded in different kinds of digital content, from social media posts and forum discussions to news articles and creative writing. For traditional IR systems, humour can be a considerable obstacle to accurate understanding of text and, subsequently, effective retrieval. Humour can lead such systems to misinterpret text, fail to satisfy queries, and present information to the user with a general lack of nuance.
      </p>
      <p>The difficulty of effectively retrieving information while considering humour lies in its characteristic ambiguity and its reliance on shared cultural knowledge. In addition, humour is often delivered through subtle linguistic cues that defy the literal interpretation a traditional IR system is built around. As a result, the system may miss the true underlying intent or meaning of a phrase that it has interpreted literally. Conversely, a query looking for humorous phrases might retrieve inappropriate or unfunny results because the system fails to identify and categorize comedic elements. This shortcoming underscores an urgent need for humour-aware information retrieval (HAIR) systems that can identify humour as well as understand its qualities and relevance to a user's query.</p>
      <p>In order to develop effective HAIR systems, we must ensure that models can distinguish complex linguistic patterns, identify nuances in cultural settings, and separate genuine humour from other non-literal language. Initial approaches to humour detection, founded on rule-based systems and traditional machine learning models, had limited success and could not cope with scalability, generalization, and the dynamic nature of humour. The rise of deep learning, particularly advances in transformer-based architectures, has introduced new methods of natural language understanding that help us process complex linguistic phenomena like humour.</p>
      <p>We propose a unique approach to humour-aware information retrieval through a sophisticated ensemble framework. Our pipeline combines the strengths of classical IR methods with modern deep learning models, and includes features that help the system detect and leverage wordplay to enhance performance. We first perform robust indexing and initial retrieval, after which the shortlisted texts are reranked using neural models and humour-specific features. We observed that considerable improvements in IR performance can be achieved by explicitly considering humour and modelling the system around it. This paper discusses the construction of a pipelined IR system that accounts for the literal intent of queries and documents alike, while additionally working with the nuances of humour.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The JOKER corpus [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a large parallel pun dataset used for humour information retrieval [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and translation in English and Portuguese. Another dataset used for similar tasks is the HaHackathon corpus (Meaney et al., 2021) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which contains 10K humour-annotated tweets and is used to establish testbeds for humour intensity labels. These datasets form the foundation for most current humour retrieval systems.
      </p>
      <p>
        Humour recognition and wordplay detection in text generally involve IR re-ranking or filtering, as explored in earlier research at CLEF. Gepalova et al. (2024) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used fine-tuned T5 models to detect single-word puns, while Dsilva (2023) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] used prompt-based LLMs for pun detection and BERT tokenization. Xie et al. (2020) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduce GPT-2-based uncertainty for humour-sensitive reranking signals. These studies showed that transformer-architecture models perform well at identifying textual ambiguity, though neural approaches have varying effectiveness. Schuurman et al. (2024) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] trained separate pun-detecting classifiers that demonstrated greater quality of humour retrieval. They observed that neural rankers perform inferiorly on their own, and that humour detectors can suppress false positives.
      </p>
      <p>
        Beyond lexical matching, another approach to humour retrieval is semantic reranking. Schuurman et al. (2024) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] used a zero-shot MS MARCO encoder to rerank the top 25 hits for humour queries, but it also boosted non-humorous relevance. The University of Split and Malta (2024) used a cross-encoder trained on humour classification to rerank candidate jokes. Annamoradnejad and Zoghi (2024) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] used parallel BERT encoders and dense embedding models for humour retrieval to index jokes by semantics. Gupta et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] demonstrated the use of transformer ensembles for humour and offensiveness detection in text for sensitive systems. Ao et al. (2022) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] used multi-encoder architectures to improve parody detection with composite humour features. Tang et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced the NaughtyFormer, trained on Reddit humour for more nuanced classification of humour subtypes. Al-Omari et al. (2021) [15] showed that transformer ensembles like BERTweet and RoBERTa excel and are very efficient at capturing humour-based IR, but these semantic approaches often struggle with contextual understanding, especially when compared to lexical methods.
      </p>
      <p>
        Ensemble methods are hybrid methods that are viable in the long run, as they improve accuracy by increasing recall. Baguian and Huynh (2024) [16] combined TF-IDF with logistic regression and showed significant improvement, while Arampatzis's group (2024) experimented with over ten different models and architectures, including random forests, LSTMs, and XGBoost; no single model performed as well as the ensemble methods in accuracy. Another method used was query expansion: Gepalova et al. (2024) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] expanded queries via WordNet synonyms to match pun-based language and applied a similarity threshold for detection. UvA (2024) applied relevance feedback (RM3) to BM25 and significantly boosted recall. Zhao et al. (2023) [17] present the RDVI framework, which combined SimCSE-based retrieval and irony detection. A humour-based stacking ensemble built on Berger's typology [18] achieved high F1 scores using SVM and Random Forest classifiers. Thus, ensemble methods that merge lexical matching, semantic features, and humour-specific classifiers perform consistently well at humour retrieval. Fusion architectures can also have a multi-stage pipeline, as seen in Baguian and Huynh's hybrid design. Our approach builds on these findings by combining multiple retrieval strategies in a unified framework.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. System Description</title>
      <p>The proposed BERT-enhanced ensemble IR system is a hybrid system designed to perform efficient and humour-aware retrieval of documents from a sizeable corpus of 77,656 documents.</p>
      <p>The corpus comprises a mix of humorous and non-humorous sentences of various types. The majority
of sentences are definitional and declarative in nature, with some narratives (dialogues or quotes) and a
few outliers in terms of explicit sentence structure. Most non-humorous sentences seem encyclopedic
in the corpus:
• “Bill refers to a proposed law that is presented for discussion and approval in a legislative body,
such as Congress in the United States or Parliament in the United Kingdom.”
• “In the 7th century, the Middle East and North Africa came under caliphal rule with the Arab
conquests.”</p>
      <p>Humour is observed not to be the prevailing theme in the corpus, with only a small fraction of the dataset being humorous: Tom Swifties (jokes involving dialogue and adverbs), puns and double meanings. A few examples of humorous texts involving different sentence structures:
• “The farmer was surprised when his pumpkin won a blue ribbon at the State Fair. He shouted,
‘Oh, my gourd.”’
• “Sellers of dried grapes are always raising awareness.”</p>
      <p>We aimed to achieve this by combining multiple retrieval methods and sophisticated reranking strategies (encoder hybrid: ColBERT with late interaction) [19]. A multi-method ensemble approach is utilized for initial retrieval: instead of relying on a single retrieval model, the pipeline employs multiple models and fuses the strengths of BM25 and TF-IDF to present a variety of initial “relevant” documents. We worked on improving the preprocessing stage of the system, especially focusing on patterns found in humorous texts. Specialized BERT models are used for reranking, contributing to a better overall understanding of semantics and improved performance on HAIR. The architecture of the pipeline is shown in Figure 1.</p>
      <p>The pipeline has seven main phases: Data Loading and Preprocessing, Index Building, Initial Ensemble Retrieval, RM3 Query Expansion, Re-retrieval with Expanded Query (Optional Merge), ColBERT-based Reranking, and finally result formatting and output using weighted ensemble scores, which may be fine-tuned to optimize results.</p>
      <p>Preprocessing is a crucial stage that has a profound influence on the performance of information retrieval systems. Generally, the text preprocessing step involves normalization of documents for feature extraction (lowercasing, stemming, tokenization, etc.) and noise reduction (removal of stop words).</p>
      <p>For our focus on humour-aware information retrieval, we preserve cues that help recognize humour during preprocessing. This requires us to handle contractions (e.g., “’ll”, “n’t”) and possessives (e.g., “’s”) by expanding or retaining them. Some specific stop words may also serve as important flags for wordplay and must be dealt with carefully. For the ensemble pipeline, the data was also prepared in a format suitable for the different models, such as the n-gram TF-IDF and BM25.</p>
      <p>Once all the documents in the corpus have been preprocessed into a format friendly to the diverse range of models, the intermediate stages of the ensemble pipeline build indices (offline) using different types of TF-IDF and BM25 (as depicted in the architecture figure). Based on these indices, the system performs initial retrievals, working with queries for the first time in the process. After initial retrieval, reranking by different approaches captures the various linguistic mechanisms revolving around humour.</p>
      <p>The indices are built offline on the documents before queries are processed. The model creates the TF-IDF Unigram Index for dealing with the significance of individual words, a Bigram Index for capturing common pairings of words to understand phrasing, and the TF-IDF Char N-gram Index (range 3-5) to detect wordplay patterns and phonetic similarities. To effectively recognize humorous intent, especially in the N-gram Index, we use the unstemmed/untokenized document text.</p>
      <p>The pipeline also utilizes Okapi BM25 to create indices on the stemmed/tokenized corpus to aid relevance matching in the subsequent stages and the final ensemble scoring. The BM25 score of a document D with respect to a query Q is calculated as:</p>
      <p>score(D, Q) = Σ_{i=1}^{n} IDF(q_i) · f(q_i, D) · (k_1 + 1) / ( f(q_i, D) + k_1 · (1 − b + b · |D| / avgdl) )   (1)</p>
      <p>with the inverse document frequency of term q_i calculated as:</p>
      <p>IDF(q_i) = log( (N − n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 )   (2)</p>
      <sec id="sec-3-1">
        <title>Where:</title>
        <p>• N is the total number of documents in the corpus.</p>
        <p>• n(q_i) is the number of documents containing term q_i.</p>
        <p>• f(q_i, D) is the frequency of term q_i in document D.</p>
        <p>• |D| is the length of document D in terms (words).</p>
        <p>• avgdl is the average document length in the collection.</p>
        <p>• k_1 and b are hyperparameters, typically k_1 ∈ [1.2, 2.0] and b ∈ [0.5, 0.8].</p>
      </sec>
      <sec id="sec-3-2">
        <title>Initial Retrieval, Query Expansion and Reranking</title>
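        <p>The BM25 formula above can be implemented directly; this is a from-scratch sketch with illustrative parameter defaults, not the pipeline's actual index code:</p>

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """BM25 score of `doc` for `query`; `query` and `doc` are token
    lists, `corpus` a list of token lists. Implements the standard
    Okapi BM25 scoring with the +1-smoothed IDF."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        n_q = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

        <p>A production index precomputes document frequencies and lengths offline; the per-query loop here is only for clarity.</p>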
        <p>Working with the pre-built indices, the model performs initial retrieval for each query at this stage. The query receives documents from each of the indices, and each set of scores from all four methods is normalized (min-max scaling). To merge the normalized scores, the system employs a “Weighted Score Fusion” using predefined ensemble_weights, which in turn produces a list of “Top-k Candidates” for further reranking.</p>
        <p>This concludes the “preparation and initial processing” phase of the corpus. The architecture diagram depicts the flow of the pipeline from this phase to the subsequent processes handling wordplay features, RM3 expansion and reranking.</p>
        <p>Using the initial top-k documents from the previous stage, the system performs Relevance Model 3 (RM3) query expansion. The most frequent terms from the top candidates are extracted after preprocessing and then added to the original query. This enlarges the query's vocabulary so that it can retrieve relevant documents that the original, shorter query might have missed.</p>
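        <p>A simplified, frequency-only sketch of this expansion step follows; full RM3 weights expansion terms by relevance-model probabilities rather than raw counts, and the function name here is our own:</p>

```python
from collections import Counter

def rm3_expand(query_tokens, top_docs, n_terms=5):
    """Add the most frequent non-query terms from the top-ranked
    documents to the query (frequency-only RM3-style expansion)."""
    query_set = set(query_tokens)
    counts = Counter(t for doc in top_docs for t in doc if t not in query_set)
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return query_tokens + expansion
```
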
        <p>The optional re-run comes into play if the query was indeed expanded by RM3; if so, the previous stage of ensemble retrieval is carried out again on the expanded query and the results are stored as “Merged Results”. The system assigns a higher weight to the original results than to the expanded query's results to maintain refinement.</p>
        <p>Following the conditional RM3 query expansion stage in the pipeline, we introduce the BERT model
to work on the refined “Top-k Candidates” to be reranked.</p>
        <p>• The system makes use of ColBERT (Contextualized Late Interaction over BERT), a breakthrough in the sphere of document retriever models that handles the effectiveness-computational cost trade-off particularly well. The query and document are encoded separately into embeddings using a SentenceTransformer model, and cosine similarity scores are calculated between them. Treating the two embeddings as vectors A and B, the cosine similarity is expressed through the dot product and magnitudes as:</p>
        <p>cosine_similarity(A, B) = ( Σ_{i=1}^{n} A_i · B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )   (3)</p>
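        <p>Equation (3) translates directly into code; this is a plain-Python sketch of the similarity computation itself, not the SentenceTransformer-based encoding:</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors: dot product
    divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```
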
      </sec>
      <sec id="sec-3-3">
        <title>Where:</title>
        <p>– A_i and B_i are the i-th components of A and B, respectively.</p>
        <p>The two sets of encodings (query and document) together yield a score indicative of relevance for each query-document pair.
• Another rerank is performed by a Cross-Encoder, which processes the query-document pairs and outputs a fine-grained relevance score that takes context into consideration.
• Wordplay features such as text length, word count, quotes, exclamation/question marks, dialogue indicators, unique words, and their repetitions and ratio are also considered. Alliteration also contributes to these features, which together make up a “Wordplay Score”.</p>
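        <p>The wordplay features can be sketched as a simple feature extractor; the exact feature set, names, and the alliteration heuristic below are illustrative assumptions, not the pipeline's definitions:</p>

```python
import re

def wordplay_features(text):
    """Surface cues used alongside the neural scores (illustrative)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {}
    first_letters = [w[0] for w in words]
    # crude alliteration signal: share of the most common initial letter
    alliteration = max(first_letters.count(c) for c in set(first_letters)) / len(words)
    return {
        "length": len(text),
        "word_count": len(words),
        "has_quotes": int('"' in text),
        "exclaim_question": text.count("!") + text.count("?"),
        "unique_ratio": len(set(words)) / len(words),
        "alliteration": alliteration,
    }

feats = wordplay_features('He shouted, "Oh, my gourd!"')
```

        <p>In practice these features would be combined into a single “Wordplay Score” via the same weighted-fusion mechanism used elsewhere in the pipeline.</p>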
        <p>Throughout the IR system, multiple methods contribute final weighted scores that summarize each stage's processing. In order to utilize each method's strengths, the system associates a weight with each method, distributing influence over the final ensemble/composite score among the diverse methods.</p>
        <p>In the initial ensemble retrieval stage, the various indices contribute to an initial retrieval score, which is then passed to the RM3 stage, where the query is expanded whenever applicable. The updated retrieval score is again combined with the original in a weighted fashion to maintain refinement and avoid the complete dominance of one stage's output over another. This pragmatically protects the system from the collapses associated with a single method's drawbacks, while incorporating the strengths and biases of each method in a controlled manner.</p>
        <p>Finally, even the reranking methods had weighted scores, as they covered different aspects of information retrieval with the aim of capturing humour as well as maintaining relevance between the query and the documents. The final composite score merged not only semantics gathered from dense embeddings, but also fine-grained relevance (further reinforced by Cross-Encoder reranking) and explicit humour indicators.</p>
        <p>The weights can be fine-tuned further to optimize performance and gain better results.</p>
        <p>The composite scores and ranks observed on the train qrels, measured against the performance metrics suggested by the lab organizers, served as the primary method of evaluating and selecting the methods and weights for the ensemble scores at each stage of the IR system.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Metrics Used</title>
      <sec id="sec-4-1">
        <title>1. Mean Average Precision (MAP)</title>
        <p>MAP is a key evaluation metric for ranked retrieval systems. It calculates the average precision over the relevant documents for a query and then averages this value over all queries, rewarding systems that rank relevant documents early. A higher MAP indicates better overall precision across queries.</p>
        <p>MAP = (1 / |Q|) Σ_{q=1}^{|Q|} ( (1 / |R_q|) Σ_{k=1}^{n} P(k) · rel(k) )   (4)</p>
        <p>where |Q| is the number of queries, R_q is the set of relevant documents for query q, P(k) is the precision at rank k, and rel(k) is 1 if the document at rank k is relevant.</p>
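        <p>MAP can be computed per query and then averaged; a minimal sketch with our own helper names:</p>

```python
def average_precision(ranked, relevant):
    """AP for one query: precision at each relevant hit, averaged
    over the number of known relevant documents."""
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k          # P(k) * rel(k)
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of per-query AP over (ranking, relevant-set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```
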
      </sec>
      <sec id="sec-4-2">
        <title>2. Geometric Mean Average Precision (GMAP)</title>
        <p>GMAP computes the geometric mean of the average precision over all queries. It penalizes low scores more heavily than MAP:</p>
        <p>GMAP = ( Π_{q=1}^{|Q|} AP(q) )^{1/|Q|}   (5)</p>
      </sec>
      <sec id="sec-4-3">
        <title>3. Binary Preference (BPref)</title>
        <p>Binary Preference measures how often relevant documents are ranked higher than non-relevant ones. It is defined as:</p>
        <p>BPref = (1 / R) Σ_{r ∈ R} ( 1 − |n ranked higher than r| / R )   (6)</p>
        <p>where R is the number of relevant documents, r ranges over the retrieved relevant documents, and n over retrieved non-relevant documents. A BPref of 0.0 here suggests incomplete judgments or relevance annotations incompatible with this metric.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4. Precision at  (P@k)</title>
        <p>Precision at rank k is the fraction of relevant documents among the top k retrieved:</p>
        <p>P@k = |relevant documents in top k| / k   (7)</p>
      </sec>
      <sec id="sec-4-5">
        <title>5. R-Precision (Rprec)</title>
        <p>R-Precision is the precision at the R-th rank, where R is the total number of relevant documents for a query:</p>
        <p>Rprec = |relevant documents in top R| / R   (8)</p>
      </sec>
      <sec id="sec-4-6">
        <title>6. Mean Reciprocal Rank (MRR)</title>
        <p>MRR measures the inverse of the rank at which the first relevant document is found:</p>
        <p>MRR = (1 / |Q|) Σ_{q=1}^{|Q|} 1 / rank_q   (9)</p>
        <p>where rank_q is the rank position of the first relevant document for query q.</p>
      </sec>
      <sec id="sec-4-7">
        <title>7. Normalized Discounted Cumulative Gain (NDCG@k)</title>
        <p>NDCG measures the usefulness of documents based on their position in the result list, with gains discounted logarithmically:</p>
        <p>DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log2(i + 1)   (10)</p>
        <p>NDCG@k = DCG@k / IDCG@k   (11)</p>
        <p>where rel_i is the graded relevance of the document at position i and IDCG@k is the ideal DCG for the top k documents. NDCG thus measures ranking quality through graded relevance and position sensitivity. We report NDCG@5, NDCG@10, NDCG@15, NDCG@20, NDCG@30, NDCG@100, NDCG@200, NDCG@500, and NDCG@1000. These metrics indicate effectiveness over increasing retrieval depths.</p>
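        <p>The DCG and NDCG computation above can be sketched directly (our own helper names):</p>

```python
import math

def dcg_at_k(rels, k):
    """DCG@k with gains (2^rel - 1) discounted by log2(rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """NDCG@k: DCG of the ranking divided by the ideal DCG, obtained
    by sorting the graded relevances in descending order."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```
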
      </sec>
      <sec id="sec-4-8">
        <title>8. Retrieval Statistics</title>
        <p>We also track standard retrieval statistics for reporting purposes:
• num_ret: Total number of documents retrieved.
• num_rel: Number of known relevant documents.
• num_rel_ret: Number of relevant documents retrieved.
• num_q: Total number of queries used for evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Task 1: Humour-Aware Information Retrieval</title>
        <p>We evaluated our BERT-enhanced ensemble pipeline using the train qrels, while blind tests were carried out against the test qrels available on Codabench. Over multiple runs, we came to understand the importance of limiting the diversity of methods for a specific purpose in order to achieve maximum performance.</p>
        <p>A high weight for, or absolute reliance on, a single method often led to detrimental changes in performance. This led us to a composite scoring system that captures the strengths of each method while avoiding complete bias toward any individual method. The run submissions and their analysis in comparison to the overall JOKER laboratory are also documented [20].</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Result Analysis</title>
        <p>RM3 allowed us to fetch more relevant documents that might not have been linked to the original query
term by expanding the query’s representation. This enabled us to access a larger volume of candidates
for ranking. Subsequent re-retrieval helped us refine the candidate set.</p>
        <p>The ColBERT reranking improved the performance of the pipeline slightly; more fine-tuning is needed to deal with the sensitive nature of the texts. The late interaction in ColBERT pushed humorous texts up the rankings: for the first query, “change”, the humorous text “I wanted change, but all I got was coins” (docid 38) ranked 35 after reranking but 134 without it.</p>
        <p>Another query, “deal”, had a corresponding humorous sentence, “She was only a Coal dealer's daughter, but, oh, where she had bin.”, which was found by the reranking pipeline (albeit with a low score) but was absent without BERT.</p>
        <p>• MAP remained around 0.14. It was highly sensitive to the ranking of relevant documents and heavily penalized highly relevant documents that appeared lower in the ranking. This observation was reinforced by the significant deviations caused by changes in the scores' weights.
• GMAP was lower than MAP, suggesting observable variability in performance depending on the query. Contextual understanding, as well as cultural knowledge (understanding connections between different words in a certain context to form a joke), could improve this metric to accommodate a wider variety of queries. For example, “chemistry” and “reaction” in “I tried to make a chemistry joke, but there was no reaction.”
• An MRR of 0.3369 indicated that, on average, at least one relevant document was placed high in the search results; the upper end of the results performed fairly well.
• An R-Precision of 0.1629 fell short of standards, with the model selecting a large number of irrelevant documents before it could meet the threshold for relevant documents.
• Precision showed an uptick from P@5 to P@10 but dropped at higher K values. The system was relatively successful in assessing relevance within the top results but declined further down the rankings. Compared to the scores of other participants, the model struggled to emphasize humorous relevant documents, opting to weigh relevance heavily when ranking sentences. A higher bias/weight for humour-specific features would improve the pipeline in this regard.
• NDCG values grew steadily with K; the system was able to find relevant documents that contributed positively to cumulative gain. Most of the relevant documents were found by K=500, as shown by the plateau between K=500 and K=1000.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The results of our research are promising and show the success of ensemble methods in the field of humour detection. Our BERT-enhanced ensemble system achieved a MAP of 0.14 and an MRR of 0.3369, and ranked among the top scores in the track. These scores indicate a strong performance in placing relevant documents in the top results. The GMAP score is somewhat low, suggesting a high degree of query-dependent variability. The ensemble method, with its multistage architecture, leveraged the strengths of TF-IDF, BM25, and ColBERT at various levels, along with n-gram indexing for effective detection of wordplay patterns.</p>
      <p>The greatest strength of our proposed method lies in its ability to capture diversity in humour, as indicated by the weighted score fusion that prevents over-reliance on any single method in the architecture. At the same time, our model also highlights the challenges of humour understanding: cultural specificity, contextual dependency, and subjective interpretation. These challenges extend beyond the confines of traditional IR, and while our model demonstrates feasible results, the scores also highlight the need for humour-specific language models.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used eraser.io to generate the image in Figure 1. After using this tool/service, the author(s) reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
      <p>Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, 2022, pp. 673–679. URL: https://aclanthology.org/2022.semeval-1.91/.</p>
      <p>[15] H. Al-Omari, I. AbedulNabi, R. Duwairi, DLJUST at SemEval-2021 Task 7: Linking humor and offense using transformer ensembles, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics, 2021, pp. 1062–1068.</p>
      <p>[16] H. Baguian, N. A. Huynh, JOKER track @ CLEF 2024: The Jokesters' approaches for retrieving, classifying, and translating wordplay, in: Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS, Vol. 3740, 2024, pp. 1811–1817.</p>
      <p>[17] T. Zhao, et al., RDVI: A retrieval-detection framework for verbal irony detection, Electronics 12 (2023) 4830. doi:10.3390/electronics12234830.</p>
      <p>[18] J. Bielaniewicz, P. Kazienko, An automatic humor identification model with novel features from Berger's typology and ensemble models, Decision Analytics Journal 11 (2024) 100450. doi:10.1016/j.dajour.2024.100450.</p>
      <p>[19] R. Jha, B. Wang, M. Günther, G. Mastrapas, S. Sturua, I. Mohr, A. Koukounas, M. K. Akram, N. Wang, H. Xiao, Jina-ColBERT-v2: A general-purpose multilingual late interaction retriever, 2024. URL: https://arxiv.org/abs/2408.16672. arXiv:2408.16672.</p>
      <p>[20] L. Ermakova, Overview of the CLEF 2025 JOKER Task 1: Humour-aware information retrieval, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2025), CEUR Workshop Proceedings, CEUR-WS.org, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Hambarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proença</surname>
          </string-name>
          , Information retrieval:
          <article-title>Recent advances and beyond</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>76581</fpage>
          -
          <lpage>76604</lpage>
          . doi:10.1109/ACCESS.2023.3295776.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Merrouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ouhbi</surname>
          </string-name>
          ,
          <article-title>Toward contextual information retrieval: A review and trends</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>148</volume>
          (
          <year>2019</year>
          )
          <fpage>191</fpage>
          -
          <lpage>200</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1877050919300365. doi:10.1016/j.procs.2019.01.036.
          <source>The Second International Conference on Intelligent Computing in Data Sciences, ICDS2018</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kamil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Çakır</surname>
          </string-name>
          ,
          <article-title>Advances in transformer-based semantic search: Techniques, benchmarks, and future directions</article-title>
          ,
          <source>Turkish Journal of Mathematics and Computer Science</source>
          <volume>17</volume>
          (
          <year>2025</year>
          )
          <fpage>145</fpage>
          -
          <lpage>166</lpage>
          . doi:10.47000/tjmcs.1633092.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Bosser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <article-title>CLEF 2025 JOKER lab: Humour in the machine</article-title>
          , in: C. Hauff, C. Macdonald, D. Jannach, G. Kazai, F. M. Nardini, F. Pinelli, F. Silvestri, N. Tonellotto (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>389</fpage>
          -
          <lpage>397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Bosser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>The JOKER corpus: English-French parallel data for multilingual wordplay recognition</article-title>
          , in:
          <source>Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>2796</fpage>
          -
          <lpage>2806</lpage>
          . URL: https://doi.org/10.1145/3539618.3591885. doi:10.1145/3539618.3591885.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Meaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Magdy</surname>
          </string-name>
          ,
          <article-title>SemEval 2021 Task 7: HaHackathon - detecting and rating humor and offense</article-title>
          , in:
          <source>Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)</source>
          , Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gepalova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chifu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fournier</surname>
          </string-name>
          ,
          <article-title>CLEF 2024 JOKER Task 1: Exploring pun detection using the T5 transformer model</article-title>
          , in:
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          , volume
          <volume>3740</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Dsilva</surname>
          </string-name>
          ,
          <article-title>Augmenting large language models with humor theory to understand puns</article-title>
          ,
          <source>Master's thesis</source>
          , Purdue University,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <article-title>Uncertainty and surprisal jointly deliver the punchline: Exploiting incongruity-based features for humor recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2011.01120</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Schuurman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cazemier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Buijs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>University of Amsterdam at the CLEF 2024 JOKER track</article-title>
          , in:
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          , volume
          <volume>3740</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Annamoradnejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zoghi</surname>
          </string-name>
          ,
          <article-title>ColBERT: Using BERT sentence embedding in parallel neural networks for computational humor</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>249</volume>
          (
          <year>2024</year>
          ). doi:10.1016/j.eswa.2024.123685.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Sheikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <article-title>Identifying offensive and humorous posts using fine-tuned transformer ensembles</article-title>
          , in:
          <source>Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum</source>
          , volume
          <volume>2936</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-2936/paper-81.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sánchez Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Preoţiuc-Pietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <article-title>Combining humor and sarcasm for improving political parody detection</article-title>
          ,
          <source>arXiv preprint arXiv:2205.05505</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <article-title>NaughtyFormer at SemEval-2022 Task 6: Transformer with fine-grained classification head for humor detection</article-title>
          , in:
          <source>Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>