<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DUTH at CLEF JOKER 2025 Tasks 2 and 3: Translating Puns and Proper Names with Neural Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgios Arampatzis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Avi Arampatzis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Democritus University of Thrace, Department of Electrical and Computer Engineering</institution>
          ,
          <addr-line>Xanthi</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the participation of Team DUTH (Democritus University of Thrace) in the CLEF 2025 JOKER shared tasks on computational humor, specifically focusing on the translation of puns (Task 2) and onomastic wordplay (Task 3) from English to French. These tasks pose significant challenges for neural machine translation (NMT) systems due to semantic ambiguity, linguistic creativity, and cultural specificity. For Task 2, we employed a hybrid architecture that combines multiple NMT systems with a manually curated fallback lexicon. This approach yielded a BLEU score of 41.11 and a BERTScore F1 of 86.96, demonstrating improved robustness and humor preservation compared to neural-only baselines. In Task 3, we addressed the translation of fictional and culturally marked character names using multilingual generative models integrated with rule-based dictionaries and a tiered fallback strategy. Evaluation included both automatic metrics and expert human assessments, revealing key insights into the limits of reference-based evaluation in creative translation tasks. Our findings suggest that hybrid NMT systems enriched with linguistic insight provide tangible benefits for humor-aware translation. We conclude by outlining directions for future work, including dynamic fallback strategies, humor-sensitive evaluation, and culturally adaptive generation techniques.</p>
      </abstract>
      <kwd-group>
<kwd>Wordplay Translation</kwd>
        <kwd>Humor in NLP</kwd>
        <kwd>Neural Machine Translation</kwd>
        <kwd>Fallback Mechanisms</kwd>
        <kwd>Onomastics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The translation of wordplay represents a longstanding challenge in both computational linguistics and
translation studies. Puns and onomastic humor exploit linguistic ambiguity, phonological similarity,
cultural references, and multi-layered meanings—features that often resist straightforward cross-linguistic
transfer [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Within machine translation (MT), these characteristics demand models with refined
semantic, pragmatic, and cultural sensitivity.
      </p>
      <p>
        The CLEF 2025 JOKER shared tasks serve as a benchmark for evaluating multilingual humor
processing systems. This edition comprises two subtasks: Task 2, focusing on the translation of English puns
into French, and Task 3, addressing the translation of culturally and phonetically marked character
names (onomastic wordplay) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Task 2 emphasizes the preservation of ambiguity and humorous effect,
while Task 3 involves creative name generation that resonates across cultural and linguistic boundaries.
      </p>
      <p>
        As outlined in the official lab overview [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the 2025 JOKER campaign encourages approaches that
move beyond standard translation pipelines and explore linguistically-informed, culturally-aware
solutions. It further highlights the inadequacy of conventional evaluation metrics—such as BLEU—for
capturing humor-related phenomena, recommending hybrid evaluation schemes that integrate human
judgment [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this context, our team from Democritus University of Thrace (DUTH) participated in both tasks
of the 2025 edition, investigating hybrid translation workflows that combine multilingual neural
machine translation (NMT) models with manually curated fallback strategies. Our previous work on
multilingual affective tasks has shown promising results using ensemble methods [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], suggesting that
hybrid architectures can improve performance in nuanced language processing. We hypothesize that
such hybrid systems can more effectively manage the ambiguity, creativity, and cultural specificity
inherent to humor. Our hybrid translation framework aligns with prior approaches that incorporate
discrete lexical knowledge into NMT pipelines to better handle rare or culturally marked terms, as in
Arthur et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], who used lexical probabilities derived from external resources to guide the translation
process.
      </p>
      <p>This work contributes to the growing field of computational humor within NLP, where tasks such as
pun translation and humorous name adaptation pose unique challenges. Standard MT systems often
fail to preserve semantic content alongside stylistic and cultural nuances, motivating the need for
workflows that blend neural generation with human-informed mechanisms.</p>
      <p>In Section 2, we describe our hybrid translation framework, outlining the datasets, multilingual
models, and fallback mechanisms used for Tasks 2 and 3. Section 3 presents the experimental results
obtained through both automatic and manual evaluation metrics, highlighting model performance and
limitations. Finally, Section 4 offers concluding remarks and discusses directions for future research in
humor-aware machine translation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <sec id="sec-2-1">
        <title>2.1. Task 2: Translation of Puns</title>
        <p>The translation of puns presents a unique challenge in cross-linguistic humor processing, as it requires
not only the preservation of semantic content but also the recreation of linguistic ambiguity, wordplay,
or cultural references in the target language. Unlike standard translation tasks, pun translation often
involves creative reformulation to maintain both the humorous effect and the intended meaning
(Delabastita, 1996). In many cases, literal translation is insufficient, prompting translators—or models—to
opt for adaptation strategies such as compensation, modulation, or substitution. Evaluating pun
translations thus requires attention to both lexical accuracy and pragmatic impact, particularly when
multilingual humor is involved.</p>
        <sec id="sec-2-1-0">
          <title>2.1.1. Dataset</title>
          <p>The dataset used for Task 2 (Wordplay Translation) is composed of a training and a test partition, each
exhibiting distinct characteristics. The training set contains 5,838 English–French sentence pairs derived
from 1,405 unique English puns. Each pun is accompanied by multiple human-produced translations,
providing lexical and stylistic diversity that supports robust model training and evaluation. In contrast,
the test set includes 4,537 unique English puns, without any reference translations, and is designed
exclusively for blind evaluation. A summary of the dataset composition is provided in Table 1.</p>
          <p>Table 2 presents a comparative overview of the datasets used exclusively for Task 2. The training set
consists of 1,405 unique English puns, each accompanied by multiple French translations, resulting in a
total of 5,838 entries. These multiple reference translations provide valuable linguistic diversity and are
essential for training and evaluating models on complex tasks such as pun translation.</p>
          <p>On the other hand, the test set includes 4,537 unique English puns without any associated reference
translations. This reflects the nature of the evaluation phase, where participants’ translations are
assessed independently against withheld references, either through human judgments or automated
evaluation metrics. The test set thus serves as a blind evaluation benchmark.</p>
        </sec>
        <sec id="sec-2-1-1">
          <title>2.1.2. Models Used</title>
          <p>Our translation system incorporated a diverse suite of neural machine translation (NMT) models to
ensure stylistic coverage, robustness, and fallback capability. The following models were used:</p>
          <p>
            We included Google Translate, a commercial-grade NMT engine commonly used as a
general-purpose baseline in multilingual NLP tasks [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. In addition, we integrated Argos Translate, an
open-source, offline translation system that supports lightweight and customizable deployment [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
          <p>
            From the Helsinki-NLP/OPUS-MT family, we used the standard opus-mt-en-fr model, trained
on multilingual OPUS corpora [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. This model provides robust translation for multiple domains and
maintains compatibility with low-resource language scenarios.
          </p>
          <p>
            We employed MBART50, a multilingual encoder-decoder model trained on 50 languages, supporting
both supervised and zero-shot translation [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. We also used M2M100 in two configurations, including
the 1.2B parameter variant, which enables direct source-to-target multilingual translation without
pivoting through English [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. To further explore system diversity, we created a combined M2M100
configuration that integrates outputs from multiple M2M-based systems.
          </p>
          <p>
            Additionally, we utilized T5-base, a sequence-to-sequence transformer trained under the text-to-text
paradigm, enabling translation to be framed as a generative task [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. In select experiments, we tested
its instruction-tuned variant, FLAN-T5, as part of a fallback mechanism (google_flant5_fallback) to
evaluate its responsiveness to stylistic prompts in translation [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>
            To explore multilingual generative modeling, we experimented with BLOOMZ-3B, a multitask
instruction-tuned large language model (LLM) designed for zero- and few-shot transfer across both
languages and tasks [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
          </p>
          <p>Finally, we incorporated a custom hybrid_fusion configuration that combines multiple translation
engines into a single output through a fusion strategy, aiming to maximize lexical coverage and stylistic
adequacy across diverse inputs.</p>
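<p>The exact fusion rule behind hybrid_fusion is not spelled out above; as one plausible sketch under that caveat, a majority vote over engine outputs with a length tie-break (the function name fuse_translations is hypothetical) could look like this:</p>

```python
from collections import Counter

def fuse_translations(candidates):
    """Toy fusion strategy: pick the candidate produced by the most
    engines, breaking ties toward the longest output as a rough proxy
    for lexical coverage. Illustrative only, not the paper's actual rule."""
    counts = Counter(candidates)
    return max(counts, key=lambda c: (counts[c], len(c)))
```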
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.3. Methodology</title>
          <p>
            To address the challenges of pun translation in Task 2, we adopted a hybrid methodology in which
manually curated translations were prioritized, and neural machine translation (NMT) systems served
as a fallback mechanism. We integrated a diverse suite of translation engines, including Google
Translate, Argos Translate, the Helsinki-NLP/opus-mt-en-fr model [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], Facebook’s M2M100
(418M and 1.2B) [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], MBART50 [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], and T5-base [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. We also evaluated its instruction-tuned
variant, FLAN-T5, as part of a fallback mechanism (google_flant5_fallback) [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. Additionally, we
explored two composite configurations: hybrid_fusion, which combines multiple systems to maximize
lexical coverage and stylistic adequacy, and combined_m2m100, which integrates outputs from
different M2M-based models.
          </p>
          <p>Each system was embedded within a unified two-stage translation pipeline:
1. A static fallback dictionary containing manually crafted French translations for a selected subset
of semantically and stylistically complex English puns.
2. An NMT engine used for all remaining inputs.</p>
          <p>All manual translations were created and reviewed by the authors, with emphasis on preserving both
semantic fidelity and humorous effect.</p>
          <p>At inference time, the system first checked whether the input pun matched an entry in the fallback
dictionary. If so, the corresponding manual translation was used directly. Otherwise, the input was
processed by the designated MT model. This approach ensured consistent handling of difficult cases
and improved overall robustness.</p>
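<p>The two-stage pipeline above can be sketched as follows; the lexicon entry and the NMT stub are illustrative stand-ins, not the actual curated data or engines:</p>

```python
# Stage 1: manually curated fallback lexicon (hypothetical example entry).
FALLBACK_LEXICON = {
    "Time flies like an arrow.": "Le temps file comme une fleche.",
}

def translate_with_nmt(text):
    # Stand-in for a real engine call (Google Translate, OPUS-MT, ...).
    return "[NMT] " + text

def translate_pun(text):
    """Stage 1: exact-match lookup in the manual lexicon;
    stage 2: neural translation for everything else."""
    manual = FALLBACK_LEXICON.get(text)
    return manual if manual is not None else translate_with_nmt(text)
```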
          <p>
            Our hybrid translation framework aligns with prior approaches that incorporate discrete lexical
knowledge into NMT pipelines to better handle rare or culturally marked terms, as in Arthur et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ],
who used lexical probabilities derived from external resources to guide the translation process.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 3: Onomastic Wordplay Translation</title>
        <sec id="sec-2-2-0">
          <title>2.2.1. Dataset</title>
          <p>The dataset used in Task 3 centers around the translation of fictional, humorous, and culturally marked
character names from English to French. It is divided into a training and a test partition. The training
set includes 353 entries, each consisting of an English name, a short descriptive context, and a
human-produced French translation. These annotated pairs serve as the basis for training or guiding generative
systems. The test set comprises 2,696 English entries without accompanying French references, reflecting
the task’s creative nature and supporting blind evaluation. Table 3 presents the distribution of entries
across the two sets.</p>
          <p>To assess the linguistic and creative complexity of the character names in the Task 3 training set,
we conducted a pattern-based classification. Each name was manually or semi-automatically assigned
to one of four categories: (i) Alliteration (e.g., names with repeated initial sounds), (ii) Wordplay (e.g.,
names based on puns, homophones, or semantic ambiguity), (iii) Realistic names with no humorous
intent or transformation, and (iv) Unclassified names that either defy clear categorization or require
deeper cultural or contextual interpretation. Table 4 presents the approximate distribution of these
categories within the training set.</p>
        </sec>
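<p>A minimal heuristic along the lines of this four-way classification might look as below; the wordplay marker list and the function name are hypothetical, since the actual assignment was manual or semi-automatic:</p>

```python
def classify_name(name, wordplay_markers=("fix", "pun", "snap")):
    """Heuristic sketch of the four-way name classification:
    alliteration from repeated initial letters, wordplay from a
    (hypothetical) marker list, otherwise realistic or unclassified."""
    words = [w for w in name.split() if w]
    initials = {w[0].lower() for w in words}
    if len(words) >= 2 and len(initials) == 1:
        return "alliteration"
    if any(m in name.lower() for m in wordplay_markers):
        return "wordplay"
    if all(w.isalpha() and w[0].isupper() for w in words):
        return "realistic"
    return "unclassified"
```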
        <sec id="sec-2-2-1">
          <title>2.2.2. Models Used</title>
          <p>
            To support onomastic translation in Task 3, we employed a diverse set of multilingual and generative
models with different but complementary capabilities in lexical control, generative capacity, and phonetic
variation:
• Helsinki-NLP/opus-mt-en-fr and opus-mt-tc-big-en-fr: Transformer-based MarianMT
models trained on OPUS corpora, offering robust translation across many language pairs [16].
• facebook/nllb-200-distilled-600M and facebook/nllb-200-1.3B: Large-scale multilingual
models trained on over 200 languages, developed by Meta to support low-resource translation
via self-supervised dense representation learning [17].
• facebook/m2m100_418M and facebook/m2m100_1.2B: Multilingual models by Meta AI
that support direct translation between 100+ language pairs without pivoting through English,
improving semantic alignment across typologically diverse languages [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
• T5-base and T5-small: Models from the Text-to-Text Transfer Transformer family [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], treating
all NLP tasks as text generation problems, and offering flexible generative capabilities suited for
creative name adaptation.
• MarianMT_BLOOM: A custom configuration combining the MarianMT architecture with
BLOOM embeddings to support high-variance translation behavior.
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.3. Methodology</title>
          <p>We used the official datasets and evaluation procedures provided by the organizers of Tasks 2 and
3 [18, 19]. Task 3 addressed the challenge of translating fictional character names and stylized proper
nouns from English into French, a process demanding not only semantic accuracy but also sensitivity
to cultural context and humorous intent. To this end, we developed a hybrid translation framework that
integrates several multilingual pretrained neural machine translation (NMT) models with a manually
curated dictionary. This approach seeks to combine the broad generalization capacity of large-scale
language models with human-informed linguistic and cultural insights.</p>
          <p>
            The translation system utilized a diverse array of pretrained NMT models sourced from
the Hugging Face Transformers library. Specifically, we employed the following models:
facebook/m2m100_1.2B and facebook/m2m100_418M [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], facebook/nllb-200-1.3B
and facebook/nllb-200-distilled-600M [17], t5-small and t5-base [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ],
Helsinki-NLP/opus-mt-en-fr and Helsinki-NLP/opus-mt-tc-big-en-fr [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], as well
as MarianMT_BLOOM. All models were used with their default pretrained weights and standard
inference settings.
          </p>
          <p>For models based on the T5 architecture, we adopted task-specific prompting using the format
translate English to French: &lt;name&gt;. For the M2M100 and NLLB-200 families, we
explicitly specified source and target language codes during tokenization (en → fr for M2M100,
eng_Latn → fra_Latn for NLLB-200). Translations were generated using greedy decoding via the
model.generate(...) function, with a maximum token length capped at 128.</p>
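<p>The per-family input preparation described above can be sketched as a small dispatcher; the function name build_request is illustrative, and the actual decoding step is model.generate(..., max_length=128) with greedy search:</p>

```python
def build_request(family, name):
    """Shapes the input per model family: T5 uses a task prefix,
    M2M100 bare ISO codes, NLLB-200 script-tagged FLORES codes."""
    if family == "t5":
        return {"text": "translate English to French: " + name}
    if family == "m2m100":
        return {"text": name, "src_lang": "en", "tgt_lang": "fr"}
    if family == "nllb":
        return {"text": name, "src_lang": "eng_Latn", "tgt_lang": "fra_Latn"}
    raise ValueError("unknown model family: " + family)
```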
          <p>The translation pipeline implemented a structured three-stage fallback strategy. First, if the input
string matched an entry in a curated bilingual dictionary of proper names, the corresponding French
equivalent was directly retrieved. If no match was found, the string was sequentially processed by the
NMT models, with their outputs being evaluated for errors, malformations, or runtime failures (e.g.,
out-of-memory exceptions). When model inference was unsuccessful or produced invalid results, the system
defaulted to an external translation service using the Google Translate API, accessed programmatically
via the Python library deep-translator.</p>
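<p>The three-stage fallback chain can be sketched as follows; the single dictionary entry stands in for the 25-entry curated dictionary, and the NMT and API calls are stubs rather than the real model invocations:</p>

```python
NAME_DICT = {"Dogmatix": "Idéfix"}  # stand-in for the curated dictionary

def nmt_outputs(name):
    """Yields candidates from the NMT models in order; this stub
    simulates every model failing (empty / invalid output)."""
    yield ""

def google_api(name):
    # Stand-in for deep-translator's Google Translate backend.
    return "[google] " + name

def translate_name(name):
    """Stage 1: dictionary lookup; stage 2: NMT models in sequence;
    stage 3: external API. Returns (translation, manual_flag)."""
    if name in NAME_DICT:
        return NAME_DICT[name], 1
    for candidate in nmt_outputs(name):
        if candidate:  # reject errors, empty, or malformed output
            return candidate, 0
    return google_api(name), 0
```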
          <p>A central component of the pipeline was the manually constructed bilingual dictionary, which
included 25 character names selected for their cultural specificity and stylistic complexity. These
translations were produced using domain expertise in literary and humorous translation. Where
applicable, established canonical French translations were retained (for instance, from Astérix or Charlie
et la chocolaterie). In other cases, novel translations were devised to maintain cultural relevance, phonetic
plausibility, and humorous resonance. Illustrative examples include the transformation of Dogmatix
into Idéfix, Slugworth into Espionix, and Violet Beauregarde into Violetix.</p>
          <p>
            This design choice aligns with prior research on integrating symbolic lexical knowledge into NMT
systems. Arthur et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] demonstrated the utility of using lexical probabilities from external resources
to enhance translations involving rare or culturally marked terms. Similarly, our curated dictionary
supports translation robustness in cases where standard generative models tend to underperform.
          </p>
          <p>Finally, each translated name was tagged with a binary metadata flag (manual: 1 or manual: 0)
to indicate whether it originated from the manual dictionary or from model-based inference. This
annotation schema facilitated downstream analysis, particularly in evaluating the performance and
reliability of automatic versus human-guided translation pathways.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Implementation and Environment</title>
        <p>All systems were implemented in Python 3.10, using the PyTorch deep learning framework (version
2.7.0) in combination with the Transformers library (version 4.52.4) for model loading, tokenization,
and inference. For multilingual translation tasks, we employed pretrained sequence-to-sequence
models such as facebook/m2m100 (418M and 1.2B), facebook/nllb-200 (1.3B and distilled-600M),
MBART50, Helsinki-NLP/opus-mt-en-fr and opus-mt-tc-big-en-fr, t5-small, t5-base,
and MarianMT_BLOOM, all accessed through the Hugging Face Transformers API.</p>
        <p>Inference was executed on an NVIDIA RTX A6000 GPU with 48 GB of memory, under CUDA 12.5,
ensuring fast and stable model performance. All translation outputs were generated in batch mode
using efficient GPU memory allocation strategies. A fallback dictionary was incorporated into each
system to override specific translation cases with manually curated references.</p>
        <p>Complementary translation backends included Argos Translate (via ctranslate2) and the Google
Translate API, both integrated for comparative evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>3.1. Evaluation Metrics</title>
        <p>For Task 2, system outputs are initially ranked using the BLEU metric [20], which computes n-gram
overlap between the machine-generated translation and one or more human references. While BLEU
remains a standard benchmark for translation quality, it has well-known limitations when applied to
pun translation, as it does not account for semantic adequacy, humor preservation, or stylistic variation.</p>
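<p>For intuition, the clipped n-gram precision that BLEU aggregates can be illustrated with a simplified, single-reference sketch (whitespace tokenization, no brevity penalty); production evaluations use full implementations such as sacreBLEU:</p>

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision: each hypothesis n-gram counts only
    up to the number of times it appears in the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return overlap / max(sum(hyp_counts.values()), 1)
```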
        <p>To address this, the final evaluation will be performed manually by expert annotators. Human judges
will assess each translation based on criteria including semantic field preservation, sense equivalence,
will assess each translation based on criteria including semantic field preservation, sense equivalence,
adapted in French. This hybrid evaluation procedure ensures that systems are not only rewarded for
lexical similarity, but also for their ability to capture the linguistic creativity inherent in puns.</p>
        <p>For Task 3, the official evaluation procedure is based on exact match accuracy between the submitted
French names and a set of manually curated reference translations. This automatic scoring rewards
systems that generate character names identical to one of the provided references, thus emphasizing
lexical precision.</p>
        <p>Nevertheless, given the creative and culturally embedded nature of onomastic wordplay, a final
manual evaluation will be conducted by expert annotators. The human evaluation will consider several
qualitative aspects, including preservation of the semantic field, cultural appropriateness, linguistic
creativity, and the extent to which the translated name preserves or effectively transforms the original
wordplay. This two-stage evaluation protocol ensures a balanced assessment that goes beyond
surface-level lexical overlap.</p>
        <p>In addition to BLEU, we employed complementary automatic metrics to obtain
a broader evaluation perspective. BERTScore [21] was used to assess contextual semantic similarity
between system outputs and references, providing finer-grained signals than surface n-gram overlap.
For Task 2, we also evaluated pun location detection accuracy, measuring each model’s ability to
correctly identify the position of wordplay within the sentence.</p>
        <p>These additional metrics allowed us to triangulate translation performance from both lexical and
semantic standpoints. Combined with the expert-driven manual evaluation, this multi-layered
framework ensured a more comprehensive and nuanced assessment of systems’ ability to preserve meaning,
humor, and creativity in multilingual pun translation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 2: Translation of Puns</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Experimental Results</title>
          <p>Table 5 presents the BLEU scores and n-gram precision metrics for the systems submitted by team DUTH
in Task 2, sorted by BLEU score. The top-performing model was hybrid_fusion (BLEU = 41.11),
followed closely by helsinki (BLEU = 41.01). Both exhibited high unigram and bigram precision,
indicating strong lexical overlap with the reference translations.</p>
          <p>Close behind were several Google Translate variants—including GoogleTranslate_fallback and
google_flanT5_fallback—which consistently achieved BLEU scores above 40.7 and maintained
solid performance across n-gram orders up to n = 4. This suggests their effectiveness in capturing
both surface-level and moderately complex lexical patterns.</p>
          <p>argos and m2m100_1_2B also performed competitively, with BLEU scores of 40.49 and 36.46,
respectively. Transformer-based models such as mbart50 and t5_base showed more modest performance
(BLEU 32–33), while combined_m2m100 scored 30.00.</p>
          <p>At the lower end, bloomz3b scored significantly lower (BLEU = 16.68), reflecting limited transfer
capabilities in low-resource or stylistically marked inputs.</p>
          <p>Overall, these results confirm that pretrained multilingual NMT models—especially those enhanced
via fallback mechanisms—remain strong baselines for pun translation. Incorporating hybrid or ensemble
methods, as seen in hybrid_fusion, appears to offer further gains in robustness and lexical coverage.</p>
          <p>For pun location detection, the results are sorted in descending order of accuracy. The highest-performing
model, helsinki, achieved 6.72% accuracy (113 correct identifications). It was followed by
GoogleTranslate_fallback, hybrid_fusion, and google_flant5_fallback, each reaching
6.66% (112 correct), while GoogleTranslate obtained 6.60% and argos scored 6.48%.</p>
          <p>Mid-performing systems included t5_base (5.65%) and m2m100_1_2B (5.41%), whereas mbart50
and combined_m2m100 achieved 4.82% and 4.22%, respectively. The lowest performance was by
bloomz3b, at 2.68% (45 correct). These results underscore a gap between top-ranked systems and
multilingual or instruction-tuned models that lack task-specific adaptation.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task 3: Onomastic Wordplay Translation</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Experimental Results</title>
          <p>Table 8 presents the evaluation results of twelve machine translation systems for Task 3, assessed across
three metrics: Automatic, Manual, and Identical.</p>
          <p>The Automatic score indicates the percentage of lexical matches with the reference translations,
based on string-level comparisons. While informative, it does not guarantee semantic adequacy. The
Manual score reflects the proportion of outputs deemed acceptable by human evaluators, taking into
account meaning preservation, discourse fluency, and cultural appropriateness. The Identical metric
captures the percentage of outputs that are exact copies of the source input—typically signaling failure
to translate rather than successful output generation.</p>
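<p>The three rates described above can be computed straightforwardly; here is a small sketch in which the human judgments are represented by an illustrative set of accepted item indices (function and variable names are hypothetical):</p>

```python
def evaluate_system(pairs, references, accepted_ids):
    """Computes the three Task 3 rates. `pairs` holds (source,
    translation) tuples; `references` holds a set of acceptable French
    names per item; `accepted_ids` marks items judged acceptable by
    human annotators (stand-in for the manual evaluation)."""
    n = len(pairs)
    automatic = sum(t in references[i] for i, (_, t) in enumerate(pairs)) / n
    manual = len(accepted_ids) / n
    identical = sum(s == t for s, t in pairs) / n
    return {"automatic": automatic, "manual": manual, "identical": identical}
```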
          <p>The results reveal substantial variation in system performance. The system Helsinki achieved the
highest automatic score (14.83%) and shared the highest manual score (18.88%), yet also exhibited a
77.67% identical rate. This suggests that a large portion of its outputs were simple copies of the input,
which happened to match the references, artificially inflating its automatic score without necessarily
indicating high translation quality.</p>
          <p>Several systems, such as the facebook-nllb-200 variants and MarianMT_BLOOM, displayed more
balanced behavior across all three metrics (10–11% Automatic, 13–16% Manual, around 45% Identical).
These results point to partial translation adequacy with a conservative output strategy that favors safe
but modest transformations.</p>
          <p>In contrast, low-performing models like facebook-m2m100_1.2B and facebook-m2m100_418M
showed weak performance across all metrics. Their low Identical scores (below 21%) suggest that failure
stemmed not from excessive copying, but from an inability to generate semantically valid translations.</p>
          <p>At the other extreme, one system from the Helsinki family exhibited a 100% identical rate alongside
a manual score of just 2.55%. Although it achieved a non-zero automatic score (11.83%), this clearly
reflects a complete failure in semantic transformation, highlighting the limitations of automatic metrics
when lexical overlap lacks meaningful equivalence.</p>
          <p>Overall, the findings underscore that no single metric sufices for evaluating performance on
linguistically creative tasks such as pun translation. A comprehensive approach that combines automatic
scoring, human judgment, and behavioral indicators like source copying is essential for reliably assessing
translation quality in semantically complex and stylistically sensitive contexts.
The results reveal several important observations:</p>
          <p>One system exhibits a 100% identical rate with almost negligible performance in the manual evaluation
(2.55%). This outcome strongly suggests a complete failure in the translation process, despite registering
a non-zero automatic score (11.83%). This highlights the misleading nature of automatic scores
when systems achieve lexical overlap without true semantic transformation.</p>
          <p>Conversely, another system reaches a 77.45% identical rate while achieving the highest automatic
score (14.66%). This suggests that much of the copied content happened to coincide with the reference
output, potentially inflating the automatic evaluation and masking translation inadequacy when
judged by humans.</p>
          <p>Several systems demonstrate moderate values across all three metrics—with automatic scores between
10–11%, manual scores between 13–14%, and identical outputs around 45%. This distribution implies
partial translation adequacy accompanied by a conservative output strategy, where systems attempt
meaningful translation but lean toward lexical safety.</p>
          <p>Other systems show consistently low performance across all metrics (e.g., scores below 10% in both
automatic and manual evaluations) combined with low identical rates. This pattern suggests not copying,
but rather a limited capacity to generate semantically valid translations.</p>
          <p>In summary, the results reveal substantial variation in system behavior, ranging from source copying
to partial adequacy to complete failure. It becomes evident that no single metric is sufficient
for evaluating performance in translation tasks that involve linguistic creativity. A comprehensive
evaluation approach—one that integrates automatic scores, human judgments, and behavioral indicators
such as copying—offers a more reliable and multidimensional assessment of translation quality in
contexts where semantic ambiguity, creativity, and nuanced expression play a central role.</p>
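          <p>The behavioral indicator discussed above is straightforward to operationalize. The sketch below (illustrative only; it is not the official JOKER evaluation code, and the function names are our own) computes an identical-output rate alongside a clipped unigram precision as a crude stand-in for an automatic overlap score:</p>

```python
def identical_rate(sources, outputs):
    """Fraction of outputs that are verbatim copies of the source (a copying indicator)."""
    pairs = list(zip(sources, outputs))
    return sum(s.strip() == o.strip() for s, o in pairs) / len(pairs)

def clipped_unigram_precision(reference, output):
    """Naive stand-in for an automatic overlap score: clipped unigram precision."""
    ref_tokens = reference.lower().split()
    out_tokens = output.lower().split()
    if not out_tokens:
        return 0.0
    # Each output token type is credited at most as often as it appears in the reference.
    matches = sum(min(out_tokens.count(t), ref_tokens.count(t)) for t in set(out_tokens))
    return matches / len(out_tokens)
```

          <p>A system that copies the source can still score non-zero overlap whenever source and reference happen to share words, which is precisely the inflation effect observed in the results.</p>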
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>Building upon insights from previous CLEF JOKER evaluations [22, 23], this study advances the
exploration of hybrid methods for humor-aware machine translation through our participation in
Tasks 2 and 3 of the CLEF 2025 JOKER shared tasks. Task 2 focused on the translation of puns from
English into French, while Task 3 addressed the creative rendering of culturally marked fictional
character names. In both tasks, our hybrid system—combining neural machine translation (NMT)
models with manually curated fallback mechanisms—demonstrated enhanced robustness and stylistic
fidelity compared to purely neural systems, managing ambiguity, cultural specificity, and linguistic
creativity in multilingual contexts.</p>
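      <p>In outline, such a fallback mechanism reduces to a tiered lookup. The sketch below is illustrative rather than our exact implementation; the function and lexicon names are hypothetical:</p>

```python
def hybrid_translate(text, fallback_lexicon, nmt_translate):
    """Tiered strategy: curated lexicon first, then NMT, then copy-through as a last resort."""
    if text in fallback_lexicon:          # tier 1: manually curated entry
        return fallback_lexicon[text]
    candidate = nmt_translate(text)       # tier 2: neural machine translation
    if candidate and candidate.strip().lower() != text.strip().lower():
        return candidate                  # accept only a genuine translation
    return text                           # tier 3: keep the source unchanged
```

      <p>Rejecting NMT output that merely echoes the source guards against the copying behavior observed in the evaluation.</p>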
      <p>In Task 2, hybrid systems consistently outperformed neural baselines in BLEU and BERTScore
metrics and received favorable human judgments for preserving humor and meaning. Nevertheless,
the divergence between automatic and manual evaluations confirmed the inadequacy of surface-level
metrics in humor translation. In Task 3, our manually constructed bilingual dictionary proved essential
for handling phonologically and culturally complex names, particularly when established or contextually
appropriate translations were required.</p>
      <p>Future work will proceed along several directions. First, we plan to expand our dictionaries through
corpus mining and crowd-sourcing techniques, especially for culturally rich entries. Second, we aim to
develop classifiers capable of detecting linguistic patterns in character names (e.g., alliteration, puns,
cultural references) to dynamically guide translation strategies. Third, we will experiment with prompt
engineering and stylistic control techniques in instruction-tuned models (e.g., FLAN-T5, mT5, BLOOMZ)
to enable zero- and few-shot name translation capabilities.</p>
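      <p>Such a name-pattern classifier could start from simple surface heuristics. The features below are illustrative examples of cues worth detecting, not a finalized feature set:</p>

```python
import re

def name_features(name):
    """Surface cues suggesting a character name may carry wordplay (illustrative heuristics)."""
    words = [w for w in name.split() if w]
    initials = [w[0].lower() for w in words]
    return {
        "alliteration": len(words) > 1 and len(set(initials)) == 1,  # repeated initials
        "hyphenated": "-" in name,
        "contains_digit": bool(re.search(r"\d", name)),
    }
```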
      <p>This study contributes to the emerging field of computational humor and highlights the importance
of integrating symbolic knowledge, linguistic expertise, and creative reasoning in multilingual NLP
systems.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgements</title>
      <p>We thank the organizers of the CLEF 2025 JOKER shared task for providing high-quality data, tools, and
a clear, well-designed evaluation framework for computational humor detection. Their contribution to
promoting reproducible experimentation, robust comparative evaluation, and scientific progress in the
rapidly evolving field of computational humor is greatly appreciated.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>[16] J. Tiedemann, S. Thottingal, OPUS-MT — building open translation services using neural machine
translation, https://github.com/Helsinki-NLP/OPUS-MT, 2020. Accessed: 2025-07-03.
[17] M. R. Costa-jussà, et al., No language left behind: Scaling human-centered machine translation,
arXiv preprint arXiv:2207.04672 (2022).
[18] L. Ermakova, et al., Overview of the CLEF 2025 JOKER Task 2: Wordplay translation from English
into French, in: Working Notes of CLEF 2025 Labs and Workshops, CEUR Workshop Proceedings,
CEUR-WS.org, 2025.
[19] L. Ermakova, et al., Overview of the CLEF 2025 JOKER Task 3: Onomastic wordplay translation, in:
Working Notes of CLEF 2025 Labs and Workshops, CEUR Workshop Proceedings, CEUR-WS.org, 2025.
[20] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics, Association for Computational Linguistics, 2002, pp. 311–318. doi:10.3115/1073083.1073135.
[21] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation
with BERT, in: International Conference on Learning Representations (ICLR), 2020. URL:
https://openreview.net/forum?id=SkeHuCVFDr.
[22] L. Ermakova, et al., Overview of the JOKER track at CLEF 2023: Automatic wordplay analysis, in:
CLEF 2023 Working Notes, CEUR-WS.org, 2023.
[23] L. Ermakova, T. Poibeau, et al., JOKER@CLEF 2024: Challenges in cross-lingual humor translation,
in: CLEF 2024 Working Notes, CEUR-WS.org, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Delabastita</surname>
          </string-name>
          , Introduction, in:
          <source>Wordplay and Translation: Essays on Punning and Translation</source>
          , St. Jerome Publishing,
          <year>1996</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiaro</surname>
          </string-name>
          ,
          <article-title>Translation and humour, humour and translation</article-title>
          , in: D.
          <string-name>
            <surname>Chiaro</surname>
          </string-name>
          (Ed.), Translation, Humour and Literature, Continuum,
          <year>2010</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2025 JOKER lab: Humour in machine</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          . URL: https://www.joker-project.com/, to appear.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.), Working Notes of CLEF 2025:
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Arampatzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arampatzis</surname>
          </string-name>
          ,
          <article-title>DUTH at SemEval-2023 Task 2: Multilingual complex named entity recognition with cross-lingual ensemble learning</article-title>
          ,
          <source>in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1238</fpage>
          -
          <lpage>1243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Arthur</surname>
          </string-name>
          , G. Neubig,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <article-title>Incorporating discrete translation lexicons into neural machine translation</article-title>
          ,
          <source>in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1557</fpage>
          -
          <lpage>1567</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Google</surname>
          </string-name>
          , Google Translate, https://translate.google.com,
          <year>2025</year>
          . Accessed July 2025.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Argos Open Technologies</surname>
          </string-name>
          , Argos Translate, https://www.argosopentech.com/,
          <year>2025</year>
          . Version 1.8, accessed July 2025.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thottingal</surname>
          </string-name>
          ,
          <article-title>Opus-mt - building open translation services for the world</article-title>
          ,
          <source>in: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Multilingual denoising pre-training for neural machine translation</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          , A.
          <string-name>
            <surname>El-Kishky</surname>
          </string-name>
          , et al.,
          <article-title>Beyond english-centric multilingual machine translation</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          , et al.,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.11416</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          , et al.,
          <article-title>Crosslingual generalization through multitask finetuning</article-title>
          ,
          <source>arXiv preprint arXiv:2211.01786</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thottingal</surname>
          </string-name>
          ,
          <article-title>Opus-mt: Building open translation services for the world</article-title>
          ,
          <source>arXiv preprint arXiv:2005.11867</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>