<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team Pratham at TextDetox CLEF 2025/Multilingual Text Detoxification 2025: Exploring diferent Methods for Multilingual Text Detoxification via Prompted MT0-XL,Lexical Filtering and Backtranslation.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pratham Shah</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vatsal Shah</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunil Kale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CS Department</institution>
          ,
          <addr-line>VJTI</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents our solution to the Multilingual Text Detoxification task at PAN 2025, developed by Team Pratham. We implemented a hybrid system that combines prompted inference using the MT0-XL-detox-orpo1 model with targeted lexical filtering to improve detoxification quality across 15 languages. To address challenges in code-mixed and morphologically rich languages, we introduced a backtranslation-based pipeline for six languages and developed handcrafted toxic word dictionaries (for Hinglish) and used multilingual-toxic-lexicon2 for fine-grained filtering. Our system relies on zero-shot prompting without additional fine-tuning, leveraging multilingual transformer generalization along with rule-based post-processing. Evaluated on both automatic and human metrics, our approach achieves strong performance, including a J-score of 0.676 on the final test set, demonstrating near state-of-the-art performance across several languages.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;PAN 2025</kwd>
        <kwd>Detoxification</kwd>
        <kwd>MT0-XL</kwd>
        <kwd>Lexical Filtering</kwd>
        <kwd>XLM-R</kwd>
        <kwd>LaBSE</kwd>
        <kwd>Multilingual NLP</kwd>
        <kwd>Backtranslation</kwd>
        <kwd>NLLB600M</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Toxic language on the internet has become a growing concern, afecting how people communicate
and engage online. From hate speech to subtle ofensive remarks, such content can make digital
spaces feel hostile and unsafe. As online platforms become more global, tackling toxicity in multiple
languages becomes even more important—but also more complex due to diferences in grammar,
slang, and cultural nuance. One solution that goes beyond simply blocking or deleting content is text
detoxification—rewriting harmful language into a more neutral, respectful form while preserving the
original meaning. This not only helps maintain healthy conversations but also reduces the need for
harsh censorship. In this paper, we present our solution submitted to the PAN 2025 Multilingual Text
Detoxification task. Our system is built upon a two-stage hybrid pipeline that integrates both neural and
rule-based approaches. First, we leverage the multilingual instruction-tuned language model MT0-XL
to generate detoxified outputs using task-specific prompting strategies. MT0-XL operates in zero-shot
inference mode, eliminating the need for additional fine-tuning. Second, we apply language-specific
lexical filtering techniques to identify and sanitize residual toxic expressions that may bypass model
ifltering, especially in morphologically complex or code-mixed inputs.</p>
      <p>To further enhance performance for challenging languages such as Hinglish, Hebrew, Japanese, Italian,</p>
      <p>French, and Tatar, we introduce a backtranslation-based pipeline involving intermediate detoxification
in a high-resource pivot language(English) followed by retranslation into the target language using
NLLB-600M1 model. This approach is supported by custom toxic word dictionaries(for Hinglish) and
multilingual lexicons for finer-grained control.</p>
      <p>Our system supports detoxification in 15 languages, covering both high- and low-resource settings.
The evaluation results based on standard automatic metrics - XCOMET2/Chrf for fluency, LaBSE 3
for semantic similarity, and a fine-tuned XLM-R 4 classifier for toxicity - demonstrate that our hybrid
method achieves strong performance, including a final J score of 0.676 on the PAN 2025 test set, securing
4th position in the final Leaderboard. These results highlight the efectiveness of combining large
multilingual transformers with lightweight, language-aware filtering mechanisms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Previous Work</title>
      <p>Text detoxification has been approached through a variety of modeling strategies, broadly categorized
into token-level filtering , sequence-to-sequence rewriting, and controlled generation. Early eforts relied on
encoder-only models such as BERT, which were used to identify and filter toxic content at the token or
sentence level. These approaches typically lacked generative capabilities and were limited to binary
classification or masking.</p>
      <p>With the advent of powerful encoder-decoder models, sequence-to-sequence (seq2seq) approaches
became more popular. Models like T5, mT5, and mBART5 enabled the rewriting of toxic sentences
into non-toxic paraphrases by training on parallel corpora or through fine-tuned task-specific pipelines.
These methods allowed for greater flexibility in preserving meaning while transforming sentence tone
or style.</p>
      <p>Several shared tasks have accelerated progress in this domain. Notably, the RUSSE-2022 and PAN
2024 detoxification tracks highlighted the efectiveness of instruction-tuned models such as MT0,
which showed strong performance across multiple languages without the need for extensive fine-tuning.
These tasks also emphasized the importance of semantic preservation, toxicity reduction, and fluency in
generated outputs.</p>
      <p>Our work builds on these foundations but introduces two key distinctions. First, we apply MT0
in a zero-shot inference mode, leveraging its instruction-following capability without additional
ifne-tuning. This reduces dependence on task-specific datasets and allows for broader generalization
across languages. Second, we enhance this with a handcrafted, language-specific lexical filtering
layer, which targets residual toxic expressions not efectively handled by generic models—particularly
in low-resource and code-mixed scenarios.</p>
      <p>While prior works have explored backtranslation or multilingual adaptation, few have combined
large instruction-tuned models with custom, linguistically informed detoxification strategies. By doing
so, our approach addresses persistent challenges in multilingual detoxification , particularly for languages
with limited training data, complex morphology, or informal syntactic patterns.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Prompted Detoxification with MT0-XL</title>
        <p>We used the s-nlp/mt0-xl-detox-orpo model from Hugging Face Transformers as the backbone of our
multilingual detoxification pipeline. This model is an instruction-tuned variant of MT0-XL, optimized
using Direct Preference Optimization (ORPO) specifically for detoxification tasks. We chose this model
1facebook/nllb-200-distilled-600M on HuggingFace: https://huggingface.co/facebook/nllb-200-distilled-600M
2myyycroft/XCOMET-lite on HuggingFace: https://huggingface.co/myyycroft/XCOMET-lite
3sentence-transformers/LaBSE on HuggingFace: https://huggingface.co/sentence-transformers/LaBSE
4textdetox/xlmr-large-toxicity-classifier-v2 on HuggingFace:
https://huggingface.co/textdetox/xlmr-largetoxicity-classifier-v2
5textdetox/mbart-detox-baseline on HuggingFace: https://huggingface.co/textdetox/mbart-detox-baseline
due to its robust multilingual support, strong zero-shot performance, and proven ability to generate
lfuent and semantically consistent detoxified text using natural language prompts.</p>
        <p>The model was deployed in inference-only mode, with no additional fine-tuning. Instead, we leveraged
its instruction-following capabilities by designing a set of language-specific prompts such as “Detoxify:”,
“Entgiften:”, and “Desintoxicar:”, covering all 15 languages in the PAN 2025 task. These prompts were
organized in a manually curated dictionary and dynamically injected into each input at runtime.</p>
        <p>We implemented a modular inference pipeline in PyTorch using Hugging Face Transformers. Our
custom DetoxificationDataset class handled multilingual tokenization, batch collation, and prompt
templating. This enabled eficient large-scale inference across all languages in a unified framework.
Our contributions also included:
• Crafting a multilingual prompt strategy to optimize zero-shot detoxification
• Designing and integrating a batch-wise inference system for scalable deployment
• Developing post-processing modules including lexical filtering and toxicity masking
• Conducting detailed evaluation using XLM-R for toxicity, LaBSE for semantic similarity, and</p>
        <p>XCOMET and Chrf for fluency.</p>
        <p>Although MT0-XL was able to generate outputs for all 15 target languages, it did not perform equally
well across them. Specifically, its efectiveness dropped for few languages—Hinglish, Hebrew, Japanese,
Chinese, and Tatar. In these cases, the model often struggled to completely eliminate toxic expressions
or to maintain fluency and semantic coherence. This can be attributed to factors such as linguistic
complexity, code-mixing, morphological richness, and the model’s limited exposure to these languages
during pretraining.</p>
        <p>To overcome these limitations, we extended our system with two complementary strategies:
languagespecific lexical filtering and a backtranslation-based detoxification pipeline. Initially, these approaches
were developed and evaluated independently. However, we ultimately combined them to form a hybrid
model that addressed both structural and linguistic challenges more efectively.</p>
        <p>This hybrid approach significantly improved output quality—particularly for the more complex or
underrepresented languages—by first translating toxic inputs into a high-resource pivot language for
intermediate detoxification, and then retranslating them back into the original language. To further
refine the results, we applied custom toxic word dictionaries (e.g., for Hinglish) and multilingual
lexicon-based filters to detect and eliminate residual toxicity in the final output.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. MT0-XL + Language-Specific Lexical Filtering</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Language-Specific Lexical Filtering</title>
          <p>For languages like Hinglish, we curated custom toxic word dictionaries tailored to the unique challenges
of code-mixed language. The decision to build a handcrafted list arose from the limited coverage of
existing lexicons, which often missed informal, transliterated, or culturally specific slurs commonly
used in toxic Hinglish expressions. Since we are familiar with the language, we manually analyzed
toxic sentences from the dataset and extracted additional toxic terms that were frequently missed by
the model. These words were then included in our dictionary to ensure the model would delete or mask
them efectively during post-processing.</p>
          <p>For other languages such as Hebrew, Japanese, and Chinese, we leveraged publicly available toxic
lexicons from Hugging Face and similar resources. During post-processing, we applied a combination
of strategies—including masking, deletion, and soft lexical replacements—to eliminate residual toxic
expressions that were not fully addressed by MT0-XL, as illustrated in Figure 1 below. This
postprocessing step proved critical for enhancing detoxification quality, especially in morphologically rich
or low-resource languages where generative models alone were insuficient.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Evaluation Results</title>
          <p>Our system demonstrated robust performance across the 15 evaluated languages, with an overall J-score
average of 0.634, and an XCOMET fluency score average of 0.846.(Table 1)</p>
          <p>High-resource languages such as Ukrainian (J = 0.776), Russian (0.755), German (0.750), French (0.752),
and Italian (0.749) achieved the highest J-scores, supported by strong performance across all sub-metrics,
including fluency (XCOMET greater than 0.89), semantic similarity (LaBSE greater than 0.91), and high
toxicity neutrality (STA greater than 0.92).</p>
          <p>In contrast, Hinglish (0.339), Hebrew (0.413), Chinese (0.530), and Amharic (0.486) still scored
significantly lower in J-score, reflecting challenges in handling code-mixed, low-resource, or morphologically
complex languages. These results reafirmed the necessity of our backtranslation pipeline, which helped
mitigate but not entirely resolve residual toxicity and fluency degradation in such languages.</p>
          <p>The average J-score for non-pivot languages was 0.656, while non-pivot languages scored 0.571 when
using only MT0-XL, demonstrating the efectiveness of our adaptive strategy in challenging linguistic
settings.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. MT0-XL + Backtranslation based Detoxification</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Backtranslation-Based Detoxification</title>
          <p>While MT0-XL exhibited strong performance in multilingual detoxification tasks, its direct application
to certain languages revealed notable shortcomings. Specifically, for Italian, French, Japanese, Tatar,
Hinglish, and Hebrew, we found that direct inference often failed to suficiently reduce toxicity. These
failures were especially pronounced in cases involving code-mixed expressions (as seen in Hinglish),
informal or non-standard usage, and morphological or syntactic complexity that MT0-XL may not have
been adequately trained to handle.</p>
          <p>To address these challenges, we designed a backtranslation-based detoxification pipeline that
incorporates a high-resource pivot language—either English or Arabic, depending on the language
family and characteristics of the input. English was used for Indo-European and code-mixed languages
(e.g., French, Italian, Hinglish, Tatar, Japanese), while Arabic served as a more linguistically aligned
pivot for Semitic (e.g., Hebrew). This choice aimed to maximize semantic fidelity during translation
and to improve detoxification efectiveness by operating in a language space where models were better
optimized.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Pipeline Design and Rationale:</title>
          <p>1. Translation to Pivot Language (English or Arabic):</p>
          <p>The first step involved translating the original toxic sentence into a high-resource pivot language.
The rationale here was to project the sentence into a more semantically normalized and
resourcerich space where detoxification models could operate more reliably. This process helped reduce
linguistic noise, addressed irregularities such as code-mixing or morphological inflection, and
enabled a more consistent application of detoxification techniques.</p>
          <p>The final step of backtranslation was performed using NLLB-600M. Although larger models like
NLLB-3.3B were available, we opted for NLLB-600M due to hardware limitations. Despite its
smaller size, NLLB-600M ofered reliable performance across a wide range of languages, making
it a practical choice for multilingual backtranslation in resource-constrained settings.
2. Detoxification in Pivot Language:</p>
          <p>The translated sentence was then passed through a dedicated detoxification model. We evaluated:
• s-nlp/mt0-xl-detox-orpo
• s-nlp/bart-base-detox
The choice of detoxification model was informed by performance metrics such as semantic
preservation, fluency, and toxicity reduction. MT0-XL-detox-orpo consistently produced better
outputs, particularly across pivot languages. Its instruction-tuned nature allowed it to efectively
follow detoxification commands without needing task-specific fine-tuning.
3. Backtranslation to Original Language:</p>
          <p>Once detoxified, the sentence in the pivot language was backtranslated into the original language
using NLLB-600M. This model was chosen for its strong multilingual capabilities and support
for low-resource languages. Backtranslation served two purposes:
• It restored the original language while maintaining the non-toxic transformation.
• It minimized information loss or distortion by leveraging a robust multilingual model trained
on diverse parallel corpora.</p>
          <p>Experiments and Analysis The detoxification pipeline was designed with a pivot-language strategy
to address the performance disparity between high-resource and low-resource languages. English and
Arabic were chosen as pivots due to their strong alignment with specific language families, which aids
in preserving both sentence structure and meaning during translation. Detoxifying content in the pivot
language allows for the use of well-trained detoxification models without the need to fine-tune for
each individual target language. Furthermore, translation back to the original language is handled by
NLLB-600M, which ensures that the detoxified sentence is rendered fluently and faithfully, thereby
improving the overall quality of the final output.
Observed Improvements: This approach led to noticeable improvements in detoxification, especially
for linguistically complex or low-resource languages that exhibited residual toxicity when directly
prompted. The pipeline demonstrated particular efectiveness in handling code-mixed inputs, such as
Hinglish, due to the normalization efects introduced during the pivot translation process. Semantic
ifdelity was well-maintained, with detoxified outputs retaining the original intent while reducing or
eliminating ofensive content. Overall, the use of English and Arabic as pivot languages not only
simplified the implementation but also enhanced the generalizability of the solution across diverse
language families.</p>
          <p>In summary, the proposed MT0-XL + Backtranslation strategy with language-aware pivot
selection provides a scalable and efective solution for multilingual detoxification, especially in settings
where direct inference with instruction-tuned models proves insuficient.</p>
          <p>Languages Using Backtranslation:
• Italian
• French
• Japanese
• Tatar
• Hinglish
• Hebrew</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.2. Direct Prompt-Based Detoxification</title>
          <p>For the remaining nine languages—English, Hindi, Russian, Spanish, German, Arabic, Ukrainian,
Chinese, and Amharic—we used direct prompting with MT0-XL without any intermediate translation.
This was efective due to the availability of training data for these languages and better generalization
of MT0’s multilingual capabilities.</p>
          <p>The prompt-only method provided fluent and contextually relevant detoxified outputs in most cases,
with minimal need for further correction.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.3. Evaluation Results</title>
          <p>Our system demonstrated strong overall performance across the 15 target languages, with high J-scores
in high-resource languages such as Ukrainian (0.786), Russian (0.758), German (0.746), French (0.777),
and English (0.725). These results reflect high fluency (XCOMET &gt; 0.89), semantic similarity (LaBSE &gt;
0.87), and efective toxicity reduction (STA &gt; 0.90) in those languages.(Table 2)</p>
          <p>Languages such as Amharic (0.488), Hebrew (0.486), Chinese (0.530), and particularly Hinglish
(0.323) showed lower J-scores, indicating that these morphologically rich or code-mixed languages
required additional post-processing. The impact of our backtranslation pipelines was especially visible
in improving scores for challenging languages like Tatar (0.650) and Japanese (0.584).</p>
          <p>The XCOMET across all languages was consistently higher than (0.84), confirming the fluency of the
outputs, while ChrF was mainly used for early-stage validation. The composite J-score validated our
hybrid model’s robustness, with a clear margin between pivot and non-pivot languages, justifying the
tailored detoxification strategies.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Hybrid Model: Prompted MT0-XL, Lexical Filtering, and Backtranslation</title>
        <p>To further enhance detoxification quality across diverse linguistic scenarios, we developed a hybrid
pipeline that combines three complementary components: prompted inference using MT0-XL,
backtranslation, and language-specific lexical filtering . Each component addresses specific
limitations when used in isolation and, when integrated, provides a more robust and adaptive detoxification
mechanism across both high-resource and low-resource languages.</p>
        <p>Prompted MT0-XL Inference: At the core of our hybrid strategy lies the
s-nlp/mt0-xl-detox-orpo model, used in zero-shot mode with carefully crafted
detoxification prompts for each target language. This approach leverages MT0-XL’s multilingual generalization
capabilities without requiring task-specific fine-tuning, making it a scalable and language-agnostic
solution. While MT0-XL alone achieved strong performance on many languages, it often struggled
to eliminate deeply embedded or culturally nuanced toxic phrases, particularly in code-mixed or
morphologically rich languages.</p>
        <p>Backtranslation: To further improve output quality—particularly in terms of fluency and deep
detoxification—we integrated a backtranslation stage into the pipeline. The MT0-XL output was first
translated into a high-resource pivot language (typically English or Arabic), then detoxified again
using MT0-XL, and finally backtranslated into the original language using NLLB-600M. This step
helped restore syntactic fluency and eliminate persistent toxic artifacts, especially in low-resource
or syntactically irregular languages. We applied this technique to six languages where direct MT0
prompting was insuficient: Italian, French, Japanese, Tatar, Hebrew, and Hinglish.
Lexical Filtering: To address remaining toxicity that persisted even after backtranslation or
prompting, we added a final language-specific lexical filtering step. This component scans the generated output
for known toxic terms using curated dictionaries and either removes or replaces them with neutral
alternatives. The filtering process is tailored to each language:
• For Hinglish, we manually constructed a synthetic toxic-neutral dictionary based on common
codemixed expressions. This dictionary captured informal, transliterated, and culturally grounded
toxic phrases frequently missed by generative models.
• For other languages, we utilized publicly available toxicity lexicons hosted on Hugging Face.</p>
        <p>These lexicons were manually vetted and compiled from authoritative sources to ensure broad
and language-specific coverage of ofensive expressions.</p>
        <p>In summary, our final system applies:
• Direct prompted MT0-XL inference for 9 pivot languages,
• Backtranslation refinement for six selected languages (Italian, French, Japanese, Tatar, Hebrew,</p>
        <p>Hinglish), and
• Lexical filtering as a final sweep to sanitize residual toxicity in all outputs.</p>
        <p>This hybrid pipeline ensures a balanced trade-of between scalability, generalization, and linguistic
sensitivity, enabling strong performance across a wide variety of linguistic contexts.
Benefits of the Hybrid Approach:
• Combining MT0-XL’s generative power with explicit lexical filtering significantly improved
detoxification efectiveness for code-mixed and informal inputs.
• Lexical resources enhanced coverage of language-specific toxicity , especially in edge cases
where generative models underperformed.
• Backtranslation further improved fluency and semantic stability , serving as a final layer of
refinement.
• The pipeline remained scalable and language-agnostic, with minimal dependence on fine-tuned
models.</p>
        <p>In summary, this hybrid framework demonstrated superior performance across a wide variety of
languages and toxicity types, particularly for informal, low-resource, or culturally specific inputs. The
integration of curated lexical knowledge and generation-based detoxification proved to be a powerful
combination for robust multilingual text detoxification.</p>
        <sec id="sec-3-4-1">
          <title>3.4.1. Evaluation Results</title>
          <p>Our multilingual detoxification system achieved a strong overall performance with an average J-score
of 0.640 and a high XCOMET fluency score of 0.846 . The model excelled particularly on
highresource languages such as French (J = 0.773), Ukrainian (J = 0.770), German (J = 0.758), Russian
(J = 0.753), Italian (J = 0.739), and English (J = 0.728), where it delivered fluent, semantically similar,
and efectively detoxified outputs.(Table 3)</p>
          <p>In contrast, performance was noticeably lower for code-mixed or morphologically rich languages
like Hinglish (J = 0.333), Hebrew (0.480), Chinese (0.541), Amharic (0.489), and Japanese (0.565),
which posed greater challenges due to limited training data, informal phrasing, or complex grammar.</p>
          <p>Notably, the subset of languages processed with our backtranslation pipeline—referred to as pivot
languages—achieved a higher average J-score of 0.675, compared to 0.587 for non-pivot languages
that used direct MT0-XL prompting. These results validate the efectiveness of our hybrid strategy,
where backtranslation and lexical filtering significantly improve detoxification quality for linguistically
complex or low-resource languages.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Evaluation Pipeline</title>
        <p>We evaluated outputs using:
• Toxicity – using a fine-tuned XLM-R classifier
• Similarity – using LaBSE cosine similarity
• Fluency – using XCOMET and Chrf
Our primary evaluation was conducted using the oficial metrics provided by the PAN 2025 organizers:
XCOMET for fluency, LaBSE for semantic similarity, and a fine-tuned XLM-R toxicity classifier for
toxicity reduction. We acknowledge that XCOMET is the oficial fluency metric, as it is specifically
designed to reflect human preferences in multilingual generation tasks. However, during the early
stages of development—prior to integration of the XCOMET evaluation module—we also used ChrF as
a lightweight proxy for fluency, due to its wide availability, simplicity, and relatively strong correlation
with fluency in prior multilingual text generation tasks.</p>
        <p>No external annotated dataset was used for evaluation; instead, we used the development set provided
by the PAN 2025 task organizers, which includes pairs of toxic and reference-neutral sentences in 15
languages. All similarity and fluency metrics were computed on this set. Once the oficial evaluation
codebase and XCOMET integration were available, we adopted it as our primary fluency metric, and
updated our analysis accordingly. The reported results and final system evaluation are therefore fully
aligned with the oficial PAN evaluation protocol.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Our submitted system ranked 4th overall on the oficial PAN 2025 Multilingual Text Detoxification
task leaderboard during the final test phase. Despite being a lightweight and inference-only approach
without any model fine-tuning, our hybrid pipeline performed competitively against other
state-ofthe-art systems. This placement reflects the robustness and efectiveness of combining prompted
MT0-XL inference, backtranslation, and language-specific lexical filtering across both high-resource
and low-resource languages.(Table 4)</p>
      <p>However, the organizers also introduced a complementary evaluation using LLM-as-a-Judge,
leveraging a fine-tuned LLaMA-3.1 -8B-Instruct model to simulate human-like assessment. Under this subjective,
human-aligned evaluation, our system ranked 15th out of 32 teams on non-pivot languages, and 7th
out of 32 teams on pivot languages.</p>
      <p>This discrepancy highlights an important insight: while our approach excels under automatic metrics,
particularly for high-resource languages, its relative efectiveness decreases in cross-lingual and
codemixed scenarios when judged by human preferences. Despite this, our system still outperformed several
baselines (e.g., baseline_mt0, baseline_gpt4) and remained competitive across both evaluation settings.</p>
      <p>These results validate the strength of our hybrid methodology while also pointing toward future
improvements—especially in enhancing style transfer fluency and human alignment in low-resource or
informal language settings.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to assist with language editing
and minor rephrasing. After using this tool, the author(s) reviewed and edited the content as needed
and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Generative AI Detection, Multilingual Text Detoxification, Multiauthor Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          ,
          <source>in: Advances in Information Retrieval, Lecture Notes in Computer Science</source>
          , Springer Nature Switzerland,
          <year>2025</year>
          , pp.
          <fpage>434</fpage>
          -
          <lpage>441</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Protasov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          , I. Alimova,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Konovalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liebeskind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litvak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Takeshita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vanetik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <source>Overview of the Multilingual Text Detoxification Task at PAN</source>
          <year>2025</year>
          , in: Working Notes of CLEF 2025 -
          <article-title>Conference and Labs of the Evaluation Forum, CEUR-WS</article-title>
          .org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>