<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing AI Text Detection with Frozen Pretrained Encoders and Ensemble Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shushanta Pudasaini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Miralles-Pechuán</string-name>
          <email>luis.miralles@TUDublin.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Lillis</string-name>
          <email>david.lillis@ucd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marisa Llorens Salvador</string-name>
          <email>marisa.llorens@TUDublin.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technological University Dublin</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University College Dublin</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>82</volume>
      <issue>2025</issue>
      <fpage>236</fpage>
      <lpage>241</lpage>
      <abstract>
<p>As AI systems become increasingly capable of generating text, distinguishing it from human-written content remains an ongoing research challenge. This paper proposes a simple yet effective ensemble-based approach for detecting AI-generated text using pre-trained encoders. Six different Large Language Models (LLMs) were fine-tuned on the PAN CLEF 2025 training set, and six ensemble learning approaches were applied on top of the five best-performing LLMs. These models were evaluated on the PAN CLEF validation dataset and a subset of the COLING 2025 dataset to assess performance across multiple datasets and domains. Experiments on benchmark datasets show that ensemble approaches significantly outperform individual models, achieving improved F1 scores and robustness across diverse AI-generated text samples. The best configuration (Bagging with a support vector classifier on top of the outputs of the top 5 performing individual LLMs) achieved an F1 score of 0.9886 on the PAN CLEF 2025 benchmark, compared to the F1 score of 0.9767 from the individual Deberta-v3-large model on the same benchmark dataset. Likewise, the preservation of pre-trained knowledge through frozen encoder layers consistently improved detection performance, demonstrated by the Deberta-v3-large model's 2.67% F1 score improvement compared to its fully fine-tuned version. Overall, ensemble learning algorithms applied on top of LLMs were found to improve performance on the AI-generated text detection task, as evaluated in the Voight-Kampff Generative AI Detection 2025 task [1], part of PAN at CLEF 2025 [2], with the submission made through the TIRA platform [3]. The research is publicly available on GitHub at https://github.com/ShushantaTUD/Ensemble-Based-AI-Generated-Text-Detection.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>AI Generated Text Detection</kwd>
        <kwd>Ensemble Learning</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Encoders</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>AI-generated text refers to content generated by Large Language Models (LLMs) like ChatGPT, which are trained on large datasets of human-written text. These LLMs can create essays, articles, and even research papers that mimic human writing styles, making it difficult to distinguish them from human-written content [4]. With the rapid development and availability of LLMs to the general public, their effects have reached across educational, professional, and personal contexts, raising important questions about originality, authenticity, and intellectual integrity.</p>
<p>The educational sector faces particular challenges from AI-generated text, as it enables AI-based plagiarism. Students submit assignments that are partially or completely generated using AI tools, and use these tools to generate answers during online examinations, creating an unfair learning environment [4].</p>
<p>The ability of AI systems to generate convincing fake news articles, social media posts, and technical content is creating information disorder and reducing trust in legitimate sources. LLMs can introduce inaccuracies, fabricated citations, or flawed reasoning that humans cannot easily detect [5]. As AI systems become increasingly capable of generating text, developing robust and adaptable detection systems is not just a technical challenge but a necessary step to maintain trust, fairness, and the integrity of information in society.</p>
<p>Existing methods for detecting AI-generated text involve various strategies, such as supervised detection, zero-shot detection, retrieval-based detection, watermarking methods, and discriminating features [6]. Supervised detection involves models that are fine-tuned on AI-generated and human-written text. This approach typically requires large datasets, making it difficult to collect sufficiently large and diverse sample collections. Another approach to detecting AI-generated text is zero-shot detection, which uses pre-trained algorithms, eliminating the need to collect a large dataset [7]. Retrieval-based detection is another method: it compares the semantic similarity of a given text with previously stored AI-generated texts, and therefore relies heavily on an extensive and up-to-date database of AI-generated texts.</p>
<p>AI-generated texts can be embedded with a model signature that is invisible to the human eye, allowing them to be detected only by a computer. This method is known as watermarking. Another approach to detecting AI-generated text involves identifying distinctive traits that separate AI-generated texts from human-written texts, such as statistical or linguistic features [6]. Despite these various detection methods, the evolving capabilities of AI language models continue to present challenges for reliable detection, highlighting the need for ongoing research and development of more robust identification techniques.</p>
<p>The difficulty in detecting AI-generated text stems from the basic architecture of LLMs, which is optimised to generate text that mimics human writing. LLMs are trained on vast amounts of human-written text, making AI-generated texts almost indistinguishable from human-written texts [4]. Current AI detection tools show limited effectiveness: OpenAI's detector correctly identifies only 26% of AI-generated texts, indicating the technical complexity of this task [8].</p>
<p>Detecting AI-generated text becomes more complicated as language models evolve rapidly, while detection tools rely on outdated methods and data [9]. Because detection methods cannot be tested until new LLMs are launched, they always lag one step behind. Simple techniques like paraphrasing AI-generated text can easily bypass many detectors [10]. Several challenges further complicate the identification of AI-generated text: the absence of standardised benchmarks for evaluating detection accuracy, the high computational cost of these tools, inherent biases that may unfairly flag texts written by non-native English speakers as AI-generated, the rapid advancement of LLMs that outpaces detector development, and susceptibility to adversarial attacks [9, 11].</p>
<p>To address the limitations of individual models, ensemble learning has become a powerful strategy for enhancing the detection of AI-generated content (AIGC). Ensemble methods combine the strengths of multiple models, each capable of identifying different patterns or compensating for the weaknesses of others. By aggregating predictions through voting, averaging, or weighted combinations, ensemble approaches help reduce errors caused by model bias (oversimplification) or variance (over-sensitivity to data). As a result, the detection of AI-generated text becomes more accurate, robust, and reliable [4].</p>
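      <p>As a minimal illustration of the averaging step (our own sketch, using hypothetical probability values rather than outputs of any of the models discussed later):</p>
      <preformat>
import numpy as np

# Hypothetical class-probability outputs of three detectors for two texts,
# with columns [P(human-written), P(AI-generated)].
probs_a = np.array([[0.90, 0.10], [0.30, 0.70]])
probs_b = np.array([[0.80, 0.20], [0.40, 0.60]])
probs_c = np.array([[0.60, 0.40], [0.20, 0.80]])

# Soft voting: average the probabilities, then pick the most likely class.
avg = np.mean([probs_a, probs_b, probs_c], axis=0)
pred = np.argmax(avg, axis=1)
print(avg)   # approx. [[0.767 0.233], [0.30 0.70]]
print(pred)  # [0 1]: text 1 is judged human-written, text 2 AI-generated
</preformat>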
<p>Ensemble methods use techniques such as bagging and boosting. This collective approach is especially valuable for complex detection tasks where single models struggle to capture all relevant features, and research suggests that ensembles are more resilient to adversarial attacks and generalise better between different types of AI-generated content [4]. Thus, ensemble learning offers a practical path forward in addressing the technical and evolving challenges of detecting AI-generated text.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>The challenge of distinguishing human-written text from machine-generated content has grown rapidly
with the widespread use of large language models. However, this problem did not emerge suddenly; it
evolved from earlier research in related areas such as plagiarism detection.</p>
<p>Early approaches to detecting machine-generated text were inspired by plagiarism detection methods. Techniques such as part-of-speech (POS) tag n-grams and perplexity-based measures were used to identify paraphrased or automatically rewritten content. These models performed well on texts with both high and low levels of obfuscation and achieved competitive results under the Plagdet evaluation metric [12].</p>
<p>As research progressed, more efficient models were proposed. One such model was the Weighted Neural Bag-of-n-grams (WNB-ngram), a lightweight neural network designed for text classification tasks. It performed well on datasets like Yelp Reviews and IMDB, demonstrating that even small models could capture meaningful linguistic patterns [12].</p>
      <p>With the rise of deep learning, researchers began framing AI text detection as a classification problem.
Large pre-trained language models such as RoBERTa and DeBERTa were fine-tuned on specially curated
datasets to detect subtle patterns in text that distinguish between human and AI-written content [13].</p>
<p>In a different approach, Harika Abburi et al. [14] applied classical machine learning techniques like Gradient Boosting, Stacking, and Voting. Instead of raw text, these models used the probability outputs from various pre-trained LLMs as input features. Their system achieved high performance in the AuTexTification shared task, ranking first in model attribution for both English and Spanish texts.</p>
      <p>In parallel, real-world tools for AI text detection started to appear. OpenAI released its classifier in
2023, but it was later discontinued due to poor accuracy on short or factual inputs. Other tools, such as
GLTR, GPTZero, and DetectGPT, experimented with analysing token-level likelihoods and distribution
shifts to identify AI-generated text [13].</p>
      <p>Together, these developments show how the field has moved from early rule-based techniques to
classical machine learning, and now to advanced fine-tuned language models. Despite these advances,
reliable AI text detection, especially in open-domain settings, remains an ongoing research challenge.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
<p>Our study proposes a simple yet effective ensemble-based approach for detecting AI-generated text using large language models. The methodology consists of four main components: dataset preparation, base model selection, feature engineering, and ensemble learning. The overall methodology of the experiment is represented in Fig. 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets for Training and Evaluation</title>
        <p>The benchmark datasets used contain both human-written and AI-generated texts. These include datasets from COLING 2025 and PAN CLEF, which provide labelled samples in English; each dataset is preprocessed using the tokeniser of the corresponding model. We split the training dataset into training (80%) and validation (20%) subsets, and made predictions on the test set.</p>
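        <p>As a minimal sketch of this preparation step (assuming, hypothetically, that the training data is available as a CSV file; the file and column names are illustrative, not from the released datasets):</p>
        <preformat>
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Hypothetical file and column names for the PAN CLEF training data.
df = pd.read_csv("pan_clef_2025_train.csv")  # columns: text, label (0=human, 1=AI)

# 80/20 train/validation split, stratified to preserve the class balance.
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Each base model uses its own tokeniser, e.g. for DeBERTa-v3-large:
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
train_enc = tokenizer(
    train_df["text"].tolist(), truncation=True, padding=True, return_tensors="pt"
)
</preformat>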
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Base LLMs</title>
        <p>The following LLMs were used in the experiment:</p>
        <list list-type="bullet">
          <list-item><p>microsoft/deberta-v3-large [15]</p></list-item>
          <list-item><p>FacebookAI/xlm-roberta-large [16]</p></list-item>
          <list-item><p>openai-community/roberta-large-openai-detector [17]</p></list-item>
          <list-item><p>lmsys/vicuna-7b-v1.5 (RADAR-Vicuna) [18]</p></list-item>
          <list-item><p>google-bert/bert-base-multilingual-cased [19]</p></list-item>
          <list-item><p>allenai/longformer-base-4096</p></list-item>
        </list>
        <p>The selection of models is based on the results of the experiment conducted by Harika Abburi et al. [4]. In addition, models such as Deberta-v3-large and RADAR-Vicuna-7B were included because of their strong performance in classification tasks.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ensemble Techniques</title>
        <p>Six different ensemble techniques were implemented to improve the performance of AI-generated text detection, including a Voting Classifier, a Stacking Classifier, and a Gradient Boosting Classifier. Instead of using raw text features, these classifiers were trained on the class probability scores generated by the large language models for each text sample (a sketch of this setup follows the list below). The six ensemble approaches used in this paper are:</p>
        <list list-type="order">
          <list-item><p>Custom Ensemble: first trains multiple models (Random Forest, XGBoost, LightGBM) and evaluates their performance using cross-validation. It then assigns each model a weight based on its cross-validation score, so that better-performing models receive higher weights. The final prediction is a weighted average of all model predictions.</p></list-item>
          <list-item><p>Bagging (Decision Tree Classifier): leverages Bootstrap Aggregating to reduce the variance of a base model, in this case a Decision Tree. It generates multiple versions of the training dataset by sampling with replacement (bootstrap), ensuring that each base estimator is trained on a slightly different subset of the data.</p></list-item>
          <list-item><p>Bagging (SVC): trains multiple SVC models on different bootstrap samples of the data. The final prediction combines all SVC predictions through majority voting, reducing variance while maintaining SVC's strong classification boundaries.</p></list-item>
          <list-item><p>Voting (Soft): makes the final prediction by averaging the predicted probabilities of all models and choosing the class with the highest average probability.</p></list-item>
          <list-item><p>Gradient Boosting Classifier: builds models sequentially, where each new model learns to correct the errors made by the previous ones. It starts with a simple model and iteratively adds new models that focus on the misclassified examples from previous iterations, each trained on the residual errors.</p></list-item>
          <list-item><p>Stacking Classifier: trains a meta-classifier on the outputs of the base models, learning how best to combine their predictions into a final decision.</p></list-item>
        </list>
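        <p>The following is a minimal sketch of this pipeline for the best-performing configuration, Bagging with SVC over the probability scores of the base LLMs (our own sketch with placeholder values; the hyperparameters are illustrative, not the paper's exact settings):</p>
        <preformat>
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Meta-features: one column per base LLM, holding its P(AI-generated) score
# for each text. Shape: (n_samples, n_models). Values here are placeholders.
X_train = np.array([[0.91, 0.88, 0.95, 0.90, 0.85],
                    [0.12, 0.20, 0.05, 0.15, 0.30],
                    [0.70, 0.65, 0.80, 0.75, 0.60],
                    [0.08, 0.10, 0.02, 0.05, 0.12]])
y_train = np.array([1, 0, 1, 0])  # 1 = AI-generated, 0 = human-written

# Bagging with an SVC base estimator; predictions of the bootstrapped
# SVCs are combined by voting.
ensemble = BaggingClassifier(
    estimator=SVC(probability=True),
    n_estimators=10,
    random_state=42,
)
ensemble.fit(X_train, y_train)

X_new = np.array([[0.85, 0.90, 0.88, 0.92, 0.80]])
print(ensemble.predict(X_new))  # expected: [1]
</preformat>
        <p>In practice, the same probability-matrix construction feeds the other five ensembles; only the classifier on top changes.</p>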
      <p>The models were evaluated using standard classification metrics, including accuracy, precision, recall,
and F1-score.</p>
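        <p>A minimal sketch of this evaluation step (with hypothetical labels; scikit-learn is assumed here, though the paper does not name its evaluation library):</p>
        <preformat>
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # hypothetical gold labels (1 = AI-generated)
y_pred = [1, 0, 1, 0, 0]  # hypothetical model predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(accuracy_score(y_true, y_pred))  # 0.8
print(precision, recall, f1)           # 1.0 0.666... 0.8
</preformat>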
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
<p>In this section, we present and analyse the experimental findings of our AI-generated content detection research. Our evaluation demonstrates the effectiveness of individual transformer-based models and ensemble methods across multiple benchmarks. The results highlight significant performance differences between individual models and ensemble strategies, providing valuable insights for developing robust detection systems for AI-generated text.</p>
<p>The following results are organised to provide clear performance comparisons between different detection approaches. First, we analyse the performance metrics of standalone transformer-based architectures to establish baseline capabilities. Then, we explore how combining these models through various ensemble techniques affects detection performance. The analysis includes both standard ensemble methods applied to all models and specialised ensembling of only top-performing models, to determine optimal integration strategies.</p>
      <sec id="sec-4-1">
        <title>4.1. Results from Individual Models</title>
<p>Initially, experiments were performed to evaluate whether fully fine-tuning an LLM or fine-tuning it while keeping the first five encoder layers frozen, thereby preserving pre-trained knowledge, would achieve better results on the PAN 2025 dataset. The Deberta-v3-large model with the first five encoder layers frozen (0.8347) outperformed its fully fine-tuned counterpart (0.8080), suggesting that preserving pre-trained knowledge improves detection performance. The comparison between fine-tuning the full model and fine-tuning with some layers frozen is shown in Table 2.</p>
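        <p>As a minimal sketch of this freezing strategy (ours, assuming the standard Hugging Face DeBERTa-v3 checkpoint; the paper's exact training setup is not reproduced here):</p>
        <preformat>
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2
)

# Freeze the first five encoder layers so that the pre-trained
# linguistic knowledge in the lower layers is preserved.
for layer in model.deberta.encoder.layer[:5]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the remaining layers and the classification head are updated
# during fine-tuning.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
</preformat>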
        <p>Following this, the COLING 2025 benchmark was used to test the models. On this dataset, the Longformer model achieved the highest F1 score of 0.8377, showing excellent detection capabilities, while the Roberta-large model achieved the lowest performance among the transformer models with an F1 score of 0.7293. The average F1 score across all individual models on this benchmark was 0.748, a score matched by Xlm-roberta (frozen first five encoder layers) and Roberta-openai-detector (frozen first five encoder layers).</p>
        <p>Deberta-v3-large achieved the top performance on the PAN CLEF benchmark with an F1 score of 0.9767. The Vicuna-7b and Roberta-large models also displayed robust capabilities, with F1 scores of 0.9751 and 0.9654, respectively. For the PAN CLEF dataset, the Longformer model showed the lowest performance with a score of 0.9580. The average F1 score across all models for this benchmark was 0.9582, a score matched by bert-base-multilingual-cased. Table 3 shows the results obtained using the COLING 2025 and PAN CLEF 2025 validation datasets.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results after the Ensemble Approach</title>
        <p>After applying ensemble methods to the COLING 2025 Test Set, Bagging with Decision Tree Classifier
achieved the highest F1 score of 0.8399. Both Bagging with SVC and soft Voting methods achieved
similar scores of 0.8324, while Stacking achieved the lowest performance with an F1 score of 0.8199.
The average F1 score across all ensemble methods was 0.832449.</p>
        <p>When evaluating ensemble techniques using only the top 4 performing models on the PAN CLEF validation set, Bagging with SVC presented the best performance with an F1 score of 0.9886. The Gradient Boosting Classifier, soft voting, and Bagging with Decision Tree Classifier showed similar performance, with F1 scores of 0.9876. Stacking again produced the lowest results, with an F1 score of 0.9866. The average performance across these top-model ensemble approaches was 0.9876, showing a clear improvement over ensembles using all models. The results for the ensemble learning algorithms can be seen in Table 4.</p>
        <p>Our results highlight that the optimal approach for AIGC detection involves combining strategically frozen transformer models with ensemble methods, particularly those using support vector classifiers with bagging. The performance gain achieved through the ensemble method indicates that different model architectures capture complementary linguistic features of AI-generated content, enabling more robust detection when combined.</p>
        <p>The proposed ensemble methodology can be effectively deployed in various applications requiring reliable AI content detection, including academic integrity systems, news verification platforms, and social media content moderation. Its high performance on diverse benchmarks suggests strong generalisation capability across different types of AI-generated text, making it particularly valuable for educational institutions and media organisations needing to distinguish between human and machine-written content.</p>
      </sec>
      <sec id="sec-4-3">
<title>4.3. Full Ensemble Learning Results on the PAN CLEF Dataset</title>
        <p>The full metric results of the ensemble learning approaches applied to the PAN CLEF 2025 dataset are presented in Table 5.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Analysis of the Results</title>
<p>The experimental results reveal several key patterns in model performance across different test datasets. The Longformer model demonstrated superior effectiveness with the highest F1 score (0.8377) on the COLING 2025 benchmark, establishing it as the most reliable detector in our evaluation. Notably, the Deberta-v3-large model with the first five encoder layers frozen (0.8347) significantly outperformed its fully fine-tuned counterpart (0.8080), suggesting that preserving pre-trained knowledge structures enhances detection capabilities.</p>
<p>Performance consistency varied considerably across models when tested on different datasets. While some models maintained relatively stable performance metrics, others showed noticeable changes, indicating sensitivity to dataset-specific characteristics. This variability underscores the importance of comprehensive evaluation across diverse test conditions when deploying AI-generated text detection systems in real-world applications.</p>
<p>Based on the experiments, this paper draws the following insights:</p>
      <list list-type="bullet">
        <list-item><p>Freezing encoder layers improved detection performance. Models with the first five encoder layers frozen consistently outperformed their fully fine-tuned counterparts across multiple architectures. For instance, DeBERTa-v3-large demonstrated a performance gain of approximately 2.67%. This suggests that retaining the linguistic knowledge embedded during pre-training, while allowing higher layers to adapt to the detection task, results in a more effective framework for distinguishing between human and AI-generated content.</p></list-item>
        <list-item><p>Domain-specific performance gaps reveal detection challenges. Model performance varied significantly across texts from the two datasets, PAN CLEF 2025 and COLING 2025. This dataset and domain sensitivity highlights the need for either domain-specific fine-tuning or ensemble methods that incorporate specialised detectors tailored to different content types.</p></list-item>
        <list-item><p>Ensemble approaches improved classification performance. Both bagging and stacking ensemble techniques were evaluated across two benchmark datasets, COLING 2025 and PAN CLEF 2025. On the COLING 2025 test set, individual fine-tuned LLMs achieved an average F1 score of 0.7953, with the best-performing model being Longformer (F1 = 0.8377). In comparison, ensemble models achieved a higher average F1 score of 0.8299, with the best ensemble method, Bagging (Decision Tree Classifier), reaching 0.8399. This marks an approximate relative improvement of 4.3% over the average individual model and a marginal gain over the top single model. On the PAN CLEF 2025 validation set, the advantage of ensembles was even more pronounced: the average F1 score of individual fine-tuned models was 0.9660, while ensemble methods achieved an average of 0.9878. The best ensemble, Bagging (SVC), achieved an F1 of 0.9886, outperforming the top individual model (DeBERTa-v3-large, F1 = 0.9767) by 1.2 percentage points. These results demonstrate that ensemble methods not only improve generalisation but also help reduce the variance and overfitting tendencies of individual models, especially in high-stakes classification tasks.</p></list-item>
      </list>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>This research demonstrates the effectiveness of ensemble learning approaches for AI-generated text detection. Our findings show that strategically organised ensemble methods significantly outperform individual models, with the best configuration (Bagging with SVC using the top 4 models) achieving an F1 score of 0.87248 on the COLING 2025 benchmark, a 3.5% improvement over the best single model. The preservation of pre-trained knowledge through frozen encoder layers consistently enhanced detection performance, demonstrated by the Deberta-v3-large model's 2.67% F1 score improvement compared to its fully fine-tuned version.</p>
<p>The strong performance of ensembles, particularly when combining only top-performing models, confirms that different architectures capture complementary linguistic patterns distinguishing AI-generated from human-written text. Despite these achievements, our experiments identify important challenges, including cross-domain performance variability and the need for continuous adaptation to evolving language models. Future work should focus on developing adaptive ensemble approaches, exploring domain-specific detection modules, and investigating interpretability methods to enhance trust in these systems for educational and professional applications.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
<p>This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6183. For Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
<p>During the preparation of this work, the author(s) used ChatGPT and Grammarly for grammar and spelling checking. Further, the author(s) used Claude to generate the images in Figure 1. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsivgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abassy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Ta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Elozeiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the “VoightKampf” Generative AI Authorship Verification Task at PAN</article-title>
          and
          <article-title>ELOQUENT 2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Voight-Kampf Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>