<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing AI Text Detection with Frozen Pretrained Encoders and Ensemble Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shushanta Pudasaini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Miralles-Pechuán</string-name>
          <email>luis.miralles@TUDublin.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Lillis</string-name>
          <email>david.lillis@ucd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marisa Llorens Salvador</string-name>
          <email>marisa.llorens@TUDublin.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technological University Dublin</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University College Dublin</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>82</volume>
      <issue>2025</issue>
      <fpage>236</fpage>
      <lpage>241</lpage>
      <abstract>
<p>As AI systems become increasingly capable of generating text, distinguishing it from human-written content remains an ongoing research challenge. This paper proposes a simple yet effective ensemble-based approach for detecting AI-generated text using pre-trained encoders. Six different Large Language Models (LLMs) were fine-tuned on the PAN CLEF 2025 training set, and six ensemble learning approaches were applied on top of the five best-performing LLMs. These models were evaluated on the PAN CLEF validation dataset and a subset of the COLING 2025 dataset to assess performance across multiple datasets and domains. Experiments on benchmark datasets show that ensemble approaches significantly outperform individual models, achieving improved F1 scores and robustness across diverse AI-generated text samples. The best configuration (Bagging with a support vector classifier on top of the outputs of the top 5 performing individual LLMs) achieved an F1 score of 0.9886 on the PAN CLEF 2025 benchmark, compared to the F1 score of 0.9767 from the individual Deberta-v3-large model on the same benchmark dataset. Likewise, the preservation of pre-trained knowledge through frozen encoder layers consistently improved detection performance, demonstrated by the Deberta-v3-large model's 2.67% F1 score improvement compared to its fully fine-tuned version. Overall, ensemble learning algorithms applied on top of LLMs were found to improve performance on the AI-generated text detection task, as evaluated in the Voight-Kampff Generative AI Detection 2025 task [1], part of PAN at CLEF 2025 [2], with the submission made through the TIRA platform [3]. The research is publicly available on GitHub at https://github.com/ShushantaTUD/Ensemble-Based-AI-Generated-Text-Detection.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>AI Generated Text Detection</kwd>
        <kwd>Ensemble Learning</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Encoders</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>AI-generated text refers to content generated by Large Language Models (LLMs) like ChatGPT, which are trained on large datasets of human-written text. These LLMs can create essays, articles, and even research papers that mimic human writing styles, making it difficult to distinguish them from human-written content [4]. With the rapid development and availability of LLMs to the general public, their effects have reached across educational, professional, and personal contexts, raising important questions about originality, authenticity, and intellectual integrity.</p>
<p>The educational sector faces particular challenges from AI-generated text, as it enables AI-based plagiarism. Students submit assignments that are partially or completely generated using AI tools, and use these tools to generate answers during online examinations, creating an unfair learning environment [4].</p>
<p>The ability of AI systems to generate convincing fake news articles, social media posts, and technical content is creating information disorder and reducing trust in legitimate sources. LLMs can introduce inaccuracies, fabricated citations, or flawed reasoning that humans cannot easily detect [5]. As AI systems become increasingly capable of generating text, developing robust and adaptable detection systems is not just a technical challenge but a necessary step to maintain trust, fairness, and the integrity of information in society.</p>
<p>Existing methods for detecting AI-generated text involve various strategies, such as supervised detection, zero-shot detection, retrieval-based detection, watermarking methods, and discriminating features [6]. Supervised detection involves models that are fine-tuned on AI-generated and human-written text. This approach typically requires large datasets, making it difficult to collect sufficiently large and diverse sample collections. Another approach to detecting AI-generated text is zero-shot detection, which uses pre-trained algorithms, eliminating the need to collect a large dataset [7]. Retrieval-based detection is another method: it compares the semantic similarity of a given text with previously stored AI-generated texts, and therefore relies heavily on an extensive and up-to-date database of AI-generated texts.</p>
<p>AI-generated texts can be embedded with a model signature that is invisible to the human eye, allowing them to be detected only by a computer. This method is known as watermarking. Another approach to detecting AI-generated text involves identifying distinctive traits that separate AI-generated texts from human-written texts, such as statistical or linguistic features [6]. Despite these various detection methods, the evolving capabilities of AI language models continue to present challenges for reliable detection, highlighting the need for ongoing research and development of more robust identification techniques.</p>
<p>The difficulty in detecting AI-generated text stems from the basic architecture of LLMs, which is optimised to generate text that mimics human writing. LLMs are trained on vast amounts of human-written text, making AI-generated texts almost indistinguishable from human-written texts [4]. Current AI detection tools show limited effectiveness: OpenAI's detector correctly identifies only 26% of AI-generated texts, indicating the technical complexity of this task [8].</p>
<p>Detecting AI-generated text becomes more complicated as language models evolve rapidly, while detection tools rely on outdated methods and data [9]. Because detection methods cannot be tested until new LLMs are launched, they always lag one step behind. Simple techniques like paraphrasing AI-generated text can easily bypass many detectors [10]. Several challenges further complicate the identification of AI-generated text: the absence of standardised benchmarks for evaluating detection accuracy, the high computational cost of these tools, inherent biases that may unfairly flag texts written by non-native English speakers as AI-generated, the rapid advancement of LLMs that outpaces detector development, and susceptibility to adversarial attacks [9, 11].</p>
<p>To address the limitations of individual models, ensemble learning has become a powerful strategy for enhancing the detection of AI-generated content (AIGC). Ensemble methods combine the strengths of multiple models, each capable of identifying different patterns or compensating for the weaknesses of others. By aggregating predictions through voting, averaging, or weighted combinations, ensemble approaches help reduce errors caused by model bias (oversimplification) or variance (over-sensitivity to data). As a result, the detection of AI-generated text becomes more accurate, robust, and reliable [4].</p>
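      <p>As a minimal illustration of the averaging step (our own sketch, using hypothetical probability values rather than outputs of any of the models discussed later):</p>
      <preformat>
import numpy as np

# Hypothetical class-probability outputs of three detectors for two texts,
# with columns [P(human-written), P(AI-generated)].
probs_a = np.array([[0.90, 0.10], [0.30, 0.70]])
probs_b = np.array([[0.80, 0.20], [0.40, 0.60]])
probs_c = np.array([[0.60, 0.40], [0.20, 0.80]])

# Soft voting: average the probabilities, then pick the most likely class.
avg = np.mean([probs_a, probs_b, probs_c], axis=0)
pred = np.argmax(avg, axis=1)
print(avg)   # approx. [[0.767 0.233], [0.30 0.70]]
print(pred)  # [0 1]: text 1 is judged human-written, text 2 AI-generated
</preformat>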
<p>Ensemble methods use techniques such as bagging and boosting. This collective approach is especially valuable for complex detection tasks where single models struggle to capture all relevant features, and research suggests that ensembles are more resilient to adversarial attacks and generalise better between different types of AI-generated content [4]. Thus, ensemble learning offers a practical path forward in addressing the technical and evolving challenges of detecting AI-generated text.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>The challenge of distinguishing human-written text from machine-generated content has grown rapidly
with the widespread use of large language models. However, this problem did not emerge suddenly; it
evolved from earlier research in related areas such as plagiarism detection.</p>
<p>Early approaches to detecting machine-generated text were inspired by plagiarism detection methods. Techniques such as part-of-speech (POS) tag n-grams and perplexity-based measures were used to identify paraphrased or automatically rewritten content. These models performed well on texts with both high and low levels of obfuscation and achieved competitive results under the Plagdet evaluation metric [12].</p>
<p>As research progressed, more efficient models were proposed. One such model was the Weighted Neural Bag-of-n-grams (WNB-ngram), a lightweight neural network designed for text classification tasks. It performed well on datasets like Yelp Reviews and IMDB, demonstrating that even small models could capture meaningful linguistic patterns [12].</p>
      <p>With the rise of deep learning, researchers began framing AI text detection as a classification problem.
Large pre-trained language models such as RoBERTa and DeBERTa were fine-tuned on specially curated
datasets to detect subtle patterns in text that distinguish between human and AI-written content [13].</p>
<p>In a different approach, Harika Abburi et al. [14] applied classical machine learning techniques like Gradient Boosting, Stacking, and Voting. Instead of raw text, these models used the probability outputs from various pre-trained LLMs as input features. Their system achieved high performance in the AuTexTification shared task, ranking first in model attribution for both English and Spanish texts.</p>
      <p>In parallel, real-world tools for AI text detection started to appear. OpenAI released its classifier in
2023, but it was later discontinued due to poor accuracy on short or factual inputs. Other tools, such as
GLTR, GPTZero, and DetectGPT, experimented with analysing token-level likelihoods and distribution
shifts to identify AI-generated text [13].</p>
      <p>Together, these developments show how the field has moved from early rule-based techniques to
classical machine learning, and now to advanced fine-tuned language models. Despite these advances,
reliable AI text detection, especially in open-domain settings, remains an ongoing research challenge.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
<p>Our study proposes a simple yet effective ensemble-based approach for detecting AI-generated text using large language models. The methodology consists of four main components: dataset preparation, base model selection, feature engineering, and ensemble learning. The overall methodology of the experiment is represented in Fig. 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets for Training and Evaluation</title>
        <p>The benchmark datasets used contain both human-written and AI-generated texts. These include datasets from COLING 2025 and PAN CLEF, which provide labelled samples in English; each dataset is preprocessed using the tokeniser of the corresponding model. We split the training dataset into training (80%) and validation (20%) subsets, and made predictions on the test set.</p>
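        <p>As a minimal sketch of this preparation step (assuming, hypothetically, that the training data is available as a CSV file; the file and column names are illustrative, not from the released datasets):</p>
        <preformat>
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Hypothetical file and column names for the PAN CLEF training data.
df = pd.read_csv("pan_clef_2025_train.csv")  # columns: text, label (0=human, 1=AI)

# 80/20 train/validation split, stratified to preserve the class balance.
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Each base model uses its own tokeniser, e.g. for DeBERTa-v3-large:
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
train_enc = tokenizer(
    train_df["text"].tolist(), truncation=True, padding=True, return_tensors="pt"
)
</preformat>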
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Base LLMs</title>
        <p>The following LLMs were used in the experiment:</p>
        <list list-type="bullet">
          <list-item><p>microsoft/deberta-v3-large [15]</p></list-item>
          <list-item><p>FacebookAI/xlm-roberta-large [16]</p></list-item>
          <list-item><p>openai-community/roberta-large-openai-detector [17]</p></list-item>
          <list-item><p>lmsys/vicuna-7b-v1.5 (RADAR-Vicuna) [18]</p></list-item>
          <list-item><p>google-bert/bert-base-multilingual-cased [19]</p></list-item>
          <list-item><p>allenai/longformer-base-4096</p></list-item>
        </list>
        <p>The selection of models is based on the results of the experiment conducted by Harika Abburi et al. [4]. In addition, models such as Deberta-v3-large and RADAR-Vicuna-7B were included because of their strong performance in classification tasks.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ensemble Techniques</title>
        <p>Six different ensemble techniques were implemented to improve the performance of AI-generated text detection, including a Voting Classifier, a Stacking Classifier, and a Gradient Boosting Classifier. Instead of using raw text features, these classifiers were trained on the class probability scores generated by the large language models for each text sample (a sketch of this setup follows the list below). The six ensemble approaches used in this paper are:</p>
        <list list-type="order">
          <list-item><p>Custom Ensemble: first trains multiple models (Random Forest, XGBoost, LightGBM) and evaluates their performance using cross-validation. It then assigns each model a weight based on its cross-validation score, so that better-performing models receive higher weights. The final prediction is a weighted average of all model predictions.</p></list-item>
          <list-item><p>Bagging (Decision Tree Classifier): leverages Bootstrap Aggregating to reduce the variance of a base model, in this case a Decision Tree. It generates multiple versions of the training dataset by sampling with replacement (bootstrap), ensuring that each base estimator is trained on a slightly different subset of the data.</p></list-item>
          <list-item><p>Bagging (SVC): trains multiple SVC models on different bootstrap samples of the data. The final prediction combines all SVC predictions through majority voting, reducing variance while maintaining SVC's strong classification boundaries.</p></list-item>
          <list-item><p>Voting (Soft): makes the final prediction by averaging the predicted probabilities of all models and choosing the class with the highest average probability.</p></list-item>
          <list-item><p>Gradient Boosting Classifier: builds models sequentially, where each new model learns to correct the errors made by the previous ones. It starts with a simple model and iteratively adds new models that focus on the misclassified examples from previous iterations, each trained on the residual errors.</p></list-item>
          <list-item><p>Stacking Classifier: trains a meta-classifier on the outputs of the base models, learning how best to combine their predictions into a final decision.</p></list-item>
        </list>
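        <p>The following is a minimal sketch of this pipeline for the best-performing configuration, Bagging with SVC over the probability scores of the base LLMs (our own sketch with placeholder values; the hyperparameters are illustrative, not the paper's exact settings):</p>
        <preformat>
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Meta-features: one column per base LLM, holding its P(AI-generated) score
# for each text. Shape: (n_samples, n_models). Values here are placeholders.
X_train = np.array([[0.91, 0.88, 0.95, 0.90, 0.85],
                    [0.12, 0.20, 0.05, 0.15, 0.30],
                    [0.70, 0.65, 0.80, 0.75, 0.60],
                    [0.08, 0.10, 0.02, 0.05, 0.12]])
y_train = np.array([1, 0, 1, 0])  # 1 = AI-generated, 0 = human-written

# Bagging with an SVC base estimator; predictions of the bootstrapped
# SVCs are combined by voting.
ensemble = BaggingClassifier(
    estimator=SVC(probability=True),
    n_estimators=10,
    random_state=42,
)
ensemble.fit(X_train, y_train)

X_new = np.array([[0.85, 0.90, 0.88, 0.92, 0.80]])
print(ensemble.predict(X_new))  # expected: [1]
</preformat>
        <p>In practice, the same probability-matrix construction feeds the other five ensembles; only the classifier on top changes.</p>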
      <p>The models were evaluated using standard classification metrics, including accuracy, precision, recall,
and F1-score.</p>
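        <p>A minimal sketch of this evaluation step (with hypothetical labels; scikit-learn is assumed here, though the paper does not name its evaluation library):</p>
        <preformat>
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # hypothetical gold labels (1 = AI-generated)
y_pred = [1, 0, 1, 0, 0]  # hypothetical model predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(accuracy_score(y_true, y_pred))  # 0.8
print(precision, recall, f1)           # 1.0 0.666... 0.8
</preformat>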
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
<p>In this section, we present and analyse the experimental findings of our AI-generated content detection research. Our evaluation demonstrates the effectiveness of individual transformer-based models and ensemble methods across multiple benchmarks. The results highlight significant performance differences between individual models and ensemble strategies, providing valuable insights for developing robust detection systems for AI-generated text.</p>
<p>The following results are organised to provide clear performance comparisons between different detection approaches. First, we analyse the performance metrics of standalone transformer-based architectures to establish baseline capabilities. Then, we explore how combining these models through various ensemble techniques affects detection performance. The analysis includes both standard ensemble methods applied to all models and specialised ensembling of only top-performing models, to determine optimal integration strategies.</p>
      <sec id="sec-4-1">
        <title>4.1. Results from Individual Models</title>
<p>Initially, experiments were performed to evaluate whether fully fine-tuning an LLM or fine-tuning it while keeping the first five encoder layers frozen, thereby preserving pre-trained knowledge, would achieve better results on the PAN 2025 dataset. The Deberta-v3-large model with the first five encoder layers frozen (0.8347) outperformed its fully fine-tuned counterpart (0.8080), suggesting that preserving pre-trained knowledge improves detection performance. The comparison between fine-tuning the full model and fine-tuning with some layers frozen is shown in Table 2.</p>
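        <p>As a minimal sketch of this freezing strategy (ours, assuming the standard Hugging Face DeBERTa-v3 checkpoint; the paper's exact training setup is not reproduced here):</p>
        <preformat>
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2
)

# Freeze the first five encoder layers so that the pre-trained
# linguistic knowledge in the lower layers is preserved.
for layer in model.deberta.encoder.layer[:5]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the remaining layers and the classification head are updated
# during fine-tuning.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
</preformat>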
        <p>Following this, the COLING 2025 benchmark was used to test the models. On this dataset, the Longformer model achieved the highest F1 score of 0.8377, showing excellent detection capabilities, while the Roberta-large model achieved the lowest performance among the transformer models with an F1 score of 0.7293. The average F1 score across all individual models on this benchmark was 0.748, a score matched by Xlm-roberta (frozen first five encoder layers) and Roberta-openai-detector (frozen first five encoder layers).</p>
        <p>Deberta-v3-large achieved the top performance on the PAN CLEF benchmark with an F1 score of 0.9767. The Vicuna-7b and Roberta-large models also displayed robust capabilities, with F1 scores of 0.9751 and 0.9654, respectively. For the PAN CLEF dataset, the Longformer model showed the lowest performance with a score of 0.9580. The average F1 score across all models for this benchmark was 0.9582, a score matched by bert-base-multilingual-cased. Table 3 shows the results obtained using the COLING 2025 and PAN CLEF 2025 validation datasets.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results after the Ensemble Approach</title>
        <p>After applying ensemble methods to the COLING 2025 Test Set, Bagging with Decision Tree Classifier
achieved the highest F1 score of 0.8399. Both Bagging with SVC and soft Voting methods achieved
similar scores of 0.8324, while Stacking achieved the lowest performance with an F1 score of 0.8199.
The average F1 score across all ensemble methods was 0.832449.</p>
        <p>When evaluating ensemble techniques using only the top 4 performing models on the PAN CLEF validation set, Bagging with SVC presented the best performance with an F1 score of 0.9886. The Gradient Boosting Classifier, soft voting, and Bagging with Decision Tree Classifier showed similar performance, with F1 scores of 0.9876. Stacking again produced the lowest results, with an F1 score of 0.9866. The average performance across these top-model ensemble approaches was 0.9876, showing a clear improvement over ensembles using all models. The results for the ensemble learning algorithms can be seen in Table 4.</p>
        <p>Our results highlight that the optimal approach for AIGC detection involves combining strategically frozen transformer models with ensemble methods, particularly those using support vector classifiers with bagging. The performance gain achieved through the ensemble method indicates that different model architectures capture complementary linguistic features of AI-generated content, enabling more robust detection when combined.</p>
        <p>The proposed ensemble methodology can be effectively deployed in various applications requiring reliable AI content detection, including academic integrity systems, news verification platforms, and social media content moderation. Its high performance on diverse benchmarks suggests strong generalisation capability across different types of AI-generated text, making it particularly valuable for educational institutions and media organisations needing to distinguish between human and machine-written content.</p>
      </sec>
      <sec id="sec-4-3">
<title>4.3. Full Ensemble Learning Results on the PAN CLEF Dataset</title>
        <p>The full metric results of the ensemble learning approaches applied to the PAN CLEF 2025 dataset are presented in Table 5.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Analysis of the Results</title>
<p>The experimental results reveal several key patterns in model performance across different test datasets. The Longformer model demonstrated superior effectiveness with the highest F1 score (0.8377) on the COLING 2025 benchmark, establishing it as the most reliable detector in our evaluation. Notably, the Deberta-v3-large model with the first five encoder layers frozen (0.8347) significantly outperformed its fully fine-tuned counterpart (0.8080), suggesting that preserving pre-trained knowledge structures enhances detection capabilities.</p>
<p>Performance consistency varied considerably across models when tested on different datasets. While some models maintained relatively stable performance metrics, others showed noticeable changes, indicating sensitivity to dataset-specific characteristics. This variability underscores the importance of comprehensive evaluation across diverse test conditions when deploying AI-generated text detection systems in real-world applications.</p>
<p>Based on the experiments, this paper draws the following insights:</p>
      <list list-type="bullet">
        <list-item><p>Freezing encoder layers improved detection performance. Models with the first five encoder layers frozen consistently outperformed their fully fine-tuned counterparts across multiple architectures. For instance, DeBERTa-v3-large demonstrated a performance gain of approximately 2.67%. This suggests that retaining the linguistic knowledge embedded during pre-training, while allowing higher layers to adapt to the detection task, results in a more effective framework for distinguishing between human and AI-generated content.</p></list-item>
        <list-item><p>Domain-specific performance gaps reveal detection challenges. Model performance varied significantly across texts from the two datasets, PAN CLEF 2025 and COLING 2025. This dataset and domain sensitivity highlights the need for either domain-specific fine-tuning or ensemble methods that incorporate specialised detectors tailored to different content types.</p></list-item>
        <list-item><p>Ensemble approaches improved classification performance. Both bagging and stacking ensemble techniques were evaluated across two benchmark datasets, COLING 2025 and PAN CLEF 2025. On the COLING 2025 test set, individual fine-tuned LLMs achieved an average F1 score of 0.7953, with the best-performing model being Longformer (F1 = 0.8377). In comparison, ensemble models achieved a higher average F1 score of 0.8299, with the best ensemble method, Bagging (Decision Tree Classifier), reaching 0.8399. This marks an approximate relative improvement of 4.3% over the average individual model and a marginal gain over the top single model. On the PAN CLEF 2025 validation set, the advantage of ensembles was even more pronounced: the average F1 score of individual fine-tuned models was 0.9660, while ensemble methods achieved an average of 0.9878. The best ensemble, Bagging (SVC), achieved an F1 of 0.9886, outperforming the top individual model (DeBERTa-v3-large, F1 = 0.9767) by 1.2 percentage points. These results demonstrate that ensemble methods not only improve generalisation but also help reduce the variance and overfitting tendencies of individual models, especially in high-stakes classification tasks.</p></list-item>
      </list>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>This research demonstrates the effectiveness of ensemble learning approaches for AI-generated text detection. Our findings show that strategically organised ensemble methods significantly outperform individual models, with the best configuration (Bagging with SVC using the top 4 models) achieving an F1 score of 0.87248 on the COLING 2025 benchmark, a 3.5% improvement over the best single model. The preservation of pre-trained knowledge through frozen encoder layers consistently enhanced detection performance, demonstrated by the Deberta-v3-large model's 2.67% F1 score improvement compared to its fully fine-tuned version.</p>
<p>The strong performance of ensembles, particularly when combining only top-performing models, confirms that different architectures capture complementary linguistic patterns distinguishing AI-generated from human-written text. Despite these achievements, our experiments identify important challenges, including cross-domain performance variability and the need for continuous adaptation to evolving language models. Future work should focus on developing adaptive ensemble approaches, exploring domain-specific detection modules, and investigating interpretability methods to enhance trust in these systems for educational and professional applications.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
<p>This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6183. For Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
<p>During the preparation of this work, the author(s) used ChatGPT and Grammarly for grammar and spelling checking. Further, the author(s) used Claude to generate the images in Figure 1. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsivgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abassy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Ta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Elozeiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the “VoightKampf” Generative AI Authorship Verification Task at PAN</article-title>
          and
          <article-title>ELOQUENT 2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Voight-Kampf Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>