<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generative AI Authorship Verification Based on Contrastive-Enhanced Dual-Model Decision System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Junlang Liu</string-name>
          <email>liujunlang2015@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <email>kongleilei@fosu.edu.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhenyu Peng</string-name>
          <email>pengzhenyu1411@163.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Feifan Chen</string-name>
          <email>chenfeifan0203@163.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Distinguishing human-written text from content produced by large language models (LLMs) remains a moving target, especially when detectors face unseen generators. We formalize the CLEF PAN 2025 Generative-AI Authorship Verification task as a text classification problem, employing a contrastive-enhanced ModernBERT-large approach, a Qwen3-based approach, and a fusion method combining both approaches. Specifically, to implement contrastive learning in the contrastive-enhanced method, we applied the large language model ChatGPT-4.1 for data augmentation, rewriting 1,000 human-written sentences. On the official validation set, our contrastive-enhanced method achieves a 0.997 mean score, with all five PAN metrics above 0.99. On the hidden test set, our submitted single-model ModernBERT-large (CE + SCL) achieves a 0.871 mean score (ROC-AUC = 0.822, F1 = 0.855), ranking 3rd out of 24 teams. These results suggest that the contrastive-enhanced method is competitive even without relying on large-scale ensemble systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative AI Detection</kwd>
        <kwd>Pre-trained Model</kwd>
        <kwd>Contrastive-Enhanced</kwd>
        <kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large language models (LLMs) have drastically lowered the cost of generating fluent text, but this
progress intensifies the need to verify whether content is authored by humans or machines[
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
Experience from the PAN 2024 lab shows that detectors fine-tuned on one generator family often
underperform when faced with unseen models or domains[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Current detection methods face two core challenges:
• Models trained solely with cross-entropy loss tend to focus on surface-level features and struggle
to capture subtle semantic differences between human-written and AI-generated texts.
• Previous ensemble methods, while improving accuracy, depend on multiple large models, resulting
in low inference efficiency and significant deployment barriers due to hardware constraints.
In addition, inspired by the success of noise-based perturbation strategies in computer vision tasks—where
slight input transformations help models generalize better—we explore analogous perturbation
strategies for textual data to improve robustness and semantic representation learning[
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>To address these issues, we propose a lightweight ensemble composed of a bidirectional encoder
model (ModernBERT-large) and an autoregressive decoder model (Qwen3-4B). The encoder branch is
fine-tuned using a joint cross-entropy and supervised contrastive loss to improve discrimination in
borderline cases, while the decoder branch is trained with standard cross-entropy. During inference, the
outputs are combined via mean-probability soft voting, incurring only minimal additional computational
overhead.</p>
      <p>Our experiments show that the proposed method achieves strong robustness on the PAN25
validation set, with all five official metrics—ROC-AUC, Brier, C@1, F1, and F0.5—exceeding 0.99.</p>
      <p>The remainder of this paper is organised as follows: Section 2 reviews related work; Section 3 details
our model design and training; Section 4 presents results and discussion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The fast rise of LLMs has made reliably telling human- from machine-authored text a pressing NLP
problem. Earlier efforts fall into (i) supervised classification, (ii) zero-shot detection, and (iii) multi-model
decision aggregation. Classical lexical-feature classifiers can still rank highly—e.g., a plain SVM built on
TF-IDF matched or beat neural baselines—yet their robustness drops once generators evolve. Conversely,
zero-shot signals such as cross-perplexity generalise well but lag in absolute accuracy.</p>
      <sec id="sec-2-1">
        <title>2.1. Supervised Classification Models</title>
        <p>
          Traditional machine-learning methods remain remarkably competitive. Lorenz et al. employ a linear
SVM trained on TF-IDF features and achieve performance close to the top[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Meanwhile, several teams
fine-tune Transformer-based classifiers. Cao et al. enhance their model by augmenting the training
set with additional human-written samples[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The Tri-Sentence Analysis method splits each long
document into three shorter segments and averages their individual predictions to stabilise the final
decision[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Lin et al. incorporate R-Drop regularisation to reduce the variance caused by dropout
during inference[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Overall, supervised models achieved some of the highest mean scores in the PAN-24
competition. However, despite strong results on validation sets during training, these models often
show reduced robustness when applied to out-of-domain test data, leading to noticeable performance
drops in generalisation scenarios.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Zero-Shot Detection Models</title>
        <p>
          Unsupervised techniques avoid costly annotation by exploiting statistical irregularities in machine text.
Compression-based detectors such as PPMd-CDM treat lower entropy as an AI signature and require
only a generic compressor[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The Binoculars framework measures the ratio between an observer
model’s perplexity and that of a performer model to expose hidden over-repetition in generated text[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
However, their average performance in the PAN-24 competition was notably lower than that of supervised
systems, underscoring an inherent trade-off between broad generality and fine-grained accuracy.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Multi-model Decision Aggregation Models</title>
        <p>
          To enhance robustness, some teams opted to combine multiple detection strategies. BinocularsLLM
integrates two QLoRA-fine-tuned language models with Binoculars-style perplexity scoring, applying
soft voting across all components to reach a final decision[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. This ensemble achieved the top rank in
the competition. LAVA takes a different approach by training separate adapters for different families of
generative models and employs a conservative “unanimous agreement” rule—only predicting human
authorship when all modules concur—effectively reducing false positives[13]. These ensemble-based
systems demonstrated high mean scores in the evaluation, but their improved performance comes at
the cost of increased inference time and memory usage, highlighting the trade-off between speed and
accuracy.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>To build a robust generative-AI authorship verifier, three strategies are developed:
1. ModernBERT-large is fully fine-tuned as a classifier using both cross-entropy and supervised
contrastive loss.
2. Qwen3-4B is fully fine-tuned with cross-entropy loss.</p>
      <p>3. ModernBERT-large and Qwen3-4B are fused via weighted soft voting.</p>
      <p>Our design aims to achieve the following main goals:
• Compare the performance of two different model architectures on the generative AI detection
task after supervised fine-tuning.
• Enhance the overall robustness and generalization of the system by incorporating two
structurally different models.</p>
      <p>Let ℋ = {h_i} be the set of human-written texts (h_i ∈ Σ*) and 𝒜 = {a_j} the set of
AI-generated texts. For 1000 texts h ∈ ℋ we obtain an augmented paraphrase p using the GPT-4.1 model;
the set of all augmented texts is 𝒫 = {p_k}. We assign the paraphrase set 𝒫 the machine-generated
label (1), while the corresponding original texts in ℋ retain the human-written label (0). Unless stated
otherwise we denote the complete corpus by 𝒟 = ℋ ∪ 𝒜 ∪ 𝒫 and a generic sample by x ∈ 𝒟.</p>
      <sec id="sec-3-1">
        <title>3.1. Contrastive-Enhanced ModernBERT-large</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Data Augmentation</title>
          <p>To expose the detector to challenging near-human counterfeits, we first sampled 1000 sentences from
the human class and then asked ChatGPT-4.1 (04-01-2025) to rewrite each sentence in its own words
while preserving the original meaning. We call these rewrites paraphrases and assign them the label 1
(machine-generated); their source sentences retain label 0 (human). Because the two versions of every
sentence convey the same idea yet belong to opposite classes, they form hard positive-negative pairs
that sharpen the contrastive objective.</p>
          <p>Balanced mini-batches. Purely shuffling the data can yield mini-batches containing only positives
or only negatives, which dilutes the contrastive signal. Therefore, we deterministically interleave
samples in the order human → paraphrase → human → machine, aiming to keep the class ratio
within each batch as close to 1:1 as possible.</p>
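          <p>A minimal sketch of this deterministic interleaving, assuming plain Python lists of texts (the helper name is ours, not taken from the released code):</p>

```python
from itertools import chain

def interleave(humans, machines):
    # Alternate human and machine samples so every mini-batch sliced in
    # order keeps the class ratio close to 1:1 (hypothetical helper).
    paired = list(chain.from_iterable(zip(humans, machines)))
    # Append any leftover samples from the longer class at the end.
    leftover = humans if len(humans) > len(machines) else machines
    paired.extend(leftover[min(len(humans), len(machines)):])
    return paired
```

          <p>With a batch size of 32, each batch drawn from the interleaved region of this list contains 16 samples of each class.</p>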
          <p>Prompt
System Prompt:
This is a piece of text generated by a human. I want to express the same meaning as this sentence,
but without changing its writing style. Please help me rephrase it. Just output the rephrased
sentence directly.</p>
          <p>User Prompt:
I approach a corner in the hallway as the door to a classroom in front of me opens and a girl steps
out. She is wearing a form fitting black shirt with ...</p>
          <p>Answer:
I round a corner in the hallway just as the door of a classroom ahead swings open and a girl steps
out. She’s dressed in a fitted black shirt, snug yet ...</p>
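          <p>For illustration, the augmentation request above can be assembled in the standard chat-message format; the request schema and the model identifier string are assumptions here, since the paper specifies only the prompt text and the ChatGPT-4.1 model family:</p>

```python
SYSTEM_PROMPT = (
    "This is a piece of text generated by a human. I want to express the same "
    "meaning as this sentence, but without changing its writing style. Please "
    "help me rephrase it. Just output the rephrased sentence directly."
)

def build_paraphrase_request(text, model="gpt-4.1"):
    # Assemble the request payload in the common chat-message schema;
    # the model identifier is an assumption.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    }
```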
          <p>For the augmented dataset, we first separated human-written texts and AI-generated texts and then
inserted them alternately, one by one, into the training dataset. The remaining AI-generated
texts were inserted at random positions, and the 1000 augmented samples were guaranteed to appear in the
same training batches as their original human-written counterparts. The final statistics of the training data are
presented in Table 1.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Supervised Fine-Tuning with Joint Loss</title>
          <p>To better capture nuanced semantic differences, we adopt a supervised fine-tuning strategy combined
with contrastive learning, training the LLM directly on labeled data. Specifically, we attach a fully
connected classification head to the hidden representation of the [CLS] token, allowing the model
to output a binary label given an input text—where 0 denotes a human-written text and 1 denotes a
machine-generated one.</p>
          <p>To enhance the model’s ability to discriminate between subtle semantic patterns, we incorporate
supervised contrastive learning following the formulation proposed by Gunel et al. [14]. The overall
training objective is a weighted combination of cross-entropy loss and supervised contrastive loss. The
final loss function is defined as follows:</p>
          <p>ℒ = (1 − λ) · ℒ_CE + λ · ℒ_SCL (1)</p>
          <p>ℒ_CE = −(1/N) Σ_{i=1}^{N} log p(y_i | x_i) (2)</p>
          <p>ℒ_SCL = −(1/N) Σ_{i=1}^{N} (1/|P(i)|) Σ_{p∈P(i)} log [ exp(z_i⊤z_p / τ) / Σ_{j=1, j≠i}^{N} exp(z_i⊤z_j / τ) ] (3)</p>
          <p>Specifically, P(i) denotes the set of positive samples that share the same class label as the anchor
sample i, z_i represents the hidden representation (feature vector) extracted by the model, and τ ∈ ℝ⁺ is
a temperature hyperparameter that controls the concentration level of the similarity distribution. This
formulation encourages the model to bring semantically similar samples closer in the representation
space while pushing apart dissimilar ones, thereby improving class-level discrimination.</p>
          <p>Equation (1) represents the overall loss, Equation (2) corresponds to the standard cross-entropy loss,
and Equation (3) denotes the contrastive learning loss.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Supervised Fine-Tuning with LLMs</title>
        <p>For the decoder-based model, we adopt Qwen3-4B as our backbone. The fine-tuning strategy is similar
to that used in the encoder-based model. Specifically, we add a fully connected classification head to the
output vector of the last token after decoding, and perform binary classification—predicting whether a
given input text is human-written or machine-generated.</p>
        <p>Unlike the encoder-based model, this decoder-only model is trained using only the standard
cross-entropy loss, as limited GPU memory prevented us from incorporating the contrastive-learning
loss.</p>
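        <p>The last-token pooling head described above can be sketched as follows; the hidden size and module layout are placeholders, and the Qwen3 backbone itself is omitted:</p>

```python
import torch
import torch.nn as nn

class LastTokenClassifier(nn.Module):
    # Binary head over the final non-padding token's hidden state
    # (backbone and dimensions are stand-ins, not the authors' code).
    def __init__(self, hidden_size=2560, num_labels=2):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq, hidden); pick each sequence's last real token.
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden_states.size(0))
        pooled = hidden_states[batch_idx, last_idx]
        return self.head(pooled)
```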
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Contrastive-enhanced Dual-Model Decision (CeDD)</title>
        <p>To combine the strengths of both the encoder-based and decoder-based models, we aggregate their
prediction outputs using a soft voting strategy. Specifically, the final prediction probability is computed
as the mean of the individual classification probabilities, p_final = (p_enc + p_dec) / 2, and the label is
assigned as ŷ = 1 (machine-generated) if p_final ≥ 0.5, and ŷ = 0 (human-written) if p_final &lt; 0.5.</p>
        <p>This simple yet effective fusion mechanism leverages the complementary inductive biases of the two
model architectures. It improves prediction robustness without introducing significant computational
overhead and helps mitigate model-specific errors on borderline or ambiguous samples.</p>
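        <p>The fusion rule amounts to a few lines; the function below is an illustrative sketch with equal weights, matching the mean-probability description above:</p>

```python
def soft_vote(p_encoder, p_decoder, threshold=0.5):
    # Mean-probability soft voting over the two branches.
    p_final = (p_encoder + p_decoder) / 2.0
    label = 1 if p_final >= threshold else 0  # 1 = machine, 0 = human
    return p_final, label
```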
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>In this section, we present the implementation details, evaluation metrics, and provide a comprehensive
analysis of the results. We evaluate our three methods on the test datasets using the TIRA platform [15].</p>
      <sec id="sec-4-1">
        <title>4.1. Implementation Details</title>
        <p>The training of CeDD was implemented in PyTorch and executed on a single Nvidia
A800 GPU. The model was trained in bf16 precision to balance numerical stability and training
efficiency. The fine-tuning process lasted 3 epochs, using the AdamW optimizer with a learning rate
of 2e-5. The batch size was set to 32 without gradient accumulation. For the
ModernBERT-large model, the training objective combined standard cross-entropy loss with supervised contrastive
loss, with a lambda weight of 0.9 and a temperature of 0.3. The warm-up ratio was set to 0.1, and
training logs were recorded every 50 steps. An independent validation set was used during training for
evaluation. All experiments were conducted under a fixed random seed and employed cosine learning
rate scheduling to ensure reproducibility.</p>
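        <p>The reported optimizer setup (AdamW, learning rate 2e-5, warm-up ratio 0.1, cosine schedule) can be sketched as follows; the exact warm-up-then-cosine shape is our assumption:</p>

```python
import math
import torch

def make_optimizer_and_schedule(model, total_steps, lr=2e-5, warmup_ratio=0.1):
    # AdamW with linear warm-up followed by cosine decay to zero;
    # the precise schedule implementation is an assumption.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    warmup = int(total_steps * warmup_ratio)

    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```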
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Metrics</title>
        <p>To evaluate the performance of our proposed model, we used the evaluation metrics provided by PAN25,
which include the following metrics:
• ROC–AUC: The area under the ROC (Receiver Operating Characteristic) curve.
• Brier: The complement of the Brier score (mean squared loss).
• C@1: A modified accuracy score that assigns non-answers (score = 0.5) the average accuracy of
the remaining cases.
• F1: The harmonic mean of precision and recall.
• F0.5: A modified F0.5 measure (precision-weighted F measure) that treats non-answers (score =
0.5) as false negatives.</p>
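        <p>For runs with no non-answers (no scores of exactly 0.5), these metrics can be approximated with scikit-learn as below; the official PAN evaluator additionally handles non-answers, which this sketch omits, and C@1 then reduces to plain accuracy:</p>

```python
import numpy as np
from sklearn.metrics import brier_score_loss, f1_score, fbeta_score, roc_auc_score

def pan_metrics(y_true, y_score):
    # Approximate PAN25 metrics assuming every case is answered.
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= 0.5).astype(int)
    return {
        "roc-auc": roc_auc_score(y_true, y_score),
        "brier": 1.0 - brier_score_loss(y_true, y_score),  # complement, as in PAN
        "c@1": float((y_pred == y_true).mean()),           # accuracy when all answered
        "f1": f1_score(y_true, y_pred),
        "f05": fbeta_score(y_true, y_pred, beta=0.5),
    }
```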
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Validation-set Results</title>
        <p>As mentioned earlier, we compare three approaches for detecting AI-generated text: classification using
an encoder-based model, classification using a decoder-based model, and a contrastive-enhanced
dual-model decision strategy that combines both. The performance of these models under the three approaches
is summarized in Table 2, based on evaluations on the validation dataset.</p>
        <p>Upon analyzing the results shown in Table 2, it is evident that ModernBERT-large delivers the most
stable and consistent performance across all evaluation metrics. Notably, it achieves an F1 score of 0.998
and an F0.5 score of 0.999, highlighting its effectiveness and accuracy in text classification tasks.</p>
        <p>Qwen3-4B also performs competitively, especially in the Brier and mean scores, reflecting its strength
in handling order-sensitive or generative-context inputs. This supports the effectiveness of the
decoder-only architecture.</p>
        <p>Our final system CeDD integrates both models and demonstrates near-optimal results across all
six metrics. This confirms the effectiveness of our CeDD in enhancing the robustness, stability, and
accuracy of generative authorship verification.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Test-set Results</title>
        <p>On the hidden PAN 2025 test set, evaluated via the TIRA platform, our submitted single-model
ModernBERT-large (CE + SCL) achieved a mean score of 0.871, with ROC-AUC = 0.822 and F1 = 0.855,
ranking 3rd out of 24 teams.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work presents a supervised contrastive learning approach built upon the ModernBERT-large model
for the CLEF PAN 2025 Generative-AI Authorship Verification task. By jointly optimizing cross-entropy
loss and supervised contrastive loss, our method improves the model’s ability to distinguish between
human-written and AI-generated texts.</p>
      <p>• On the official validation set, ModernBERT-large (CE+SCL) achieved a near-perfect mean score
of 0.998 across all PAN metrics.
• On the hidden test set, this single-model approach obtained a mean score of 0.871, ranking 3rd
out of 24 teams, confirming the effectiveness of our design.</p>
      <p>In addition to the above results, we summarize the following key insights: (i) Supervised contrastive
learning substantially enhances class separability and semantic discrimination; (ii) A single
well-regularized encoder model can outperform complex ensembles while remaining efficient and scalable;
(iii) Paraphrased data generated by GPT-4.1 serves as highly effective contrastive pairs during training,
especially in narrowing the gap between human-like machine outputs and real human writing.</p>
      <p>Overall, our findings show that a contrastively fine-tuned ModernBERT encoder can achieve strong
performance on generative authorship verification, even without relying on large-scale ensemble
systems or decoder-based large language models.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Social Science Foundation of China (No. 22BTQ101).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-o3 for grammar and spelling
checking. After using this tool, the author(s) reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
      <p>P. Galuščáková, A. G. S. Herrera (Eds.), Working Notes Papers of the CLEF 2024 Evaluation Labs,
CEUR-WS.org, 2024, pp. 2901–2912. URL: http://ceur-ws.org/Vol-3740/paper-281.pdf.
[13] Z. Chen, Y. Han, Y. Yi, Team chen at PAN: Integrating R-Drop and Pre-trained Language Model
for Multi-author Writing Style Analysis, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. Herrera
(Eds.), Working Notes Papers of the CLEF 2024 Evaluation Labs, CEUR-WS.org, 2024, pp. 2547–2553.</p>
      <p>URL: http://ceur-ws.org/Vol-3740/paper-232.pdf.
[14] B. Gunel, J. Du, A. Conneau, V. Stoyanov, Supervised contrastive learning for pre-trained language
model fine-tuning, 2021. URL: https://arxiv.org/abs/2011.01403. arXiv:2011.01403.
[15] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: Advances in Information
Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer
Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Voight-Kampf Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsivgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abassy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Ta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Elozeiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the “VoightKampf” Generative AI Authorship Verification Task at PAN</article-title>
          and
          <article-title>ELOQUENT 2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <article-title>Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024)</source>
          , Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>A simple framework for contrastive learning of visual representations</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé III</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 37th International Conference on Machine Learning</source>
          , volume
          <volume>119</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>1597</fpage>
          -
          <lpage>1607</lpage>
          . URL: https://proceedings.mlr.press/v119/chen20j.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Recht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Understanding deep learning (still) requires rethinking generalization</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>64</volume>
          (
          <year>2021</year>
          )
          <fpage>107</fpage>
          -
          <lpage>115</lpage>
          . URL: https://doi.org/10.1145/3446776. doi:10.1145/3446776.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. Z.</given-names>
            <surname>Aygüler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schlatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mirzakhmedova</surname>
          </string-name>
          ,
          <article-title>Baseline Avengers at PAN 2024: Often-Forgotten Baselines for LLM-Generated Text Detection</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Papers of the CLEF 2024 Evaluation Labs</source>
          , CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>2761</fpage>
          -
          <lpage>2768</lpage>
          . URL: http://ceur-ws.org/Vol-3740/paper-262.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Enhancing Human-Machine Authorship Discrimination in Generative AI Verification Task with BERT and Augmented Data</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Papers of the CLEF 2024 Evaluation Labs</source>
          , CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>2536</fpage>
          -
          <lpage>2540</lpage>
          . URL: http://ceur-ws.org/Vol-3740/paper-230.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Generative AI Authorship Verification Of Tri-Sentence Analysis Base On The Bert Model</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Papers of the CLEF 2024 Evaluation Labs</source>
          , CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>2632</fpage>
          -
          <lpage>2637</lpage>
          . URL: http://ceur-ws.org/Vol-3740/paper-243.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Kong,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>A Verifying Generative Text Authorship Model With Regularized Dropout</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Papers of the CLEF 2024 Evaluation Labs</source>
          , CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>2728</fpage>
          -
          <lpage>2734</lpage>
          . URL: http://ceur-ws.org/Vol-3740/paper-257.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Halvani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Graner</surname>
          </string-name>
          ,
          <article-title>On the usefulness of compression models for authorship verification</article-title>
          , in:
          <source>Proceedings of the 12th International Conference on Availability, Reliability and Security</source>
          , ARES '17,
          Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          . URL: https://doi.org/10.1145/3098954.3104050. doi:10.1145/3098954.3104050.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwarzschild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cherepanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goldblum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geiping</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <article-title>Spotting llms with binoculars: Zero-shot detection of machine-generated text</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2401.12070. arXiv:2401.12070.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najafi</surname>
          </string-name>
          ,
          <article-title>MarSan at PAN: BinocularsLLM, fusing Binoculars' Insight with the Proficiency of Large Language Models for Machine-Generated Text Detection</article-title>
          , in: G. Faggioli, N. Ferro,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>