<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team Nexus Interrogators at PAN: Voight-Kampff Generative AI Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samiya Ali Zaidi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huzaifah Tariq Ahmed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarrah Ali Akbar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziaullah Shakeel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faisal Alvi</string-name>
          <email>faisal.alvi@sse.habib.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdul Samad</string-name>
          <email>abdul.samad@sse.habib.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dhanani School of Science and Engineering, Habib University</institution>
          ,
          <addr-line>Karachi</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The Voight-Kampff task at PAN CLEF 2025 challenges participants to detect and categorize AI-generated text in an era of increasingly human-like language models. In this work, we develop a two-stage system leveraging fine-tuned transformer architectures to tackle both binary and multi-class authorship verification. For Subtask 1, we fine-tune a bert-base-uncased model to distinguish human-written from machine-generated text, achieving near-perfect performance across genres with minimal false positives. For Subtask 2, we address severe class imbalance in multi-class collaborative authorship detection by augmenting underrepresented categories using backtranslation, synonym/antonym replacement, and random deletion. Fine-tuning a roberta-large model on this enriched dataset yields significant gains, particularly in minority classes. Our results underscore the effectiveness of combining targeted data augmentation with robust transformer-based models to capture subtle distinctions in authorship, offering a scalable foundation for detecting generative AI involvement in real-world texts.</p>
      </abstract>
      <kwd-group>
        <kwd>Voight-Kampff</kwd>
        <kwd>AI-Generated Text Detection</kwd>
        <kwd>Authorship Verification</kwd>
        <kwd>Transformer Models</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Fine-tuning</kwd>
        <kwd>PAN Lab</kwd>
        <kwd>CLEF 2025</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The increasing use of large language models (LLMs) in content creation has introduced new challenges
in distinguishing between human- and AI-generated text. Generative AI has shown remarkable
capabilities in mimicking human writing, which raises concerns related to academic integrity,
misinformation, and authorship transparency. As AI-assisted writing becomes more sophisticated, robust detection
systems are needed to identify the degree of machine involvement in written texts.</p>
      <p>
        The Voight-Kampff Generative AI Detection 2025 task [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], part of the PAN shared task series
with the ELOQUENT Lab [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], addresses this problem by evaluating detection systems across two
key subtasks. Subtask 1 focuses on binary classification of texts as either entirely human-written or
machine-generated, even in cases where the AI attempts to imitate a specific human writing style
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This tests the sensitivity and robustness of detection methods against adversarial obfuscation and
unseen model outputs.
      </p>
      <p>
        Subtask 2 extends the challenge by introducing multi-class classification of collaborative human-AI
texts, requiring systems to detect nuanced degrees of machine involvement. This includes identifying
when humans post-edit AI-generated drafts, co-write with AI models, or minimally edit
machine-generated outputs. The goal is not only to improve detection accuracy but also to understand the
spectrum of human-AI collaboration [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>To tackle these challenges, this paper explores a range of techniques, including data augmentation
strategies, fine-tuning, ensemble methods, and neural classifiers. Our system builds upon prior research
in authorship verification and leverages recent advances in supervised learning, fine-tuning, and hybrid
modeling. We focus on robustness across genres and model types, addressing both fully and partially
machine-generated content.</p>
      <p>The rest of this paper is structured as follows: Section 2 presents a review of related work,
focusing on the approaches commonly used for authorship detection. Section 3 describes our approach
to solving both subtasks. Section 4 presents our validation results. Lastly, Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Authorship verification has evolved from stylistic analysis of human writing to the detection of
AI-generated content. Recent work has leveraged both traditional machine learning and deep learning
models for this task. Fine-tuned transformer architectures such as DeBERTa [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and RoBERTa have
achieved high performance in binary classification of human vs. AI text [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], while hybrid models that
combine BERT with CNNs enhance local and contextual feature extraction [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Some systems introduce data augmentation and R-Drop regularization [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to improve robustness,
employing loss functions that combine cross-entropy and KL divergence. Ensemble learning approaches
using multiple transformer models (e.g., BERT, RoBERTa, DeBERTa) have shown further improvements
in ROC-AUC scores [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Meanwhile, instructional prompting with T5 has been explored to reframe
authorship detection as a sequence-to-sequence task [<xref ref-type="bibr" rid="ref8">8</xref>].
      </p>
      <p>Beyond transformers, research has explored lightweight classifiers with embeddings like LUAR for
low-resource scenarios [<xref ref-type="bibr" rid="ref9">9</xref>], and stylometric analysis using Graph Neural Networks (GNNs) alongside
pre-trained models [<xref ref-type="bibr" rid="ref10">10</xref>]. Approaches such as Tri-Sentence Analysis [<xref ref-type="bibr" rid="ref11">11</xref>] and hybrid models like BertT
[<xref ref-type="bibr" rid="ref12">12</xref>] demonstrate effectiveness in handling short texts and improving generalization.</p>
      <p>Despite promising results, many systems struggle with generalization to novel AI models or obfuscated
styles, highlighting the importance of continual adaptation and diverse training data in generative AI
authorship verification.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this section, we provide details about the datasets for each task, followed by our methodology for
both subtasks individually.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>The datasets for this task are provided as newline-delimited JSON files. In subtask 1’s dataset, each
entry includes an identifier, the text content, the originating model (human or specific AI model), a
label (0 for human, 1 for AI), and a genre indicator (e.g., essays, news, fiction).</p>
        <p>The dataset for subtask 2, on the other hand, comprises multi-domain documents drawn from
academic sources, journalism, and social media. The data includes a mixture of human-written and
machine-generated samples (produced by models such as GPT-4, Claude, and PaLM) and is annotated
to indicate the type of human-AI collaboration. The dataset spans multiple languages and provides
detailed labels for each collaboration category.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Subtask 1: Voight-Kampff AI Detection Sensitivity</title>
        <p>In this task, our primary objective was to accurately distinguish between human-written and
AI-generated text. This binary classification problem required a robust modeling pipeline that could
leverage the nuanced differences between the two categories. The distribution of the dataset used for
this task is illustrated in Figure 1, providing insight into the balance of the data across both classes.</p>
        <p>The first phase of our workflow involved data preprocessing. The original dataset was provided in a
.jsonl format, which is commonly used for storing structured data in a line-delimited manner. To
facilitate data handling and analysis, we first converted this .jsonl file into a Pandas DataFrame. From
this structure, we extracted only the essential fields required for our task: ‘id’, ‘text’, ‘label’.
These fields represent, respectively, the unique identifier of each sample, the content of the text, and
its associated label indicating whether the text was AI-generated or written by a human (0 means
human-written, and 1 means AI-generated).</p>
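        <p>A minimal sketch of this preprocessing step is shown below. The file name used here is illustrative rather than taken from the task materials; the field names match those described above.</p>
        <preformat>
import pandas as pd

# read the newline-delimited JSON file directly into a DataFrame
df = pd.read_json("subtask1_train.jsonl", lines=True)  # illustrative file name

# keep only the fields needed for binary classification:
# 0 = human-written, 1 = AI-generated
df = df[["id", "text", "label"]]
        </preformat>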
        <p>After isolating the relevant information, we transformed the dataset into the Hugging Face Dataset
format. This conversion optimized the data pipeline for fine-tuning pre-trained models. The Hugging
Face Dataset object also provides efficient shuffling, batching, and tokenization utilities, which are
particularly useful for handling text data at scale.</p>
        <p>With the dataset prepared, we proceeded to the model fine-tuning phase. We leveraged the Hugging
Face transformers library due to its modularity, ease of use, and strong support for state-of-the-art
pre-trained language models. We used the AutoModelForSequenceClassification interface to
load the bert-base-uncased model with two output labels (human and AI), and the AutoTokenizer
for consistent input preprocessing. We selected this variant of BERT for its proven effectiveness in
various natural language understanding tasks, particularly in text classification. The fine-tuning process
involved training the model on the labeled dataset to adapt BERT’s pretrained representations to our
specific task of authorship classification.</p>
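        <p>The following sketch illustrates the conversion to a Hugging Face Dataset together with tokenizer and model loading, assuming the DataFrame from the previous step; the maximum sequence length is an assumption rather than a value reported in this paper.</p>
        <preformat>
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# pandas DataFrame -> Hugging Face Dataset
dataset = Dataset.from_pandas(df)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # human (0) vs. AI (1)
)

def tokenize(batch):
    # truncate/pad to a fixed length (512 is an assumed value)
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, batched=True)
        </preformat>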
        <p>Training was carried out using the Trainer API, which provided integrated training and evaluation
loops, model checkpointing, and metric logging. All hyperparameters used during training, including
learning rate, batch size, and number of epochs, are detailed in Table 1. These parameters were chosen
based on standard practices for fine-tuning transformer models and adjusted to fit the computational
constraints and performance needs of our project.</p>
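        <p>A sketch of the training setup with the Trainer API follows; the numeric hyperparameters below are placeholders, since the values actually used are those listed in Table 1, and the held-out split is shown only to make the example self-contained.</p>
        <preformat>
from transformers import Trainer, TrainingArguments

splits = dataset.train_test_split(test_size=0.1, seed=42)  # illustrative split

args = TrainingArguments(
    output_dir="bert-voight-kampff",      # checkpoint directory (illustrative)
    learning_rate=2e-5,                   # placeholder; see Table 1
    per_device_train_batch_size=16,       # placeholder; see Table 1
    num_train_epochs=3,                   # placeholder; see Table 1
    eval_strategy="epoch",                # evaluate after each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,          # retain the best-performing checkpoint
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,      # micro-F1, sketched in the next listing
)
trainer.train()
        </preformat>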
        <p>We monitored performance after each epoch and retained the best-performing model. For evaluation,
we used the evaluate library to compute micro-averaged F1 scores, ensuring that performance was
balanced across both classes. At inference time, predictions were generated using the Trainer API and
analyzed via a detailed classification report, giving us insights into precision, recall, and F1 score for
both human and AI text classes. This setup ensured a reliable, reproducible training pipeline aligned
with modern standards for fine-tuning transformer-based classifiers.</p>
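        <p>The evaluation logic can be sketched as follows: a micro-averaged F1 computed with the evaluate library during training, and a per-class report at inference time (the scikit-learn classification report is an assumed implementation detail).</p>
        <preformat>
import numpy as np
import evaluate
from sklearn.metrics import classification_report

f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return f1_metric.compute(predictions=preds, references=labels, average="micro")

# after training: per-class precision, recall, and F1
pred_output = trainer.predict(splits["test"])
preds = np.argmax(pred_output.predictions, axis=-1)
print(classification_report(pred_output.label_ids, preds, target_names=["human", "AI"]))
        </preformat>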
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Subtask 2: Human-AI Collaborative Text Classification</title>
        <p>For this sub-task, our objective was to determine the extent of AI involvement in the generation of a
given piece of text. Unlike the binary classification task described earlier, this problem was framed as a
multi-class classification challenge, where each sample was categorized into one of six distinct labels
based on the degree and type of human-machine collaboration. The classification labels are as follows:
• 0: fully human-written
• 1: human-written, then machine-polished
• 2: machine-written, then machine-humanized
• 3: human-initiated, then machine-continued
• 4: deeply-mixed text, where some parts are written by a human and some are generated by a
machine
• 5: machine-written, then human-edited</p>
        <p>The distribution of samples across these six categories is visualized in Figure 2, which highlights a
substantial class imbalance in the dataset. This imbalance posed a significant challenge, particularly for
training a model capable of accurately distinguishing underrepresented categories.</p>
        <p>As with the earlier task, the dataset was initially provided in .jsonl format. To facilitate
preprocessing and further transformations, we first converted the data into a Pandas DataFrame. From the available
fields, only the text and label columns were retained, as these were essential for the classification
task.</p>
        <p>Given the imbalance in class distribution, we implemented several data augmentation techniques
targeting the three least represented classes – 3, 4, and 5. These augmentation strategies were designed
to increase the diversity and volume of examples in the minority classes, thereby helping to mitigate
the effects of class imbalance during training. The augmentation methods used include (a brief sketch of two of them follows this list):
• Backtranslation
• Synonym Replacement
• Antonym Replacement
• Random Deletion</p>
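        <p>A minimal sketch of random deletion and synonym replacement is given below; WordNet via nltk is an assumed implementation choice, and the deletion probability and replacement count are illustrative values rather than the ones used in our experiments.</p>
        <preformat>
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def random_deletion(text, p=0.1):
    # drop each token with probability p (illustrative value)
    words = text.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept) if kept else text

def synonym_replacement(text, n=2):
    # replace up to n words that have a WordNet synset with a synonym
    words = text.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        lemmas = wordnet.synsets(words[i])[0].lemma_names()
        if lemmas:
            words[i] = lemmas[0].replace("_", " ")
    return " ".join(words)
        </preformat>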
        <p>Each of these strategies was applied separately to the minority classes, after which the augmented
datasets were merged to form an enriched and more balanced training set, as depicted in Table 2.</p>
        <p>This enhanced dataset was utilized to fine-tune the state-of-the-art RoBERTa-Large model. The
large variant was chosen to effectively capture the nuances and nonlinearities present in such a complex
dataset. By training on both the original and augmented data, the model became better equipped to
generalize across all six categories of AI-human text interaction. The hyperparameters used are detailed
in Table 1, and the entire workflow is visually represented in Figure 3.</p>
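        <p>The multi-class setup differs from Subtask 1 mainly in the number of output labels; a brief sketch, with the label mapping written out for readability, is shown below. The Trainer configuration otherwise mirrors the one described for Subtask 1.</p>
        <preformat>
from transformers import AutoModelForSequenceClassification, AutoTokenizer

id2label = {
    0: "fully human-written",
    1: "human-written, then machine-polished",
    2: "machine-written, then machine-humanized",
    3: "human-initiated, then machine-continued",
    4: "deeply-mixed text",
    5: "machine-written, then human-edited",
}

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large",
    num_labels=6,
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)
        </preformat>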
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Subtask 1: Voight-Kampff AI Detection Sensitivity</title>
        <p>The exceptionally high recall for the AI class (0.9983) suggests that the detector rarely misses
machine-generated instances, even when those instances employ novel obfuscation methods. Conversely, the
slightly lower recall (0.9687) for the human-authored class points to a small proportion of false
positives, i.e., human-written texts misclassified as AI-generated, which could stem from human passages whose style resembles typical machine output.
Overall, the model's balanced precision and recall showcase its robustness and sensitivity in the face of
adversarial style-mimicking.</p>
        <p>The ROC curves and confusion matrix visualized in Figure 4 further reinforce the model’s high
discriminative ability. The curves show excellent separation between the classes, and the confusion
matrix reveals very few misclassifications, aligning with the reported metrics.</p>
        <p>The scores obtained after running the model on TIRA [<xref ref-type="bibr" rid="ref13">13</xref>] are presented in Table 4. It showcases our
model's flawless performance across all genres in the validation phase, achieving a perfect ROC-AUC
of 1.0 and consistently high scores across the C@1, F1, F0.5u, and Brier metrics, underscoring both its
discriminative power and calibration quality.</p>
        <p>Furthermore, Figure 5 provides further insight through confusion matrices for each genre on the test set. The model
demonstrates perfect recall in Essays and News (no false negatives), with only 13 and 16 false positives,
respectively, highlighting its accurate labeling of AI-generated text. In Fiction,
although a small number of misclassifications occur (28 false positives, 4 false negatives), the model still
exhibits strong performance, effectively handling the complexity of creative writing.</p>
        <p>Finally, Table 5 benchmarks our model against leading baselines on the test dataset, where it
outperforms across all major metrics—achieving the highest ROC-AUC (0.865), F1 (0.860), and mean score
(0.879), while maintaining the lowest False Positive Rate (0.131). These results confirm that the model
generalizes well and remains reliable across genres, balancing precision and recall better than all
competing approaches.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Subtask 2: Human–AI Collaborative Text Classification</title>
        <p>Subtask 2 involves a multi-class classification challenge with six distinct levels of collaboration. To
tackle the significant class imbalance, especially for Classes 3, 4, and 5, we implemented targeted data
augmentation techniques. These techniques included back-translation, antonym/synonym substitution,
and random deletion, all aimed at enhancing the representation of the underrepresented categories.</p>
        <p>We fine-tuned a RoBERTa-Large model on the augmented dataset and observed decent scores,
especially for the minority classes. Table 6 summarizes the per-class precision, recall,
F1-score, and overall performance metrics.</p>
        <p>The macro-averaged F1-score of 0.632 shows balanced performance across classes, highlighting the
success of our augmentation strategy in addressing bias toward majority classes. Classes 4 and 5, once
underrepresented, now also perform well. Class 3 has high precision (0.899) but low recall
(0.336), indicating conservative predictions potentially due to overlap with other classes. Class 1, on the
other hand, has high recall (0.935) but low precision (0.403), suggesting overprediction.</p>
        <p>To isolate the impact of each augmentation method, we additionally fine-tuned separate models
using one technique at a time. Table 7 displays the class-wise precision, recall, and F1-scores. Antonym
replacement and random deletion enhanced macro-level performance, with random deletion achieving
the highest macro F1-score of 0.590.</p>
        <p>
          To contextualize our results, we compare our model’s performance with the official PAN shared task
baseline on both the test and validation splits [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. As shown in Table 8, while our test-time performance
lags behind the baseline, our validation scores significantly exceed it, particularly in terms of macro
F1-score and recall. This suggests that our model is capable of learning from the augmented data, but
may suffer from domain shift or limited generalizability on the blind test set.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Summary of Findings</title>
        <p>Our experiments confirm that Subtask 1 can be effectively solved with standard fine-tuning of a
transformer-based model, achieving near-ceiling performance even under adversarial-style obfuscation.
In contrast, Subtask 2’s multi-way classification remains challenging due to severe class imbalance
and nuanced distinctions between collaboration levels. Data augmentation proves a viable strategy for
boosting performance on underrepresented classes, but future work should explore complementary
approaches—such as ensembling, stylometric feature fusion, or few-shot prompting with large language
models—to further enhance robustness and fine-grained discrimination.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we addressed both the binary and multi-class AI authorship detection tasks of the
Voight-Kampff challenge at the PAN Lab at CLEF 2025. For Subtask 1, our fine-tuned bert-base-uncased
model achieved an impressive accuracy of 98.77%, with robust F1 scores for
both the human and AI classes, which illustrates the model's effectiveness in binary classification.</p>
      <p>Subtask 2 posed a considerable challenge due to severe class imbalance. By applying targeted data
augmentation—specifically focused on underrepresented classes—and fine-tuning a RoBERTa-Large
model, we were able to significantly improve macro F1-score across the board. The largest gains were
observed in minority classes, particularly Class 4 and Class 5, demonstrating that balancing strategies
can effectively improve performance on rare collaboration levels without sacrificing overall accuracy.</p>
      <p>Moreover, performance on high-support classes such as Class 0 (fully human-written) and Class 2
(machine-written, then machine-humanized) remained robust, indicating that augmentation did not negatively impact the
model’s understanding of dominant patterns. However, despite these gains, Class 3 continues to show
low recall, suggesting persistent confusion in capturing intermediate collaboration levels. Future work
could explore the use of contrastive learning, ensemble techniques, or stylometric features to help better
disentangle nuanced authorial blends, especially with more powerful foundation models.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors would like to acknowledge the support provided by the Office of Research (OoR) at Habib
University, Karachi, Pakistan, for funding this project through the internal research grant IRG-2235.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors utilized GPT-4 and Grammarly for grammar and
spelling checks. After employing these tools, the authors independently reviewed and edited the content
as necessary, taking full responsibility for the final publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsivgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abassy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Ta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Elozeiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN</article-title>
          and
          <article-title>ELOQUENT 2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen, Deberta:
          <article-title>Decoding-enhanced bert with disentangled attention</article-title>
          , arXiv preprint arXiv:2006.03654 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yadagiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parween</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maurya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pakray</surname>
          </string-name>
          ,
          <article-title>Detecting ai-generated text with pre-trained models using linguistic features</article-title>
          ,
          <source>in: Proceedings of the 21st International Conference on Natural Language Processing (ICON)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>188</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          , L. Ma,
          <article-title>Bcav: a generative ai author verification model based on the integration of bert and cnn</article-title>
          , Working Notes of CLEF (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , T.-Y. Liu, et al., R-drop:
          <article-title>Regularized dropout for neural networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>10890</fpage>
          -
          <lpage>10905</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>A verifying generative text authorship model with regularized dropout</article-title>
          , Working Notes of CLEF (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Z. Lin, Y. Li, J. Huang, Voight-Kampff generative AI authorship verification based on T5, Working Notes of CLEF (2024).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Richburg, C. Bao, M. Carpuat, Automatic authorship analysis in human-AI collaborative writing, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 1845–1855.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Valdez-Valenzuela, H. Gómez-Adorno, Team iimasnlp at PAN: leveraging graph neural networks and large language models for generative AI authorship verification, Working Notes of CLEF (2024).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Huang, Y. Chen, M. Luo, Y. Li, Generative AI authorship verification of tri-sentence analysis based on the BERT model, Working Notes of CLEF (2024).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Z. Wu, W. Yang, L. Ma, Z. Zhao, BertT: a hybrid neural network model for generative AI authorship verification, Working Notes of CLEF (2024).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>