Fine-Grained Human-AI Collaborative Text Classification Using DeBERTa

Tao Li

Guo Niu

Foshan University, Foshan, Guangdong, China

2025

In PAN-CLEF 2025 Subtask 2, the Human-AI Collaborative Text Classification challenge, our objective is to categorize documents co-authored by humans and Large Language Models (LLMs). Specifically, we aim to classify texts into six distinct categories based on the nature of the human and machine contributions. We fine-tuned the pre-trained language model DeBERTa-v3-large for this classification task. Our experimental results demonstrate that this approach effectively distinguishes between the different types of texts, contributing to the understanding of human-AI collaboration and to mitigating risks associated with synthetic text.

Keywords: PAN 2025, Generated Content Analysis, Human-AI Collaborative Text Classification

1. Introduction

2. Related Work

In recent years, research on human-AI collaborative writing has steadily increased. Early work mainly focused on determining whether a text was generated by AI, such as the GPT-2 Detector developed by OpenAI [5]. However, as generation technology has advanced, simply distinguishing between "AI-generated" and "human-written" text has become insufficient, and more granular classification tasks have emerged. Some studies have attempted to introduce multimodal features (e.g., syntactic structures, sentiment tendencies) to assist classification, but due to the high cost of data annotation, most research still relies primarily on plain text input. Additionally, some scholars have proposed staged classification strategies: first determining whether a text contains AI components, then further classifying its type.

Regarding model selection, Transformer-based models (such as BERT, RoBERTa, and DeBERTa) have been widely applied to text classification tasks [6, 7, 3]. Among them, the DeBERTa series, owing to its design for contextual awareness and position modeling, has demonstrated excellent performance across various NLP tasks. DeBERTa-v3-Large [4], in particular, exhibits stronger capabilities in long-text understanding and complex semantic modeling through improved relative positional encoding and discriminative pre-training.

3. Methodology

3.1. Data Preparation and Preprocessing

We use the officially provided training, validation, and test sets, which collectively contain six types of text samples corresponding to different human-AI collaborative writing styles. Each sample includes the original text content (text) and its associated class label (label). Data preprocessing mainly involves the following steps:

• Uniform Formatting: The raw text is fed directly into the model to preserve potentially style-distinguishing linguistic features.
• Text Encoding and Alignment: The text is tokenized using the tokenizer corresponding to DeBERTa-v3-large, and all sequences are truncated or padded to a maximum length of 512 tokens to meet the model’s input requirements.
• Label Mapping: A bidirectional mapping between category names and integer IDs (id2label / label2id) is established to support multi-class classification.
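The label-mapping and length-alignment steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the alphabetical ordering of labels, and the padding ID of 0 are assumptions, and a real tokenizer would also reserve room for its special tokens when truncating.

```python
from typing import Dict, List, Tuple

MAX_LENGTH = 512  # maximum sequence length stated in the text


def build_label_maps(label_names: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build id2label / label2id from the category names found in the data.

    ASSUMPTION: IDs are assigned in alphabetical order of the names.
    """
    ordered = sorted(set(label_names))
    id2label = dict(enumerate(ordered))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def pad_or_truncate(
    token_ids: List[int], max_length: int = MAX_LENGTH, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), each exactly max_length long.

    ASSUMPTION: pad_id=0 is a placeholder; use the tokenizer's own pad ID.
    """
    kept = token_ids[:max_length]          # truncate long sequences
    n_pad = max_length - len(kept)         # number of padding positions
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask
```

For example, calling `build_label_maps` on the label column of the training set yields the two dictionaries that a classification head typically needs to translate between logits and category names.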

3.2. Model Architecture and Training Strategy

3.2.1. Model Selection

Recent advances in Natural Language Processing (NLP) have benefited significantly from progress in pre-trained language models, with DeBERTa standing out among them [3]. It has shown outstanding performance across multiple benchmark tasks. The core architectural characteristics of DeBERTa lay the foundation for its performance in complex text classification tasks. DeBERTa-v3 adopts an ELECTRA-style pre-training method, effectively improving training efficiency and performance through a generator-discriminator framework. At the same time, gradient-disentangled embedding sharing further improves the efficiency of pre-training [4]. DeBERTa not only possesses strong semantic understanding capabilities but also adapts well to multi-class text classification tasks. Moreover, its large vocabulary (128K tokens) and flexible input-output structure further expand its applicability.

DeBERTa-v3-Large is the large variant of the third-generation DeBERTa model proposed by Microsoft, with roughly 304 million backbone parameters plus about 131 million parameters in the embedding layer [4]. Its advantages include: first, a mechanism that separates content and position representations, enhancing the model’s ability to understand context; second, a hybrid of absolute and relative positional encoding, strengthening the modeling of long-range dependencies; and finally, the replacement of the Masked Language Modeling (MLM) objective with ELECTRA-style replaced token detection, which boosts pre-training efficiency.
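For reference, the separation of content and position representations can be written as the disentangled attention score from the DeBERTa paper [3]. The notation below follows that paper rather than this one: H denotes the content hidden states, P the relative position embeddings, δ(i, j) the relative distance from token i to token j, and d the hidden dimension per head.

```latex
% Content (c) and relative-position (r) projections
Q_c = H W_{q,c}, \quad K_c = H W_{k,c}, \quad V_c = H W_{v,c}, \quad
Q_r = P W_{q,r}, \quad K_r = P W_{k,r}

% Attention score between tokens i and j
\tilde{A}_{i,j} =
    \underbrace{Q^{c}_{i} {K^{c}_{j}}^{\top}}_{\text{content-to-content}}
  + \underbrace{Q^{c}_{i} {K^{r}_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
  + \underbrace{K^{c}_{j} {Q^{r}_{\delta(j,i)}}^{\top}}_{\text{position-to-content}}

% Output of the attention layer
H_o = \operatorname{softmax}\!\left( \frac{\tilde{A}}{\sqrt{3d}} \right) V_c
```

The scaling factor uses 3d because three score terms are summed instead of the single term in standard self-attention.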

3.2.2. Fine-Tuning Strategy

During training, we employed the standard full-parameter fine-tuning strategy. Specifically, we loaded the pre-trained language model and added a classification head on top of it. All parameters of the model were then updated, without freezing any layers. This strategy is suitable when the target task dataset is moderately sized and sufficient computing resources are available; both conditions hold in our case.
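A minimal sketch of this setup, assuming the Hugging Face transformers library is used (the text does not name the framework). The checkpoint ID `microsoft/deberta-v3-large` is the public Hub name of the model; the helper names `unfreeze_all` and `build_classifier` are hypothetical.

```python
from typing import Dict

MODEL_NAME = "microsoft/deberta-v3-large"  # public Hugging Face Hub checkpoint ID


def unfreeze_all(model) -> int:
    """Mark every parameter tensor as trainable (full-parameter fine-tuning).

    Returns the number of parameter tensors that were visited.
    """
    count = 0
    for param in model.parameters():
        param.requires_grad = True
        count += 1
    return count


def build_classifier(id2label: Dict[int, str]):
    """Load the pre-trained encoder with a freshly initialised classification head.

    ASSUMPTION: transformers is installed and the checkpoint can be downloaded.
    The import is lazy so that unfreeze_all can be used without the library.
    """
    from transformers import AutoModelForSequenceClassification

    label2id = {name: idx for idx, name in id2label.items()}
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(id2label),  # six classes in this task
        id2label=id2label,
        label2id=label2id,
    )
    unfreeze_all(model)  # no layer is frozen
    return model
```

Parameters loaded this way are trainable by default, so `unfreeze_all` mainly makes the "no frozen layers" choice explicit and guards against a checkpoint that was saved with frozen weights.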

Throughout training, the AdamW optimizer and a relatively small learning rate were used so that the language representations already learned by the pre-trained model would not be disrupted during fine-tuning. Weight decay was applied to prevent overfitting. The model was trained for at most three epochs. Additionally, early stopping was used: the model was evaluated on the dev set after every epoch, and training stopped once dev performance no longer improved, which further guards against overfitting. Micro F1 score was used as the primary evaluation metric to guide the optimization process. To enhance training efficiency and save GPU memory, all input texts were uniformly truncated or padded to a maximum length of 512 tokens.
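The per-epoch evaluation with early stopping on micro F1 can be sketched as follows. This is an illustrative outline, not the authors' training loop: the callbacks `train_one_epoch` and `evaluate` and the `patience` value of 1 are assumptions, since the text does not specify them.

```python
from typing import Callable, Sequence, Tuple


def micro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Micro-averaged F1 for single-label classification.

    Counts are pooled over all classes: every wrong prediction is one false
    positive (for the predicted class) and one false negative (for the true class).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    fp = fn = len(y_true) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def train_with_early_stopping(
    train_one_epoch: Callable[[int], None],
    evaluate: Callable[[int], float],
    max_epochs: int = 3,   # three epochs, as stated in the text
    patience: int = 1,     # ASSUMPTION: stop after one epoch without improvement
) -> Tuple[int, float]:
    """Train, evaluate on the dev set after each epoch, and stop early.

    Returns (best_epoch, best_score).
    """
    best_epoch, best_score, bad_epochs = -1, float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        score = evaluate(epoch)  # e.g. micro F1 on the dev set
        if score > best_score:
            best_epoch, best_score, bad_epochs = epoch, score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_score
```

In the single-label setting used here, micro F1 coincides with accuracy, because pooled false positives and false negatives are both equal to the number of misclassified samples.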

4. Experiments

4.1. Experimental Setup

We first evaluated the model on the official dev set. Evaluation metrics included macro-averaged recall, macro-averaged F1 score, and macro-averaged precision.

Finally, we evaluated the model on the official test set. Evaluation metrics included accuracy, macro-averaged recall (Macro-Recall), and macro-averaged F1 score (Macro-F1).
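The macro-averaged metrics listed above can be computed as in the following sketch, which uses the standard definitions: each metric is computed per class and then averaged with equal weight per class. The function name and the returned dictionary keys are illustrative choices, not part of the official evaluation code.

```python
from typing import Dict, Sequence


def macro_scores(
    y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]
) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true) if len(y_true) else 0.0,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }
```

Because every class contributes equally to the average, a class that the model rarely recognises lowers Macro-Recall noticeably even if it is small, which is why this metric is sensitive to the hardest categories.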

4.2. Result Analysis

In PAN-CLEF 2025 Subtask 2, macro-averaged recall is a key evaluation criterion. The dev-set results (Table 1) show that the model performs exceptionally well on the "human-written, then machine-polished" and "deeply-mixed text" categories but struggles to identify "machine-written, then machine-humanized" texts. On the test set (Table 2), DeBERTa-v3-large achieved a Macro-Recall of 56.74%, an improvement of 8.42% over the RoBERTa-base baseline on the same task, indicating its advantage in complex multi-category classification. Despite this overall improvement, the significance of the Recall metric and potential optimization paths warrant deeper exploration.

5. Conclusion

This study leverages the DeBERTa-v3-Large model to classify six types of human-AI collaborative texts. Experimental results demonstrate that the model outperforms baseline pre-trained models in terms of Recall and accuracy, and that it is particularly strong on texts with complex semantic structures and ambiguous category boundaries.

Future research directions include:

• Exploring multimodal feature fusion, such as integrating syntactic, semantic, and emotional information;
• Introducing contrastive learning or self-supervised learning strategies to enhance the model’s sensitivity to subtle differences;
• Constructing more representative datasets covering a broader range of real-world applications;
• Developing lightweight versions of the model suitable for deployment on low-resource devices.

Acknowledgments

This work is supported by the Research Projects of Ordinary Universities in Guangdong Province under Grant 2023KTSCX133 and the Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515140103.

Declaration on Generative AI

The authors declare that the Qwen3 large language model was used during the preparation of this paper for text translation and language polishing. The final responsibility for the content, accuracy, and scientific integrity of the paper lies solely with the authors.

References

[1] Fröbe, Wiegmann, Kolyada, Grahm, Elstner, Loebe, Hagen, Stein, Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236-241. URL: https://link.springer.com/chapter/10.1007/978-3-031-28241-6_20. doi:10.1007/978-3-031-28241-6_20.

[2] Bevendorff, Wang, Karlgren, Wiegmann, Fröbe, Tsivgun, Su, Xie, Abassy, Mansurov, Xing, M. N. Ta, K. A. Elozeiri, Gu, R. V. Tomar, Geng, Artemova, Shelmanov, Habash, Stamatatos, I. Gurevych, Nakov, Potthast, Stein, Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2025, in: G. Faggioli, Ferro, Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.

[3] He, Liu, Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.

[4] He, Gao, W. Chen, Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing, 2021. arXiv:2111.09543.

[5] Solaiman, Brundage, Clark, Askell, Herbert-Voss, Wu, Radford, G. Krueger, J. W. Kim, Kreps, et al., Release strategies and the social impacts of language models, arXiv preprint arXiv:1908.09203 (2019).

[6] Devlin, Chang, Lee, Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.

[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.