A Verifying Generative Text Authorship Model With Regularized Dropout

Notebook for the PAN Lab at CLEF 2024

Zijie Lin1, Zhongyuan Han1,*, Leilei Kong1, Miaoling Chen1, Shuyi Zhang1, Jiangao Peng1 and Kaiyin Sun2
1 Foshan University, Foshan, China
2 Foshan Huaying School, Foshan, China

Abstract
Generative AI authorship verification aims to identify the text authored by a human within a given pair of texts. This paper presents our method for the PAN 2024 Generative AI Authorship Verification task. We framed the task as a binary classification problem for individual texts. First, we used data augmentation to balance the originally imbalanced dataset and trained the model on single texts. We further employed the Regularized Dropout (R-Drop) method to optimize model training. At inference time, the model processes each text of a given pair individually, and a fully connected layer performs the classification; the text with the higher human-authorship score is selected as the answer. Our method achieved a mean score of 0.99 on the official test set.

Keywords
PAN 2024, Generative AI Authorship Verification, Data Augmentation, Regularized Dropout

1. Introduction

Generative AI authorship verification aims to identify the human-written text in a pair consisting of one machine-generated and one human-written text. In recent years, with the innovations that large language models such as ChatGPT [1] have brought to assisted writing, people have increasingly relied on AI for content creation. This trend is accompanied by challenges and problems, such as students submitting AI-generated assignments [2] and the use of AI to write fake articles. Verifying the author's identity can effectively curb these adverse phenomena.

This paper describes our method for the generative AI authorship verification task [3, 4, 5] at PAN 2024. The task requires identifying the human-authored text within a given text pair, using a limited dataset with an unbalanced human-to-machine ratio. We framed the task as a binary classification problem for individual texts. Our method used public datasets to augment the original data, increasing the number of human-written and machine-generated texts until they were equally represented; this data augmentation addressed the imbalance between human and machine authors in the dataset. Furthermore, we incorporated the R-Drop [6] technique during training to enhance model robustness. During inference, the model processes the two texts of each pair individually and selects the text with the higher human-authorship score as the answer.
2. Related Work

With the rapid development of large language models (LLMs), their text generation capabilities have reached a level comparable to human writing [7]. Developing effective methods to verify the authorship of generated texts is therefore crucial for mitigating the misuse of LLMs and reducing the harmful impact of their content. In recent years, numerous studies have focused on machine-generated text detection. For instance, Hans et al. [8] proposed Binoculars, which compares the scores of two closely related language models to determine whether a text is human-written or machine-generated. Bao et al. [9] introduced Fast-DetectGPT, a zero-shot detection method for machine-generated text that leverages conditional probability curvature. Although these methods require no training data and rely solely on analyzing specific textual features, they can be ineffective when the characteristics distinguishing human- and machine-generated texts are not prominent. We therefore adopted the R-Drop method to encourage consistent output distributions for samples of the same category. The core idea of R-Drop is to regularize the consistency between the outputs of two sub-models generated through dropout, thereby enhancing the model's generalization ability and robustness. Concretely, it constrains the results of two forward passes obtained by applying dropout to the same input so that they remain consistent.

3. Method

This section explains how we incorporate R-Drop to optimize the model during training. We use the pre-trained language model BERT [10]. Because we treat the task as a binary classification problem on single texts, we take the binary cross-entropy loss as the foundation and add the R-Drop term on top of it to form the final loss function used to train the model:

\mathcal{L} = \big( \mathcal{L}_{BCE}(p_1, y) + \mathcal{L}_{BCE}(p_2, y) \big) + \alpha \big( \mathcal{KL}(p_1 \| p_2) + \mathcal{KL}(p_2 \| p_1) \big)    (1)

where \alpha is a hyperparameter that controls the contribution of the KL divergence to the total loss. In this way, we account for the model's prediction accuracy while enhancing the consistency of its results across different forward passes, thereby improving stability and robustness. The loss is constructed as follows. First, we feed the input through the network twice with dropout enabled and obtain two different forward-pass outputs p_1 and p_2. Then, we compute the binary cross-entropy loss \mathcal{L}_{BCE} for each of the two outputs:

\mathcal{L}_{BCE}(p, y) = - \sum_{i} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big]    (2)

where p is the model's predicted probability distribution and y is the true label distribution. Binary cross-entropy measures the discrepancy between the true labels and the predicted distribution and is a standard loss function for binary classification. Next, we compute the Kullback-Leibler (KL) divergence between the two outputs p_1 and p_2:

\mathcal{KL}(p_1 \| p_2) = \sum_{i} p^{1}_{i} \log \frac{p^{1}_{i}}{p^{2}_{i}}    (3)

Finally, this KL divergence is added to the loss as a regularization term, so the final loss is the weighted sum of the binary cross-entropy losses and the symmetric KL-divergence term. The application of R-Drop during training is illustrated in Figure 1. We selected BERT as the baseline model.
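To make Eq. (1) concrete, the sketch below shows one way the R-Drop objective could be implemented in PyTorch for this binary setup. It is a minimal illustration under our own assumptions: the function name is hypothetical, the model is assumed to return one logit per text, and labels are encoded as 1 for human and 0 for machine as in Section 4.1; it is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def r_drop_loss(model, input_ids, attention_mask, labels, alpha=4.0):
    """Loss of Eq. (1): two stochastic forward passes, a BCE term for each
    (Eq. 2), and a symmetric KL penalty between the two outputs (Eq. 3)."""
    # With dropout active, two passes over the same batch give different outputs.
    logits1 = model(input_ids, attention_mask=attention_mask).squeeze(-1)
    logits2 = model(input_ids, attention_mask=attention_mask).squeeze(-1)

    # Binary cross-entropy against the labels (1 = human author, 0 = machine author).
    bce = F.binary_cross_entropy_with_logits(logits1, labels.float()) \
        + F.binary_cross_entropy_with_logits(logits2, labels.float())

    # Symmetric KL divergence between the two Bernoulli output distributions.
    p1 = torch.sigmoid(logits1).clamp(1e-7, 1 - 1e-7)
    p2 = torch.sigmoid(logits2).clamp(1e-7, 1 - 1e-7)
    kl = (p1 * (p1 / p2).log() + (1 - p1) * ((1 - p1) / (1 - p2)).log()
          + p2 * (p2 / p1).log() + (1 - p2) * ((1 - p2) / (1 - p1)).log()).mean()

    return bce + alpha * kl
```

Setting alpha to 4 corresponds to the KL weight reported in Section 4.2.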
We trained BERT on the training data described in Section 4.1 and optimized the model with R-Drop. During the inference phase, we first split the input text pair into two separate texts. Each text is then fed individually into the BERT model for classification. Finally, we select the text with the higher probability of being human-written as the final answer.

Figure 1: An input text passes twice through the same model's transformer blocks (with different dropout masks) and yields two distributions, P_1 and P_2. The KL divergence between P_1 and P_2 is then calculated. P_1 loss and P_2 loss denote the binary cross-entropy losses between P_1 and the label and between P_2 and the label, respectively.

4. Experiment

4.1. Data Preprocessing

In this task, we utilized two datasets. The first is the dataset provided by the organizers for the Generative AI Authorship Verification task, pan24-generative-authorship-news. The second is sourced from the Kaggle platform and is named DAIGT-V4-TRAIN-DATASET (hereinafter DAIGT-V4; available at https://www.kaggle.com/datasets/thedrcat/daigt-v4-train-dataset). The organizers' dataset covers genuine and fabricated news articles about American headlines in 2021. It comprises 14 JSONL files: one contains texts written by human authors, and the remaining 13 contain texts generated by different machine authors. DAIGT-V4 is a collection of CSV files containing texts from one human author and 11 machine authors, covering topics such as mobile phones and automobiles, with 27370 human-written and 46203 machine-generated texts. The minimum, average, and maximum text lengths of both datasets are presented in Table 1.

Table 1
Minimum, average, and maximum text length in the two datasets

Dataset                              Minimum length   Average length   Maximum length
pan24-generative-authorship-news     2                428              1389
DAIGT-V4                             2                390              1671

In pan24-generative-authorship-news, the ratio of human authors to machine authors is 1:13. To expand the data volume and balance this ratio, we used the DAIGT-V4 dataset to augment the original data. In preprocessing, we extracted 1000 human-written texts from pan24-generative-authorship-news while retaining their topics; for the machine-generated texts, we randomly selected machine authors covering the same topics. We then extracted 20000 human-written and 20000 machine-generated texts from DAIGT-V4 (a 1:1 ratio). The two sets were combined and divided into training and test sets at a ratio of 9:1. A label of 1 denotes texts written by human authors, whereas a label of 0 denotes texts generated by machine authors. All texts are truncated to the model's maximum input length.
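The following pandas sketch illustrates one plausible way to assemble and split the combined dataset described above. The file names, column names, and use of stratified sampling are assumptions for illustration only, and the topic matching of the pan24 machine texts is not shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the real corpora ship as JSONL (pan24) and CSV (DAIGT-V4).
pan_human = pd.read_json("pan24_human.jsonl", lines=True)       # the single human-author file
pan_machine = pd.read_json("pan24_machines.jsonl", lines=True)  # the 13 machine-author files, pooled
daigt = pd.read_csv("daigt_v4.csv")                             # Kaggle DAIGT-V4

# 1000 human texts and 1000 machine texts from pan24 (label 1 = human, 0 = machine).
pan_part = pd.concat([
    pan_human.sample(1000, random_state=0)[["text"]].assign(label=1),
    pan_machine.sample(1000, random_state=0)[["text"]].assign(label=0),
])

# 20000 human and 20000 machine texts from DAIGT-V4 (1:1 ratio).
daigt_part = pd.concat([
    daigt[daigt["is_human"] == 1].sample(20000, random_state=0)[["text"]].assign(label=1),
    daigt[daigt["is_human"] == 0].sample(20000, random_state=0)[["text"]].assign(label=0),
])

# Combine both sources, shuffle, and split 9:1 into training and test sets.
data = pd.concat([pan_part, daigt_part]).sample(frac=1.0, random_state=0)
train_df, test_df = train_test_split(data, test_size=0.1, stratify=data["label"], random_state=0)
```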
4.2. Experimental setting

We conducted all experiments with the PyTorch framework and the Adam optimizer. During training, the loss function was the weighted sum of the binary cross-entropy loss and the KL divergence, with a weight of 4 on the KL term. Dropout was set to 0.3, the maximum text length to 512, the batch size to 32, the learning rate to 3e-5, and the number of epochs to 10. The composition of the dataset used in the experiments is shown in Table 2.

Table 2
Composition and quantity of the dataset

Split    DAIGT-V4   pan24-generative-authorship-news   Total
train    36000      1800                               37800
test     4000       200                                4200

After splitting the dataset, we fed single texts to the model for training. We evaluated the model with the same metrics as the official PAN 2024 evaluation and used their mean as the selection criterion. The best model was obtained in the second epoch.

4.3. Other method

We also employed an ensemble-learning approach for this task. In addition to the dataset described above, we expanded the training data with the SemEval-2024 Task 8 Subtask A dataset [11]. We used three pre-trained language models: BERT-base-uncased, RoBERTa-base [12], and DeBERTa-base [13]. The training process was largely the same as the method described above. During inference, we split each text pair into two separate texts and fed both into the three models. Each model scores each text; we average the three scores to obtain the final score of a single text and select the text with the higher score as the human-written one.
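A minimal sketch of this ensemble decision rule is given below, assuming each fine-tuned checkpoint is a standard Hugging Face sequence-classification model whose class index 1 corresponds to "human-written"; the helper names and the label order are our assumptions, not the authors' code.

```python
import torch
from typing import Callable, List
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def make_score_fn(model_name: str) -> Callable[[str], float]:
    """Builds a callable that returns P(human) for a text, assuming class index 1 is 'human'."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

    def score(text: str) -> float:
        inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()

    return score

def verify_pair(text1: str, text2: str, score_fns: List[Callable[[str], float]]) -> int:
    """Averages the per-model human scores of each text and returns 1 or 2,
    the index of the text predicted to be human-written."""
    score1 = sum(fn(text1) for fn in score_fns) / len(score_fns)
    score2 = sum(fn(text2) for fn in score_fns) / len(score_fns)
    return 1 if score1 >= score2 else 2
```

In this setup, score_fns would hold one scorer per fine-tuned BERT, RoBERTa, and DeBERTa checkpoint.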
4.4. Results

This subsection presents the experimental results. Our team (Team lam in Table 3) submitted two systems: blistering-moss and acute-wireframe. Table 3 gives an overview of the accuracy of our methods and the baselines in detecting whether a text was written by a human in Task 4 at PAN 2024 (Voight-Kampff Generative AI Authorship Verification). System blistering-moss is our primary method; system acute-wireframe is the ensemble briefly introduced in Section 4.3. Compared to the baselines, our methods show clear improvements on most metrics. For instance, system blistering-moss achieves a ROC-AUC of 0.989, markedly higher than the best baseline value of 0.972 (Baseline Binoculars), and it scores 0.989 or higher on Brier, C@1, F1, F0.5u, and the mean, indicating exceptionally high classification performance. Although system acute-wireframe lags slightly behind blistering-moss, it keeps all metrics around 0.865, still surpassing most baselines, and is comparable to Baseline Fast-DetectGPT (Mistral) on the F1 and F0.5u metrics (both 0.883 for that baseline).

Table 3
Overview of the accuracy in detecting if a text is written by a human in Task 4 at PAN 2024 (Voight-Kampff Generative AI Authorship Verification). We report ROC-AUC, Brier, C@1, F1, F0.5u, and their mean.

Team       System                      ROC-AUC   Brier   C@1     F1      F0.5u   Mean
lam        blistering-moss             0.989     0.989   0.989   0.989   0.99    0.990
lam        acute-wireframe             0.865     0.865   0.865   0.866   0.865   0.866
Baseline   Binoculars                  0.972     0.957   0.966   0.964   0.965   0.965
Baseline   Fast-DetectGPT (Mistral)    0.876     0.8     0.886   0.883   0.883   0.866
Baseline   PPMd                        0.795     0.798   0.754   0.753   0.749   0.77
Baseline   Unmasking                   0.697     0.774   0.691   0.658   0.666   0.697
Baseline   Fast-DetectGPT              0.668     0.776   0.695   0.69    0.691   0.704
           95-th quantile              0.994     0.987   0.989   0.989   0.989   0.990
           75-th quantile              0.969     0.925   0.950   0.933   0.939   0.941
           Median                      0.909     0.890   0.887   0.871   0.867   0.889
           25-th quantile              0.701     0.768   0.683   0.657   0.670   0.689
           Min                         0.131     0.265   0.005   0.006   0.007   0.224

Table 4 presents the mean accuracy across nine test-set variants. System blistering-moss performs well on all statistics: its minimum is 0.764, indicating high accuracy even in the worst case, and its maximum is 0.996, indicating near-perfect accuracy in the best case. System acute-wireframe also performs well on most statistics: its minimum of 0.015 is an outlier, possibly caused by extreme conditions in certain test sets, while its maximum of 1.000 shows perfect accuracy in the best case. Notably, system acute-wireframe outperforms system blistering-moss on all test-set variants except the two German datasets. Compared with the baselines, our methods perform better on most statistics; in particular, they clearly surpass the baselines on the median, the 75-th quantile, and the maximum, highlighting their robustness and efficiency.

Table 4
Overview of the mean accuracy over 9 variants of the test set. We report the minimum, the 25-th quantile, the median, the 75-th quantile, and the maximum of the means over the 9 datasets.

Team       System              Minimum   25-th Quantile   Median   75-th Quantile   Max
lam        blistering-moss     0.764     0.832            0.961    0.989            0.996
lam        acute-wireframe     0.015     0.843            0.866    0.997            1.000
Baseline   Binoculars          0.342     0.818            0.844    0.965            0.996
Baseline   PPMd                0.270     0.546            0.750    0.770            0.863
Baseline   Unmasking           0.250     0.662            0.696    0.697            0.762
Baseline   Fast-DetectGPT      0.159     0.579            0.704    0.719            0.982
           95-th quantile      0.863     0.971            0.978    0.990            1.000
           75-th quantile      0.758     0.865            0.933    0.959            0.991
           Median              0.605     0.645            0.875    0.889            0.936
           25-th quantile      0.353     0.496            0.658    0.675            0.711
           Min                 0.015     0.038            0.231    0.244            0.252

5. Conclusion

To address the generative AI authorship verification task at PAN 2024, we propose two methods in this paper. The first trains a BERT model with data augmentation and R-Drop; the second uses an ensemble-learning voting approach for authorship verification. Combining data augmentation with R-Drop yielded promising results. Although the ensemble model's overall performance was somewhat inferior to the former, it proved more effective on certain subsets of the test data.

Acknowledgments

This work is supported by the Social Science Foundation of Guangdong Province, China (No. GD24CZY02).

References

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[2] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, DetectGPT: Zero-shot machine-generated text detection using probability curvature, in: International Conference on Machine Learning, PMLR, 2023, pp. 24950–24962.
[3] A. A. Ayele, N. Babakov, J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, E. Stakovskii, B. Stein, M. Taulé, D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[4] J. Bevendorff, M. Wiegmann, J. Karlgren, L. Dürlich, E. Gogoulou, A. Talman, E. Stamatatos, M. Potthast, B. Stein, Overview of the "Voight-Kampff" Generative AI Authorship Verification Task at PAN and ELOQUENT 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[5] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.
[6] L. Wu, J. Li, Y. Wang, Q. Meng, T. Qin, W. Chen, M. Zhang, T.-Y. Liu, et al., R-Drop: Regularized dropout for neural networks, Advances in Neural Information Processing Systems 34 (2021) 10890–10905.
[7] J. Wu, S. Yang, R. Zhan, Y. Yuan, D. Wong, L. Chao, A survey on LLM-generated text detection: Necessity, methods, and future directions (2023).
[8] A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, T. Goldstein, Spotting LLMs with Binoculars: Zero-shot detection of machine-generated text, 2024. arXiv:2401.12070.
[9] G. Bao, Y. Zhao, Z. Teng, L. Yang, Y. Zhang, Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature, in: The Twelfth International Conference on Learning Representations, 2023.
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[11] Y. Wang, J. Mansurov, P. Ivanov, J. Su, A. Shelmanov, A. Tsvigun, O. M. Afzal, T. Mahmoud, G. Puccetti, T. Arnold, et al., SemEval-2024 Task 8: Multidomain, multimodel and multilingual machine-generated text detection, arXiv preprint arXiv:2404.14183 (2024).
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.
[13] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, 2021. arXiv:2006.03654.