Generative AI Authorship Verification with Tri-Sentence Analysis Based on the BERT Model
Notebook for the PAN Lab at CLEF 2024

Jijie Huang1,*, Yang Chen1, Man Luo2 and Yonglan Li1
1 Foshan University, Foshan, China
2 Guangzhou City University of Technology, Guangzhou, China

Abstract
The task of generative AI authorship verification aims to determine whether a text is written by a human or generated by AI. In this paper, we treat this task as a binary classification problem and introduce a method called Tri-Sentence Analysis (TSA). TSA captures fine-grained contextual information, enhancing the model's ability to identify the source of a text. Additionally, we incorporate the MPU method to improve the model's effectiveness on short texts. Finally, we integrate these methods into a pre-trained BERT model. On the test set, our Minimum, 25-th Quantile, Median, 75-th Quantile, and Maximum scores are 0.883, 0.936, 0.976, 0.989, and 0.999, respectively.

Keywords
Authorship Verification, Tri-Sentence Analysis, Pre-trained Model

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
2112203035@stu.fosu.edu.cn (J. Huang); 980842599@qq.com (Y. Chen); luoman322@163.com (M. Luo); li_yonglan@163.com (Y. Li)
0000-0002-8462-3310 (J. Huang); 0009-0009-2368-3565 (Y. Chen); 0009-0004-1007-9100 (M. Luo); 0009-0008-5095-3404 (Y. Li)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In Natural Language Processing (NLP), authorship verification is a fundamental task, and generative AI authorship verification is particularly crucial. Its purpose is to distinguish between human-written texts and those generated by AI, ensuring content authenticity and reliability. Building on PAN's extensive research in authorship verification [1, 2], the PAN 2024 Generative AI Authorship Verification task [3, 4] addresses this challenge. The task is organized in collaboration with the Voight-Kampff task of the ELOQUENT lab [5] and adopts a builder-breaker style: PAN participants must develop systems that differentiate between texts generated by large language models and texts written by humans.

We treat the PAN 2024 generative AI authorship verification task as an AI text detection task, aimed at distinguishing whether a text is machine-generated or human-written. Detectors such as Binoculars [6] and Fast-DetectGPT [7] make good predictions overall, but their performance on short AI-generated texts is unsatisfactory. We aim to determine whether targeted processing of short texts can improve AI text detection. Therefore, we propose a deep learning-based text classification method called Tri-Sentence Analysis (TSA). TSA divides a text into multiple short segments, each containing three sentences, analyzes each segment independently, and then combines these analyses to make a final classification of the entire text. Specifically, we divide the texts in the official dataset into multiple short segments, each inheriting the original text's label, and extract features from them. By averaging the prediction values of these short segments, we determine whether the original text is AI-generated or human-written. To further improve short-text classification, we employ a method called MPU [8], which enhances the recognition of short texts. Finally, we use the pre-trained language model BERT [9] to complete the task and submit our model via TIRA [10].
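The segmentation step described above can be made concrete with a short sketch. The snippet below is a minimal illustration, assuming NLTK's Punkt tokenizer for sentence splitting (the paper does not specify a particular splitter) and a fixed window of three sentences; all function names are illustrative rather than part of the original system.

# Minimal sketch of the Tri-Sentence Analysis (TSA) segmentation step.
# Assumptions: NLTK sentence tokenization and non-overlapping windows of
# three sentences; the final segment may contain fewer than three.
from typing import List

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer model


def split_into_tri_sentence_segments(text: str) -> List[str]:
    """Split a document into consecutive segments of three sentences each."""
    sentences = nltk.sent_tokenize(text)
    return [
        " ".join(sentences[i : i + 3])
        for i in range(0, len(sentences), 3)
    ]


if __name__ == "__main__":
    doc = (
        "Large language models can write fluent prose. "
        "Detecting such text is increasingly important. "
        "We split every document into short segments. "
        "Each segment contains three sentences."
    )
    for seg in split_into_tri_sentence_segments(doc):
        print(seg)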
2. Background

The rapid development of artificial intelligence technology has led to significant advances in large language models (LLMs) for text generation. These models can produce well-structured and grammatically correct text and are widely used in fields such as advertising, news writing, storytelling, and code generation. However, some malicious actors misuse LLMs for harmful purposes, such as creating credible fake news or cheating, which misleads readers and has severe negative societal impacts. Distinguishing AI-generated text from human-written text to prevent such abuse has therefore become urgent.

Detection methods for text generated by large models fall into two main categories.

The first is zero-shot detection, which identifies AI-generated text by directly accessing the source model that produced it. This approach needs no training dataset; instead, it uses the source model's output logits or loss values to decide whether a text is machine-generated. Examples include the methods of Mitchell et al. [11] and Yang et al. [12]. The advantage of zero-shot detection is that it does not require large amounts of training data and can be applied directly to new text. Its disadvantage is that it depends on the performance of the source model or a proxy model: if the proxy model differs substantially from the source model, detection effectiveness may be low.

The second is based on deep neural network (DNN) classifiers, which detect human-written and AI-generated text through supervised training; the method of Guo et al. [13] is one example. The advantage of this approach is that detection performance can be improved with large amounts of training data. However, DNN-based classifiers have high data requirements and poor generalization ability [14], and the trained classifiers are vulnerable to backdoor attacks [15] and adversarial attacks [16].

3. System Overview

3.1. Data Preprocessing

Table 1: Overview of the datasets used in the PAN 2024 Generative AI Authorship Verification task.

Dataset                               Quantity   Label
human                                 1087       1
alpaca-7b                             1087       0
bigscience-bloomz-7b1                 1087       0
chavinlo-alpaca-13b                   1087       0
gemini-pro                            1087       0
gpt-3.5-turbo-0125                    1087       0
gpt-4-turbo-preview                   1087       0
meta-llama-llama-2-70b-chat-hf        1087       0
meta-llama-llama-2-7b-chat-hf         1087       0
mistralai-mistral-7b-instruct-v0.2    1087       0
mistralai-mixtral-8x7b-instruct-v0.1  1087       0
qwen-qwen1.5-72b-chat-8bit            1087       0
text-bison-002                        1087       0
vicgalle-gpt2-open-instruct-v1        1087       0

The dataset for the generative AI authorship verification task, provided by ELOQUENT and PAN participants, includes various types of texts such as news articles, Wikipedia summaries, and fan fiction, and covers real and fake news articles based on US news headlines from 2021. The dataset consists of 14 JSONL files, each containing 1,087 articles on 24 topics. One file contains human-written texts, while the other 13 files are generated by different large language models. Rows with the same ID across files cover the same topic. In total, the 14 files contain 15,218 articles, as shown in Table 1, where label 1 indicates human-written text and label 0 indicates text generated by a large language model.

We integrated the dataset provided by PAN into a new dataset named "combine," which consists of two columns: "text" and "label." The "text" column holds the content of each article, while the "label" column indicates its source, with human-written texts labeled 1 and machine-generated texts labeled 0. The combine dataset contains 15,218 articles in total. We used 80% of the data (labels 0 and 1 alike) for training and the remaining 20% for validation, resulting in 12,174 training samples and 3,044 validation samples.
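A hedged sketch of this preprocessing step is shown below. The directory layout, the JSONL field name "text", the file-name-based labeling, the stratified split, and the random seed are all assumptions made for illustration; the official PAN download may use different names and the paper does not state how the split was randomized.

# Sketch of building the "combine" dataset and the 80/20 train/validation split.
import glob
import json

import pandas as pd
from sklearn.model_selection import train_test_split

rows = []
for path in glob.glob("pan24-generative-authorship/*.jsonl"):  # assumed layout
    label = 1 if "human" in path else 0  # human-written = 1, LLM-generated = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            rows.append({"text": record["text"], "label": label})

combine = pd.DataFrame(rows)  # 15,218 rows in the official data

# 80/20 split -> 12,174 training / 3,044 validation samples
# (stratification and seed are assumptions, not stated in the paper).
train_df, val_df = train_test_split(
    combine, test_size=0.2, stratify=combine["label"], random_state=42
)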
3.2. Method

Our objective is to split long texts into multiple short texts and to determine whether the original text was generated by AI or written by a human by analyzing features extracted from each short text. To achieve this, we propose a deep learning-based text classification method called Tri-Sentence Analysis (TSA). TSA divides a long text into short texts of three sentences each and analyzes each one independently; the combined results of these analyses are used for the final classification of the entire text. This approach captures fine-grained contextual information more effectively, thereby improving classification accuracy. In addition, segmenting the text reduces the burden that long inputs place on the model, enhancing its stability when processing lengthy texts.

During the prediction phase, we use TSA to process new input texts in the same manner. Each group of three sentences is treated as a short text and fed into the trained BERT model, and a weighted average of the per-segment predictions classifies the original text. During testing, when we need to determine which of two texts is closer to being human-written, we apply the same method to obtain a prediction value for each text; the text whose prediction value is closer to the human-written class is classified as such.

We also want to enhance the model's ability to classify short texts. Tian et al. [8] proposed a loss function called MPU, which improves the recognition and differentiation of AI-generated short texts. We therefore replace the model's loss function with MPU [8] to improve classification accuracy. Our system architecture is shown in Figure 1.

Figure 1: Model skeleton of our method.
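A minimal sketch of this prediction phase is given below, reusing the segmentation helper defined earlier. The checkpoint path is hypothetical, equal segment weights stand in for the weighted average mentioned above, and mapping class index 1 to the human-written label follows the dataset labeling but is an assumption about the classifier head.

# Sketch of the prediction phase: score every three-sentence segment with a
# fine-tuned BERT classifier and average the segment scores per document.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT = "./bert-tsa-checkpoint"  # hypothetical path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT)
model.eval()


def human_score(text: str) -> float:
    """Average probability of the human-written class over all TSA segments."""
    segments = split_into_tri_sentence_segments(text)  # defined earlier
    scores = []
    for seg in segments:
        inputs = tokenizer(seg, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Assumption: class index 1 corresponds to label 1 (human-written).
        scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return sum(scores) / max(len(scores), 1)  # equal weights assumed


def more_human_of(text_a: str, text_b: str) -> int:
    """For a pairwise test case, return 0 if text_a looks more human-written, else 1."""
    return 0 if human_score(text_a) >= human_score(text_b) else 1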
4. Experiments and Results

4.1. Experimental Setting

In this work, we selected the BERT-base-uncased model, which has 12 layers, 768 hidden units, 12 attention heads, and 110M parameters. The maximum encoder length was set to 512 and the batch size to 32. We employed MPU [8] as the loss function and the Adam optimizer with a learning rate of 5e-5. Our experiments were conducted on an A800 server, and the best performance was achieved after 13 epochs of training.

4.2. Evaluation Metrics

To evaluate the performance of our model, we used the evaluation platform provided by PAN, which includes the following metrics:

• AUC: the conventional area under the ROC curve.
• c@1: rewards systems that leave complicated problems unanswered [17].
• F_0.5u: focuses on deciding same-author cases correctly [18].
• F1-score: the harmonic mean of precision and recall [19].
• Brier: the Brier score, which evaluates the accuracy of probabilistic predictions [20].
• Mean: the arithmetic mean of all the metrics above.

4.3. Results

We submitted our model to TIRA [10] for execution to obtain the final metrics. Our model, charitable-mole_v3, performed very well in the PAN 2024 Generative AI Authorship Verification task. Table 2 reports its performance across the metrics: a ROC-AUC of 0.991, Brier of 0.991, C@1 of 0.991, F1 of 0.990, F0.5u of 0.989, and a Mean of 0.990. Our model outperformed all official baselines on every metric and is competitive with the 95-th quantile of participant submissions.

Table 3 reports the mean accuracy of charitable-mole_v3 over the nine variants of the test set. Across all variants, our model's Minimum value was 0.883, its 25-th and 75-th Quantiles were 0.936 and 0.989, its Median was 0.976, and its Maximum was 0.999. These results clearly surpass all official baselines. Compared with the quantile results of the other participants, charitable-mole_v3 approaches or exceeds the 95-th quantile on most measures and exceeds the 75-th quantile on all of them, demonstrating strong competitiveness in the PAN 2024 Generative AI Authorship Verification task.

5. Conclusion

This paper details our approach to the PAN 2024 generative AI authorship verification task. We proposed a text classification method based on the pre-trained BERT model, called Tri-Sentence Analysis (TSA). TSA captures fine-grained contextual information, thereby improving the accuracy of text classification; it better models the semantic relationships and consistency between sentences and enhances the model's robustness when handling long texts. Additionally, we integrated the MPU method to improve the accuracy of distinguishing short texts. Our method performed excellently on the PAN 2024 generative AI authorship verification test set: the Minimum, 25-th Quantile, Median, 75-th Quantile, and Maximum values were 0.883, 0.936, 0.976, 0.989, and 0.999, respectively. In the future, we plan to further improve this method and explore its potential applications in a broader range of natural language processing tasks.

Table 2: The final performance of our submission on PAN 2024 (Voight-Kampff Generative AI Authorship Verification).

Approach                              ROC-AUC  Brier  C@1    F1     F0.5u  Mean
charitable-mole_v3                    0.991    0.991  0.991  0.990  0.989  0.990
Baseline Binoculars                   0.972    0.957  0.966  0.964  0.965  0.965
Baseline Fast-DetectGPT (Mistral)     0.876    0.800  0.886  0.883  0.883  0.866
Baseline PPMd                         0.795    0.798  0.754  0.753  0.749  0.770
Baseline Unmasking                    0.697    0.774  0.691  0.658  0.666  0.697
Baseline Fast-DetectGPT               0.668    0.776  0.695  0.690  0.691  0.704
95-th quantile                        0.994    0.987  0.989  0.989  0.989  0.990
75-th quantile                        0.969    0.925  0.950  0.933  0.939  0.941
Median                                0.909    0.890  0.887  0.871  0.867  0.889
25-th quantile                        0.701    0.768  0.683  0.657  0.670  0.689
Min                                   0.131    0.265  0.005  0.006  0.007  0.224

Table 3: Overview of the mean accuracy over the 9 variants of the test set.

Approach                              Minimum  25-th Quantile  Median  75-th Quantile  Max
charitable-mole_v3                    0.883    0.936           0.976   0.989           0.999
Baseline Binoculars                   0.342    0.818           0.844   0.965           0.996
Baseline Fast-DetectGPT (Mistral)     0.095    0.793           0.842   0.931           0.958
Baseline PPMd                         0.270    0.546           0.750   0.770           0.863
Baseline Unmasking                    0.250    0.662           0.696   0.697           0.762
Baseline Fast-DetectGPT               0.159    0.579           0.704   0.719           0.982
95-th quantile                        0.863    0.971           0.978   0.990           1.000
75-th quantile                        0.758    0.865           0.933   0.959           0.991
Median                                0.605    0.645           0.875   0.889           0.936
25-th quantile                        0.353    0.496           0.658   0.675           0.711
Min                                   0.015    0.038           0.231   0.244           0.252

References
[1] E. Stamatatos, M. Kestemont, K. Kredens, P. Pezik, A. Heini, J. Bevendorff, B. Stein, M. Potthast, Overview of the authorship verification task at PAN 2022, in: CEUR Workshop Proceedings, volume 3180, CEUR-WS.org, 2022, pp. 2301–2313.
[2] J. Bevendorff, I. Borrego-Obrador, M. Chinea-Ríos, M. Franco-Salvador, M. Fröbe, A. Heini, K. Kredens, M. Mayerl, P. Pęzik, M. Potthast, et al., Overview of PAN 2023: Authorship verification, multi-author writing style analysis, profiling cryptocurrency influencers, and trigger detection: Condensed lab overview, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2023, pp. 459–481.
[3] J. Bevendorff, M. Wiegmann, J. Karlgren, L. Dürlich, E. Gogoulou, A. Talman, E. Stamatatos, M. Potthast, B. Stein, Overview of the "Voight-Kampff" Generative AI Authorship Verification Task at PAN and ELOQUENT 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[4] A. A. Ayele, N. Babakov, J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, E. Stakovskii, B. Stein, M. Taulé, D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[5] J. Karlgren, L. Dürlich, E. Gogoulou, L. Guillou, J. Nivre, M. Sahlgren, A. Talman, ELOQUENT CLEF shared tasks for evaluation of generative language model quality, in: European Conference on Information Retrieval, Springer, 2024, pp. 459–465.
[6] A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, T. Goldstein, Spotting LLMs with Binoculars: Zero-shot detection of machine-generated text, arXiv preprint arXiv:2401.12070 (2024).
[7] G. Bao, Y. Zhao, Z. Teng, L. Yang, Y. Zhang, Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature, arXiv preprint arXiv:2310.05130 (2023).
[8] Y. Tian, H. Chen, X. Wang, Z. Bai, Q. Zhang, R. Li, C. Xu, Y. Wang, Multiscale positive-unlabeled detection of AI-generated texts, arXiv preprint arXiv:2305.18149 (2023).
[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[10] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.
[11] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, DetectGPT: Zero-shot machine-generated text detection using probability curvature, in: International Conference on Machine Learning, PMLR, 2023, pp. 24950–24962.
[12] X. Yang, W. Cheng, Y. Wu, L. Petzold, W. Y. Wang, H. Chen, DNA-GPT: Divergent n-gram analysis for training-free detection of GPT-generated text, arXiv preprint arXiv:2305.17359 (2023).
[13] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu, How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection, arXiv preprint arXiv:2301.07597 (2023).
[14] A. Uchendu, T. Le, K. Shu, D. Lee, Authorship attribution for neural text generation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8384–8395.
[15] F. Qi, M. Li, Y. Chen, Z. Zhang, Z. Liu, Y. Wang, M. Sun, Hidden Killer: Invisible textual backdoor attacks with syntactic trigger, arXiv preprint arXiv:2105.12400 (2021).
[16] X. He, X. Shen, Z. Chen, M. Backes, Y. Zhang, MGTBench: Benchmarking machine-generated text detection, arXiv preprint arXiv:2303.14822 (2023).
[17] A. Peñas Padilla, Á. Rodrigo Yuste, A simple measure to assess non-response, 2011.
[18] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing unmasking for short texts, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 654–659.
[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[20] G. W. Brier, Verification of forecasts expressed in terms of probability, Monthly Weather Review 78 (1950) 1–3.