<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI Text Detection Method Based on Perplexity Features with Strided Sliding Window</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xurong Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan , Guangdong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>In recent years, the application of Large Language Models (LLMs) to various Natural Language Processing (NLP) tasks has become prevalent, significantly enhancing text generation, machine translation, language understanding, and conversational systems. However, this widespread use has introduced new ethical and legal challenges, particularly the difficulty of distinguishing human-written content from AI-generated content. This paper addresses this issue by treating it as an authorship verification problem, aiming to identify whether a given text is AI-generated. We investigate the distinct characteristics of human-written and AI-generated texts and employ a strided sliding window approach based on GPT-2 to extract perplexity features. For the Voight-Kampff Generative AI Authorship Verification 2024 task, we distinguished AI text from human text by comparing their perplexity features. The results demonstrated that by leveraging the perplexity metric, which measures the unpredictability of a text, we were able to capture distinct patterns characteristic of AI-generated content.</p>
      </abstract>
      <kwd-group>
        <kwd>AI Detection</kwd>
        <kwd>Perplexity</kwd>
        <kwd>GPT-2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        With LLMs improving at breakneck speed and seeing ever wider adoption, it is becoming increasingly hard to discern whether a given text was authored by a human or by an AI[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As the developer of ChatGPT, OpenAI approaches the detection of AI-generated text as a binary classification problem: it fine-tuned detector models based on RoBERTa and GPT-2 to distinguish text generated by GPT-2 from human-written text. However, as the size of the text generation model increases, the performance of such classifiers tends to decline[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. By studying existing generative AI models, GPTZero analyzes two metrics of a text, "perplexity" and "burstiness", and is capable of detecting text generated by various AI models, including Google’s LaMDA (the model behind Bard), Facebook’s LLaMA, and OpenAI’s GPT-3 and GPT-4[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Biyang Guo collected the Human ChatGPT Comparison Corpus (HC3) and, based on this dataset, studied the differences between human-written and AI-generated texts in both Chinese and English. Analyzing the perplexity feature at both the sentence and the text level showed that ChatGPT produces relatively lower PPLs than text written by humans[6]. Lorenz Mindner[7] explored traditional and novel features to distinguish AI-generated text from human text and AI-rewritten text. Using GPT-2 to compute perplexity, they found that the perplexity of roughly 25% of AI-generated texts was significantly lower than that of nearly 50% of human texts; they also achieved good classification results with XGBoost. Other researchers have likewise used GPT-2 perplexity as a feature to distinguish human-written from AI-generated text[8, 9, 10]. Many studies assert that linguistic analysis shows humans exhibiting greater logicality, semantic coherence, and contextual understanding in language use. When expressing ideas, humans tend to minimize the quantity of information while maintaining semantic clarity and effective communication, resulting in lower entropy. In contrast, AI-generated texts often have more complex syntactic structures but lower lexical complexity. In most cases, the perplexity of AI-generated text is lower than that of human text[8, 9, 11, 12].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>
        The Generative AI Authorship Verification Task @ PAN is organized in collaboration with the
Voight-Kampff Task @ ELOQUENT Lab in a builder-breaker style: given two texts, one authored by a human
and one by a machine, pick out the human. Test data for this task is compiled from the submissions
of ELOQUENT participants and comprises multiple text genres, such as news articles, Wikipedia
intro texts, and fanfiction. Additionally, a bootstrap dataset is provided[
        <xref ref-type="bibr" rid="ref3">10, 3</xref>
        ].
      </p>
      <p>Due to the imbalance between the number of human-generated and AI-generated texts in the 2024
PAN data, we investigated the following features: the average length L, which is the average number of words
per text; the vocabulary size V, which is the number of unique words used across all responses; and the density
D, calculated as

L = (1/N) · Σ_{i=1}^{N} l_i   (1)

D = 100 · V / (L · N)   (2)

where N is the number of texts and l_i is the number of words in the i-th text. Density measures the
concentration of unique words used in the text: a higher density indicates a greater variety of different
words used within texts of the same length[6].</p>
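      <p>A minimal sketch of these three features (assuming whitespace tokenization and lowercasing, which the paper does not specify; the function name is ours):</p>

```python
def text_features(texts):
    """Compute average length L, vocabulary size V, and density D
    for a list of texts, using simple whitespace tokenization."""
    n = len(texts)
    tokenized = [t.split() for t in texts]
    total_words = sum(len(toks) for toks in tokenized)
    avg_len = total_words / n                   # L: mean number of words per text
    vocab = {w.lower() for toks in tokenized for w in toks}
    vocab_size = len(vocab)                     # V: unique words across all texts
    density = 100 * vocab_size / (avg_len * n)  # D, per equation (2)
    return avg_len, vocab_size, density

L, V, D = text_features(["the cat sat", "the dog ran fast"])
```

      <p>For example, the two texts above share only the word "the", giving L = 3.5, V = 6, and D ≈ 85.7.</p>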
      <p>The text features are shown in Table 1. The features L and V show that human-generated
texts are relatively longer and use a more extensive vocabulary. However, for more advanced large
models, these characteristics are less pronounced. Similarly, this phenomenon is prominently reflected
in the D feature. To obtain accurate results, it is necessary to use the features of entire sentences for
classification.</p>
      <p>Perplexity (PPL) is one of the most common metrics for evaluating language models. Perplexity is
defined as the exponentiated average negative log-likelihood of a sequence[13]. If we have a tokenized
sequence X = (x_0, x_1, . . . , x_t), then the perplexity of X is

PPL(X) = exp( −(1/t) · Σ_{i=1}^{t} log p_θ(x_i | x_{&lt;i}) )</p>
      <p>where log p_θ(x_i | x_{&lt;i}) is the log-likelihood of the i-th token conditioned on the preceding
tokens x_{&lt;i} according to the model. This is also equivalent to the exponentiation of the cross-entropy
between the data and the model’s predictions.</p>
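      <p>The definition can be sketched directly, independently of any particular model; here log_probs stands for the per-token log-likelihoods log p_θ(x_i | x_{&lt;i}) produced by whatever language model is used:</p>

```python
import math

def perplexity(log_probs):
    """Exponentiated average negative log-likelihood (the PPL definition)."""
    avg_nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood
    return math.exp(avg_nll)

# A model that assigns every token probability 0.25 yields PPL 4:
# the model is, on average, "as uncertain" as a uniform choice among 4 tokens.
ppl = perplexity([math.log(0.25)] * 10)
```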
      <p>We chose GPT-2 as the base model for calculating perplexity. GPT-2 is a large language model
developed by OpenAI based on the Transformer architecture. It is pre-trained in an unsupervised
manner on a large text corpus containing billions of words, enabling it to generate text that closely
resembles human language. It can handle contexts of up to 1024 tokens, allowing it to take more
context into account and thus predict the next word more accurately[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In summary, GPT-2 can provide a
more accurate assessment of perplexity.</p>
      <p>When evaluating a model’s perplexity by autoregressively factorizing a sequence and conditioning
on the entire preceding subsequence at each step, the text is always limited by the model’s context size.
The largest version of GPT-2 has a fixed context length of 1024 tokens, so we cannot compute
p_θ(x_i | x_{&lt;i}) directly when i is greater than 1024. We therefore approximate the likelihood of a
token x_i by conditioning only on a fixed number of preceding tokens rather than on the entire context.
A simple option is to break the sequence into disjoint chunks and independently sum the decomposed
log-likelihoods of each segment, but then the model has little context at most prediction steps. Evaluating
with a sliding-window strategy instead gives the model more context when making each prediction; this
is a closer approximation to the true decomposition of the sequence probability and typically yields a
more favorable score, but it requires a separate forward pass for each token in the corpus. We therefore
employ a strided sliding window, moving the context in strides of 512 tokens rather than sliding by one
token at a time. This allows computation to proceed much faster while still giving the model a large
context for each prediction. For the detailed algorithm, refer to Algorithm 1.</p>
      <p>For the task of Voight-Kampff Generative AI Authorship Verification 2024, which we addressed by
treating it as an authorship verification problem, after fully extracting the perplexity features of the two
texts we determined which is the AI text and which the human text by comparing the magnitudes of
their perplexity features: the text with the lower perplexity is taken to be AI-generated.</p>
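      <p>The strided evaluation loop can be sketched as follows (model-agnostic: score_window is a hypothetical stand-in for a GPT-2 forward pass returning one log-likelihood per token of the window, and Algorithm 1 in the paper may differ in detail):</p>

```python
import math

def strided_perplexity(tokens, score_window, max_len=1024, stride=512):
    """Perplexity over a long token sequence with a strided sliding window.

    Each forward pass sees up to `max_len` tokens of context, but only the
    tokens not already scored by a previous window contribute to the
    running negative log-likelihood.
    """
    nll, n_scored = 0.0, 0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + max_len, len(tokens))
        target_len = end - prev_end             # tokens not yet scored
        window = tokens[begin:end]
        log_probs = score_window(window)        # one log-prob per window token
        nll -= sum(log_probs[-target_len:])     # score only the new tokens
        n_scored += target_len
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(nll / n_scored)

# Example with a stub scorer assigning every token probability 1/2 (PPL ≈ 2):
# strided_perplexity(tokens, lambda w: [math.log(0.5)] * len(w))
```

      <p>With such an estimator, the verification decision reduces to computing the strided perplexity of both candidate texts and labeling the lower-perplexity one as AI-generated.</p>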
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        Following the above experimental design, the results are shown in Table 2 and Table 3[
        <xref ref-type="bibr" rid="ref3">3, 14</xref>
        ]. Across the evaluation runs, our system achieved a mean score of 0.746, with a 95-th quantile of 0.972, a 75-th quantile of 0.876, a median of 0.795, a 25-th quantile of 0.697, and a minimum of 0.668; Table 3 breaks the results down by the individual measures ROC-AUC, Brier, C@1, F1, and F0.5.
      </p>
    </sec>
    <sec id="sec-4-2">
      <title>5. Conclusion</title>
      <p>In this study, we explored the identification of AI-generated text using perplexity
features extracted by a strided sliding window based on GPT-2. We determined AI text and human
text by comparing the magnitude of the perplexity features. The results demonstrated that by leveraging
the perplexity metric, which measures the unpredictability of a text, we were able to capture distinct
patterns characteristic of AI-generated content, but the performance is poor and further improvement
is needed. In addition, our study is not without limitations. The variability in text characteristics
across different AI models suggests that our method might need further adaptation to handle new and
emerging models. Additionally, the computational intensity of the sliding-window approach, despite its
accuracy, could be a bottleneck in real-time applications. Future work should focus on optimizing the
computational efficiency of our method and exploring its adaptability to newer, more advanced LLMs.
Furthermore, integrating additional features and leveraging ensemble methods could enhance detection
accuracy and robustness.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the Natural Science Platforms and Projects of Guangdong Province Ordinary
Universities (Key Field Special Projects) (No. 2023ZDZX1023).</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[6] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu, How close is ChatGPT to human
experts? Comparison corpus, evaluation, and detection, arXiv preprint arXiv:2301.07597 (2023).</p>
      <p>[7] L. Mindner, T. Schlippe, K. Schaaf, Classification of human- and AI-generated texts: Investigating
features for ChatGPT, in: International Conference on Artificial Intelligence in Education Technology,
Springer, 2023, pp. 152–170.</p>
      <p>[8] S. Gehrmann, H. Strobelt, A. M. Rush, GLTR: Statistical detection and visualization of generated
text, arXiv preprint arXiv:1906.04043 (2019).</p>
      <p>[9] S. Mitrović, D. Andreoletti, O. Ayoub, ChatGPT or human? Detect and explain. Explaining
decisions of machine learning model for detecting short ChatGPT-generated text, arXiv preprint
arXiv:2301.13852 (2023).</p>
      <p>[10] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot,
F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances
in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes
in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.
doi:10.1007/978-3-031-28241-6_20.</p>
      <p>[11] E. Crothers, N. Japkowicz, H. L. Viktor, Machine-generated text: A comprehensive survey of threat
models and detection methods, IEEE Access (2023).</p>
      <p>[12] Y. Liu, Z. Zhang, W. Zhang, S. Yue, X. Zhao, X. Cheng, Y. Zhang, H. Hu, ArguGPT: Evaluating,
understanding and identifying argumentative essays generated by GPT models, arXiv preprint
arXiv:2304.07666 (2023).</p>
      <p>[13] Hugging Face, Perplexity of fixed-length models, 2023.
URL: https://huggingface.co/docs/transformers/en/perplexity, (2024).</p>
      <p>[14] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe,
D. Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova,
E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024:
Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking
Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot,
D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro
(Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of
the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in
Computer Science, Springer, Berlin Heidelberg New York, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uchendu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Turingbench: A benchmark environment for turing test in the age of neural text generation</article-title>
          ,
          <source>arXiv preprint arXiv:2109.13296</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Stamatatos,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the Voight-Kampff Generative AI Authorship Verification Task at PAN 2024</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] OpenAI, AI classifier,
          <year>2023</year>
          . URL: https://openai.com/, (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <source>GPTZero: An AI text detector</source>
          ,
          <year>2023</year>
          . URL: https://news.gptzero.me/thoughtful-thorough-solution-development-gptzero-x-anthology, (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>