<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI Text Detection Method Based on Perplexity Features with Strided Sliding Window</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xurong Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan , Guangdong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>In recent years, the application of Large Language Models (LLMs) to various Natural Language Processing (NLP) tasks has become prevalent, significantly enhancing text generation, machine translation, language understanding, and conversational systems. However, this widespread use has introduced new ethical and legal challenges, particularly the difficulty of distinguishing human-written content from AI-generated content. This paper addresses this issue by treating it as an authorship verification problem, aiming to identify whether a given text is AI-generated. We investigate the distinct characteristics of human-written and AI-generated texts and employ a strided sliding window approach based on GPT-2 to extract perplexity features. For the Voight-Kampff Generative AI Authorship Verification 2024 task, we distinguished AI text from human text by comparing their perplexity features. The results demonstrated that by leveraging the perplexity metric, which measures the unpredictability of a text, we were able to capture distinct patterns characteristic of AI-generated content.</p>
      </abstract>
      <kwd-group>
        <kwd>AI Detection</kwd>
        <kwd>Perplexity</kwd>
        <kwd>GPT-2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        With LLMs improving at breakneck speed and seeing ever wider adoption, it is becoming increasingly hard to discern whether a given text was authored by a human or by an AI[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As the developer of ChatGPT, OpenAI approaches the detection of AI-generated text as a binary classification problem: it fine-tuned detector models based on RoBERTa and GPT-2 to distinguish text generated by GPT-2 from human-written text. However, as the size of the text generation model increases, the performance of such classifiers tends to decline[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. By studying existing generative AI models, GPTZero analyzes two metrics of a text, "perplexity" and "burstiness", and is capable of detecting text generated by various AI models, including Google’s LaMDA (the model behind Bard), Facebook’s LLaMA, and OpenAI’s GPT-3 and GPT-4[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Biyang Guo collected the Human ChatGPT Comparison Corpus (HC3) and, based on this dataset, studied the differences between human-written and AI-generated texts in both Chinese and English. Analyzing the perplexity feature at both the sentence and the text level showed that ChatGPT produces relatively lower PPLs than text written by humans[6]. Lorenz Mindner[7] explored traditional and novel features to distinguish AI-generated text from human text and AI-rewritten text. Using GPT-2 to compute perplexity, they found that the perplexity of roughly 25% of AI-generated texts was significantly lower than that of nearly 50% of human texts; they also achieved good classification results with XGBoost. Other researchers have likewise used GPT-2 perplexity as a feature to distinguish human-written from AI-generated text[8, 9, 10]. Many studies assert that linguistic analysis shows humans exhibiting greater logicality, semantic coherence, and contextual understanding in language use. When expressing ideas, humans tend to minimize the quantity of information while maintaining semantic clarity and effective communication, resulting in lower entropy. In contrast, AI-generated texts often have more complex syntactic structures but lower lexical complexity. In most cases, the perplexity of AI-generated text is lower than that of human text[8, 9, 11, 12].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>
        The Generative AI Authorship Verification Task @ PAN is organized in collaboration with the
Voight-Kampff Task @ ELOQUENT Lab in a builder-breaker style: given two texts, one authored by a human
and one by a machine, pick out the human. Test data for this task is compiled from the submissions
of ELOQUENT participants and comprises multiple text genres, such as news articles, Wikipedia
intro texts, and fanfiction. Additionally, a bootstrap dataset is provided[
        <xref ref-type="bibr" rid="ref3">10, 3</xref>
        ].
      </p>
      <p>Due to the imbalance between the number of human-generated and AI-generated texts in the 2024
PAN data, we investigated the following features: the average length L, which is the average number of words
per text; the vocabulary size V, which is the number of unique words used across all responses; and the density
D, calculated as

L = (1/N) · Σ_{i=1}^{N} l_i   (1)

D = 100 · V / (L · N)   (2)

where N is the number of texts and l_i is the number of words in the i-th text. Density measures the
concentration of unique words used in the text: a higher density indicates a greater variety of different
words used within texts of the same length[6].</p>
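      <p>A minimal sketch of these three features (assuming whitespace tokenization and lowercasing, which the paper does not specify; the function name is ours):</p>

```python
def text_features(texts):
    """Compute average length L, vocabulary size V, and density D
    for a list of texts, using simple whitespace tokenization."""
    n = len(texts)
    tokenized = [t.split() for t in texts]
    total_words = sum(len(toks) for toks in tokenized)
    avg_len = total_words / n                   # L: mean number of words per text
    vocab = {w.lower() for toks in tokenized for w in toks}
    vocab_size = len(vocab)                     # V: unique words across all texts
    density = 100 * vocab_size / (avg_len * n)  # D, per equation (2)
    return avg_len, vocab_size, density

L, V, D = text_features(["the cat sat", "the dog ran fast"])
```

      <p>For example, the two texts above share only the word "the", giving L = 3.5, V = 6, and D ≈ 85.7.</p>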
      <p>The text features are shown in Table 1. The features L and V show that human-generated
texts are relatively longer and use a more extensive vocabulary. However, for more advanced large
models, these characteristics are less pronounced. Similarly, this phenomenon is prominently reflected
in the D feature. To obtain accurate results, it is necessary to use the features of entire sentences for
classification.</p>
      <p>Perplexity (PPL) is one of the most common metrics for evaluating language models. Perplexity is
defined as the exponentiated average negative log-likelihood of a sequence[13]. If we have a tokenized
sequence X = (x_0, x_1, . . . , x_t), then the perplexity of X is

PPL(X) = exp( −(1/t) · Σ_{i=1}^{t} log p_θ(x_i | x_{&lt;i}) )</p>
      <p>where log p_θ(x_i | x_{&lt;i}) is the log-likelihood of the i-th token conditioned on the preceding
tokens x_{&lt;i} according to the model. This is also equivalent to the exponentiation of the cross-entropy
between the data and the model’s predictions.</p>
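      <p>The definition can be sketched directly, independently of any particular model; here log_probs stands for the per-token log-likelihoods log p_θ(x_i | x_{&lt;i}) produced by whatever language model is used:</p>

```python
import math

def perplexity(log_probs):
    """Exponentiated average negative log-likelihood (the PPL definition)."""
    avg_nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood
    return math.exp(avg_nll)

# A model that assigns every token probability 0.25 yields PPL 4:
# the model is, on average, "as uncertain" as a uniform choice among 4 tokens.
ppl = perplexity([math.log(0.25)] * 10)
```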
      <p>We chose GPT-2 as the base model for calculating perplexity. GPT-2 is a large language model
developed by OpenAI based on the Transformer architecture. It is pre-trained in an unsupervised
manner on a large text corpus containing billions of words, enabling it to generate text that closely
resembles human language. It can handle contexts of up to 1024 tokens, allowing it to take more
context into account and thus predict the next word more accurately[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In summary, GPT-2 can provide a
more accurate assessment of perplexity.</p>
      <p>When evaluating a model’s perplexity by autoregressively factorizing a sequence and conditioning
on the entire preceding subsequence at each step, the text is always limited by the model’s context size.
The largest version of GPT-2 has a fixed context length of 1024 tokens, so we cannot compute
p_θ(x_i | x_{&lt;i}) directly when i is greater than 1024. We therefore approximate the likelihood of a
token x_i by conditioning only on a fixed number of preceding tokens rather than on the entire context.
A simple option is to break the sequence into disjoint chunks and independently sum the decomposed
log-likelihoods of each segment, but then the model has little context at most prediction steps. Evaluating
with a sliding-window strategy instead gives the model more context when making each prediction; this
is a closer approximation to the true decomposition of the sequence probability and typically yields a
more favorable score, but it requires a separate forward pass for each token in the corpus. We therefore
employ a strided sliding window, moving the context in strides of 512 tokens rather than sliding by one
token at a time. This allows computation to proceed much faster while still giving the model a large
context for each prediction. For the detailed algorithm, refer to Algorithm 1.</p>
      <p>For the task of Voight-Kampff Generative AI Authorship Verification 2024, which we addressed by
treating it as an authorship verification problem, after fully extracting the perplexity features of the two
texts we determined which is the AI text and which the human text by comparing the magnitudes of
their perplexity features: the text with the lower perplexity is taken to be AI-generated.</p>
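      <p>The strided evaluation loop can be sketched as follows (model-agnostic: score_window is a hypothetical stand-in for a GPT-2 forward pass returning one log-likelihood per token of the window, and Algorithm 1 in the paper may differ in detail):</p>

```python
import math

def strided_perplexity(tokens, score_window, max_len=1024, stride=512):
    """Perplexity over a long token sequence with a strided sliding window.

    Each forward pass sees up to `max_len` tokens of context, but only the
    tokens not already scored by a previous window contribute to the
    running negative log-likelihood.
    """
    nll, n_scored = 0.0, 0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + max_len, len(tokens))
        target_len = end - prev_end             # tokens not yet scored
        window = tokens[begin:end]
        log_probs = score_window(window)        # one log-prob per window token
        nll -= sum(log_probs[-target_len:])     # score only the new tokens
        n_scored += target_len
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(nll / n_scored)

# Example with a stub scorer assigning every token probability 1/2 (PPL ≈ 2):
# strided_perplexity(tokens, lambda w: [math.log(0.5)] * len(w))
```

      <p>With such an estimator, the verification decision reduces to computing the strided perplexity of both candidate texts and labeling the lower-perplexity one as AI-generated.</p>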
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        Following the above experimental design, the results are shown in Table 2 and Table 3[
        <xref ref-type="bibr" rid="ref3">3, 14</xref>
        ]. Across the evaluation runs, our system achieved a mean score of 0.746, with a 95-th quantile of 0.972, a 75-th quantile of 0.876, a median of 0.795, a 25-th quantile of 0.697, and a minimum of 0.668; Table 3 breaks the results down by the individual measures ROC-AUC, Brier, C@1, F1, and F0.5.
      </p>
    </sec>
    <sec id="sec-4-2">
      <title>5. Conclusion</title>
      <p>In this study, we explored the identification of AI-generated text using perplexity
features extracted by a strided sliding window based on GPT-2. We determined AI text and human
text by comparing the magnitude of the perplexity features. The results demonstrated that by leveraging
the perplexity metric, which measures the unpredictability of a text, we were able to capture distinct
patterns characteristic of AI-generated content, but the performance is poor and further improvement
is needed. In addition, our study is not without limitations. The variability in text characteristics
across different AI models suggests that our method might need further adaptation to handle new and
emerging models. Additionally, the computational intensity of the sliding-window approach, despite its
accuracy, could be a bottleneck in real-time applications. Future work should focus on optimizing the
computational efficiency of our method and exploring its adaptability to newer, more advanced LLMs.
Furthermore, integrating additional features and leveraging ensemble methods could enhance detection
accuracy and robustness.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the Natural Science Platforms and Projects of Guangdong Province Ordinary
Universities (Key Field Special Projects) (No. 2023ZDZX1023).</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[6] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu, How close is ChatGPT to human
experts? Comparison corpus, evaluation, and detection, arXiv preprint arXiv:2301.07597 (2023).</p>
      <p>[7] L. Mindner, T. Schlippe, K. Schaaf, Classification of human- and AI-generated texts: Investigating
features for ChatGPT, in: International Conference on Artificial Intelligence in Education Technology,
Springer, 2023, pp. 152–170.</p>
      <p>[8] S. Gehrmann, H. Strobelt, A. M. Rush, GLTR: Statistical detection and visualization of generated
text, arXiv preprint arXiv:1906.04043 (2019).</p>
      <p>[9] S. Mitrović, D. Andreoletti, O. Ayoub, ChatGPT or human? Detect and explain. Explaining
decisions of machine learning model for detecting short ChatGPT-generated text, arXiv preprint
arXiv:2301.13852 (2023).</p>
      <p>[10] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot,
F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances
in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes
in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.
doi:10.1007/978-3-031-28241-6_20.</p>
      <p>[11] E. Crothers, N. Japkowicz, H. L. Viktor, Machine-generated text: A comprehensive survey of threat
models and detection methods, IEEE Access (2023).</p>
      <p>[12] Y. Liu, Z. Zhang, W. Zhang, S. Yue, X. Zhao, X. Cheng, Y. Zhang, H. Hu, ArguGPT: Evaluating,
understanding and identifying argumentative essays generated by GPT models, arXiv preprint
arXiv:2304.07666 (2023).</p>
      <p>[13] Hugging Face, Perplexity of fixed-length models, 2023.
URL: https://huggingface.co/docs/transformers/en/perplexity, (2024).</p>
      <p>[14] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe,
D. Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova,
E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024:
Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking
Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot,
D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro
(Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of
the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in
Computer Science, Springer, Berlin Heidelberg New York, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uchendu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Turingbench: A benchmark environment for turing test in the age of neural text generation</article-title>
          ,
          <source>arXiv preprint arXiv:2109.13296</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Stamatatos,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the Voight-Kampff Generative AI Authorship Verification Task at PAN 2024</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] OpenAI, AI classifier,
          <year>2023</year>
          . URL: https://openai.com/, (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <source>GPTZero: An AI text detector</source>
          ,
          <year>2023</year>
          . URL: https://news.gptzero.me/thoughtful-thorough-solution-development-gptzero-x-anthology, (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>