<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Human or Not? Light-Weight and Interpretable Detection of AI-Generated Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maximilian Seeliger</string-name>
          <email>maximilian.seeliger@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Styll</string-name>
          <email>patrick.styll@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moritz Staudinger</string-name>
          <email>moritz.staudinger@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <email>allan.hanbury@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Wien Informatics</institution>
          ,
          <addr-line>Favoritenstraße 9-11, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Text generated by Large Language Models (LLMs) is becoming less distinguishable from its human-written counterparts. Reliable detection of the differences between the two is increasingly important to limit the spread of fake content, plagiarism, and the manipulation of public opinion. We study the binary classification problem of distinguishing human-written from AI-generated text. We propose a two-step learning algorithm. In the first step, it calculates the correlation between the rows of the binary term-document matrix (TDM) and the binary labels associated with the documents. This step runs in O(n·ℓ_max + d·n) time, where n is the number of texts, ℓ_max is the maximum text length, and d is the vocabulary size. In the second step, it uses these values to map any text to a sequence of correlations, which can be interpreted as a signal. This can be done in linear time O(ℓ), where ℓ is the size of the text. Together with other statistical measurements, this signal serves as a feature for standard machine learning algorithms. Furthermore, we give a perspective on the interpretability of our proposed approach for global and local (instance-level) explanations. Our work demonstrates that while large language models like RoBERTa remain state-of-the-art in terms of raw accuracy for AI-text identification, our interpretable and computationally efficient approach offers a competitive alternative, particularly in scenarios where interpretability is important. We evaluate our approach within the Voight-Kampff Generative AI Detection task, which is part of the PAN lab at CLEF 2025.</p>
      </abstract>
      <kwd-group>
        <kwd>AI-Generated Text</kwd>
        <kwd>Explainability</kwd>
        <kwd>Signal Processing</kwd>
        <kwd>PAN 2025</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        To demonstrate the effectiveness of our approach, we use data from the Voight-Kampff Generative AI
Detection challenge, which is part of the PAN lab at CLEF 2025 [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. The challenge is divided
into two tasks: (1) binary classification of texts as either human- or AI-generated, and (2) multi-class
classification estimating the degree of human or machine authorship in mixed-authorship texts. Each
task comes with its own dataset.
      </p>
      <p>Our contributions include:
• A novel two-step learning algorithm that transforms text into a sequence of correlation values,
interpretable as a signal.
• A collection of global and local interpretations based on the output of the learning algorithm.
• A simple approach to use hand-crafted linguistic features together with correlation signals, fed into
a standard machine learning algorithm, to achieve competitive performance for distinguishing
human-written from AI-generated text.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Main Method</title>
      <sec id="sec-2-1">
        <title>2.1. Problem Setting</title>
        <p>We formally introduce the problem setting and propose the concept of correlation signals as well as a
simple way to use them for classification. We use binary term-document matrices and the Phi-coefficient
as fundamental building blocks to obtain a word-correlation value for each word. We map the words in
a given text to their respective word-correlations and call this sequence a correlation signal.
We study the problem of distinguishing human-written from AI-generated text in a supervised binary
classification setting. Let X be the instance space, containing all possible texts, and let Y = {0, 1}
denote the binary label space, where label 0 represents human-written text and label 1 AI-generated
text. For training, we get a set of n labeled training instances {(t_i, y_i)}_{i=1}^n ⊆ X × Y and try to find a
function f : X → Y that correctly classifies unseen instances.</p>
        <p>Let T = {t_1, t_2, . . . , t_n} be the set of texts from the training data. We consider each text t as a
sequence of word tokens (w_1, w_2, . . . , w_ℓ), resulting from tokenization (cf. Section 4.3), and expand
the notation of set inclusion to allow w ∈ t to denote that the word w is contained at any position in
the text t. We define the vocabulary of T as Vocab(T) = {w | w ∈ t for some t ∈ T} and say that
d = |Vocab(T)| is the number of distinct words in the text corpus.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Correlation Signals</title>
        <p>We construct a binary term-document matrix B from T, where each row represents, for a specific word,
the inclusion relation to each text from the training dataset.</p>
        <p>Definition 1. A binary term-document matrix B ∈ {0, 1}^{d×n} indicates at position B_{i,j} whether a
word w_i ∈ Vocab(T) is contained in document t_j, for 1 ≤ i ≤ d and 1 ≤ j ≤ n:</p>
        <p>B_{i,j} = 1 if w_i ∈ t_j, and B_{i,j} = 0 otherwise.</p>
        <p>Given the i'th row B_{i,·} ∈ {0, 1}^n and the label vector y = (y_1, y_2, . . . , y_n), we are interested in
quantifying the predictive power that the occurrence of the word w_i has (i.e. which label is more likely,
after knowing that w_i occurs in the text). For this, we calculate the correlation between these two
vectors.</p>
        <p>Definition 2. The Phi-coefficient [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] (also known as Matthews correlation coefficient) is a special case
of the Pearson correlation coefficient for binary vectors. Given two binary vectors x, y ∈ {0, 1}^n, it is
defined as</p>
        <p>φ(x, y) = ( (1/n) Σ_{i=1}^n x_i y_i − x̄·ȳ ) / √( x̄(1 − x̄) · ȳ(1 − ȳ) ),</p>
        <p>where x̄ = (1/n) Σ_{i=1}^n x_i and ȳ = (1/n) Σ_{i=1}^n y_i.</p>
        <p>This leads to the definition of word-correlations. For a word w_i, represented in the i'th row of the
term-document matrix, we denote its word-correlation with the function c(w_i) = φ(B_{i,·}, y), where y
is the label vector. We further extend this notation to texts and say that text t = (w_1, w_2, . . . , w_ℓ) is
mapped to its correlation signal with</p>
        <p>c(t) = ( c(w_1), c(w_2), . . . , c(w_ℓ) ).</p>
        <p>For the given corpus T of size |T| = n with a vocabulary of size |Vocab(T)| = d, let ℓ_max be the length of
the longest text. We do preprocessing of the training corpus in O(n·ℓ_max + d·n) time. Constructing
the binary term-document matrix takes O(n·ℓ_max) time by reading through each text in O(ℓ_max) time
and updating entries in the matrix corresponding to occurring words in O(1) time. The subsequent
calculation of the Phi-coefficient for each word individually takes O(n) time and is done in cumulative
O(d·n) time. The preprocessing results in an associative data structure of size O(d) that maps each
word to its word-correlation. Given constant-time lookup in this data structure (e.g. a hash table), we only
need O(ℓ) time to construct a correlation signal for a query text t of size |t| = ℓ.</p>
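        <p>To make these two preprocessing steps concrete, the following is a minimal Python sketch, assuming a corpus of pre-tokenized texts with binary labels. The names word_correlations and correlation_signal are illustrative, not taken from our released code; instead of materializing B explicitly, the sketch accumulates per word the row sum and the overlap with y, which is all the Phi-coefficient requires.</p>
        <preformat>
import math
from collections import defaultdict

def word_correlations(texts, labels):
    """Map each word to its Phi-coefficient with the label vector.

    texts:  list of token lists (the corpus T)
    labels: list of 0/1 labels (the vector y)
    Runs in O(n * l_max + d * n) time overall.
    """
    n = len(labels)
    y_bar = sum(labels) / n
    occurs = defaultdict(int)      # row sum:  sum_j B[i, j]
    occurs_pos = defaultdict(int)  # overlap:  sum_j B[i, j] * y[j]
    for tokens, label in zip(texts, labels):
        for w in set(tokens):      # binary occurrence per document
            occurs[w] += 1
            occurs_pos[w] += label
    corr = {}
    for w, row_sum in occurs.items():
        x_bar = row_sum / n
        denom = math.sqrt(x_bar * (1 - x_bar) * y_bar * (1 - y_bar))
        corr[w] = 0.0 if denom == 0 else (occurs_pos[w] / n - x_bar * y_bar) / denom
    return corr

def correlation_signal(tokens, corr):
    """Map a tokenized text to its correlation signal in O(l) time;
    words unseen during training contribute a neutral 0."""
    return [corr.get(w, 0.0) for w in tokens]
        </preformat>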
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Classifier</title>
        <p>Given the mapping c from a text to its correlation signal, we define a classifier</p>
        <p>f_τ(t) = 1 if (1/|t|) Σ_{w ∈ t} c(w) &gt; τ, and f_τ(t) = 0 otherwise,</p>
        <p>for a given parameter τ. Intuitively, the average correlation signal acts as a soft decision boundary: if
a text contains more words that tend to appear in AI-generated texts, its average correlation will be
positive, and vice versa. The threshold τ determines the decision boundary in this latent correlation
space. In practice, the optimal decision threshold τ is chosen to minimize classification error for the
given distribution of the training data (see Figure 1).</p>
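        <p>A minimal sketch of this classifier follows, reusing the hypothetical correlation_signal helper from above; the scan over midpoints between sorted training averages is one simple way to minimize training error, not necessarily the exact search we used.</p>
        <preformat>
def average_signal(tokens, corr):
    # Mean of the correlation signal; empty texts map to 0.
    sig = [corr.get(w, 0.0) for w in tokens]
    return sum(sig) / len(sig) if sig else 0.0

def fit_threshold(texts, labels, corr):
    """Pick tau minimizing training error; only midpoints between
    neighboring sorted averages can change the decision."""
    avgs = sorted(average_signal(t, corr) for t in texts)
    candidates = [(a + b) / 2 for a, b in zip(avgs, avgs[1:])] or [0.0]
    def errors(tau):
        return sum((average_signal(t, corr) > tau) != y
                   for t, y in zip(texts, labels))
    return min(candidates, key=errors)

def classify(tokens, corr, tau):
    return 1 if average_signal(tokens, corr) > tau else 0
        </preformat>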
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Interpretability</title>
      <sec id="sec-3-1">
        <title>3.1. Correlation Signals</title>
        <p>This section gives a perspective on the interpretability of the proposed approach. Correlation signals
are based on the word-correlations assigned to each individual word. This word-level contribution
offers ways to analyze the underlying model on a global and local (instance-level) scale to explain the
final predictions.</p>
        <p>Globally, we can look at the magnitude of the correlations and see that AI models appear to avoid
specific words more (strong negative correlation, min_{w ∈ Vocab(T)} c(w) = −0.4849) than they seem to
favor specific words (positive correlation, max_{w ∈ Vocab(T)} c(w) = 0.3338). A list of tokens with the
largest/smallest correlation scores is given in Table 7 in Appendix B. Furthermore, interpreting text as a
correlation signal opens the door to more advanced analyses, such as spectral methods to investigate
global patterns and structural trends (see Appendix C).</p>
        <p>[Figure: distribution of average signal values for human-written and AI-generated texts around the
decision threshold (91.7% of human-written texts fall below it, 93.1% of AI-generated texts above it).]</p>
        <p>On the local scale, these scores can be used to explain individual instances, as the final output sum can
be traced back to the specific token-level contributions at each point in the sequence. Predictions are
constructed sequentially from the individual word-correlations in a text. This allows us to pinpoint exactly
the word or sub-sentence structure that led to either predicted class. Given an appropriate threshold τ,
we can see in Figure 2 how the model's prediction changes from one class to the other as a result of
words with an opposing word-correlation occurring.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. n-gram extension</title>
        <p>We generalize our approach to n-grams by treating them the same as simple word tokens. We calculate
an n-gram-correlation score analogous to word-correlations and build the final correlation signal as a
sequence of such n-gram-correlations.</p>
        <p>Intuitively, we can capture more nuanced language interactions from the text by using n-grams, as they
capture local contextual dependencies. However, n-grams for n &gt; 1 are sparse. There is a total of 56,987
tokens contained in the text corpus of the training data. Only 0.3% of the tokens in the validation set
are not present during training. However, about 34% of the 2-grams and 84% of the 3-grams in the
validation set have not been seen during training. This leads to poor generalization to unseen data,
while the ability to find n-gram-correlations that fit the training dataset improves with larger n. (This
effect explains the reduced performance of the corsig-2gram and corsig-3gram runs in Table 3.)</p>
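        <p>A minimal sketch of this extension, reusing the hypothetical helpers from Section 2: re-tokenizing each text into n-gram tuples makes the unigram pipeline apply unchanged.</p>
        <preformat>
def to_ngrams(tokens, n):
    """Re-tokenize a text as a sequence of n-grams (tuples of n tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Reuse of the unigram pipeline, e.g. for 2-grams:
# corr_2 = word_correlations([to_ngrams(t, 2) for t in texts], labels)
# signal = correlation_signal(to_ngrams(query_tokens, 2), corr_2)
        </preformat>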
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>
        We evaluate the performance of correlation signal classifiers. There are two main objectives in our
experiments: (1) determine the ability of our approach to generalize to new instances and (2) identify
whether correlation signals contain additional predictive information not contained in simple linguistic
measures. We evaluate our approach on the dataset provided in the PAN Lab’s Voight-Kampff
Generative AI Detection challenge [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This challenge is split into two tasks. Task 1 consists of training
and validation data for the binary classification setting presented in Section 2.1. Task 2 is a variation
with 6 classes for different human-AI collaboration schemes (cf. Table 1).
      </p>
      <p>The experiments are implemented in Python and the code is available on GitHub (https://github.com/max-seeli/steely).</p>
      <sec id="sec-4-1">
        <title>4.1. Exploratory Data Analysis</title>
        <p>For both tasks of the Voight-Kampff Generative AI Detection challenge, separate datasets are provided.
As shown in Table 1, the class distributions in the training and validation sets are relatively balanced
for task 1. In contrast, task 2 shows significant imbalances, both across individual classes and between
the training and validation splits. Specifically, in the training set, classes 3–5 together account for
less than 10% of the data. This is even more prominent in the validation set, where classes 4–5
collectively represent only 1.01% of samples. The most significant inconsistency appears in class 3:
while it comprises just 3.72% of the training data, it dominates the validation set with 51.16%. Such
inconsistencies between training and validation distributions can severely impair controlled evaluation
of model performance, as they lead to incorrect representations of the target data distribution during
training.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baselines</title>
        <p>
          We introduce a simple baseline classifier that takes several hand-crafted features into account. For
task 1, simple classification based on the respective optimal threshold τ of said features already achieves
a high performance that translates well from training to validation data (see Table 2). The features are
calculated separately for the train and validation set and then fed into any standard machine learning
algorithm (Random Forest, RF, in our case) to serve as a baseline. Additionally, we employ Facebook’s
RoBERTa base model [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] (roberta-base via Hugging Face, https://huggingface.co/FacebookAI/roberta-base) as a Language Model (LM) baseline
classifier, which has proven beneficial in previous studies [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. We fine-tune RoBERTa using a maximum
input sequence length of 500 tokens, running for three epochs on a T4 GPU provided by Google Colab.
The selected hyperparameters are based on default values and were chosen to establish a reasonable
initial baseline for comparison.
        </p>
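        <p>A minimal sketch of such a fine-tuning setup with the Hugging Face transformers Trainer is given below; the placeholder datasets and the default training arguments are illustrative assumptions, not our exact configuration.</p>
        <preformat>
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder data; the real train/validation splits go here.
train_ds = Dataset.from_dict({"text": ["example text"], "label": [0]})
val_ds = Dataset.from_dict({"text": ["example text"], "label": [1]})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

def tokenize(batch):
    # Truncate to the 500-token maximum input sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=500)

args = TrainingArguments(output_dir="roberta-baseline", num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=val_ds.map(tokenize, batched=True),
                  tokenizer=tokenizer)  # default collator pads dynamically
trainer.train()
        </preformat>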
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Data Preprocessing</title>
        <p>
          To prepare the input data for processing into correlation signals, we first use a word-tokenizer that
is sensitive to punctuation for the English language. Subsequently, we employ the Porter stemming
algorithm [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and remove English stopwords.
        </p>
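        <p>A minimal sketch of this preprocessing pipeline; NLTK is one plausible choice of tooling, and the tokenizer, stemmer, and stopword list named below are assumptions, as the exact libraries are not specified here.</p>
        <preformat>
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # punctuation-sensitive word tokenizer
nltk.download("stopwords")  # English stopword list

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def preprocess(text):
    """Tokenize, drop English stopwords, and Porter-stem the rest."""
    tokens = word_tokenize(text, language="english")
    return [stemmer.stem(w) for w in tokens if w.lower() not in stop]
        </preformat>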
        <p>For the RoBERTa baseline, we use the model-specific tokenizer and do not preprocess the inputs further.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Task 1: Binary Classification</title>
        <p>For task 1, we analyze six systems and present the evaluation metrics in Table 3. We run the statistical
baseline with the name stats and the RoBERTa baseline as roberta. The systems corsig-&lt;n&gt;gram
for n ∈ {1, 2, 3} use our main approach as presented in Section 2 as well as the extension to n-grams
from Section 3.2. Finally, the system stats-corsig is an adaptation of the statistical baseline that
uses the average correlation signal (1/|t|) Σ_{w ∈ t} c(w) for each text t ∈ T as an additional feature.
We can clearly see the negative effect n-grams with n &gt; 1 have on the discriminative power of
correlation signals, as we witness a slight decline in the performance metrics from corsig-1gram
to corsig-2gram and a significantly more pronounced drop in performance when looking at
corsig-3gram. The reason for this behaviour is the sparsity of n-grams, as explained in Section 3.2.
Furthermore, system stats-corsig displays a substantial increase over the stats baseline. This
indicates that correlation signals contain statistical information not available from simple linguistic
features. stats-corsig also shows that, combined with correlation signals, a simple statistical baseline
is sufficient for performance levels competitive with the roberta baseline.</p>
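        <p>A minimal sketch of the stats-corsig feature combination, assuming X_stats holds the hand-crafted linguistic features as a NumPy array (one row per text) and reusing the hypothetical average_signal helper from Section 2.3.</p>
        <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stats_corsig_features(texts, X_stats, corr):
    """Append the average correlation signal as one extra column."""
    avg = np.array([[average_signal(t, corr)] for t in texts])
    return np.hstack([X_stats, avg])

# clf = RandomForestClassifier(n_estimators=100, random_state=0)
# clf.fit(stats_corsig_features(train_texts, X_stats, corr), labels)
        </preformat>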
        <p>[Table 3: Evaluation metrics (including F1) for the six task 1 systems.]</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Task 2: Multi-Class Classification</title>
        <p>For task 2, it is important to note that we are no longer dealing with binary classification, but rather a
multi-class setting with six distinct classes. Consequently, our approach for creating correlation signals
via a binary label vector y and classifying the summed-up signals via f_τ, as introduced in Sections 2.2
and 2.3, no longer works. We define y = (y_1, y_2, . . . , y_n) ∈ {0, 1, 2, 3, 4, 5}^n and build the correlation
signals according to c(w_i) = φ(B_{i,·}, y). Instead of using a threshold τ for classification, we use the
RF classifier as described in Section 4.2, both with and without the normalization (1/|t|) Σ_{w ∈ t} c(w)
for each t ∈ T.</p>
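        <p>Since the binary Phi-coefficient formula does not apply directly to a multi-class label vector, one plausible reading of c(w_i) = φ(B_{i,·}, y) in this setting is the plain Pearson correlation between the binary occurrence row and the integer labels, as in this sketch.</p>
        <preformat>
import numpy as np

def multiclass_word_correlation(row, y):
    """Pearson correlation between a binary occurrence row of B and
    the integer label vector y in {0, ..., 5}."""
    return float(np.corrcoef(row, y)[0, 1])
        </preformat>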
        <p>The results of our experiments on the validation set can be seen in Table 4. The RoBERTa baseline
(roberta) clearly outperformed the RF classifiers, both with (stats-corsig) and without (stats)
the correlation signal, which only slightly outperform guessing levels.</p>
        <p>We hypothesize that the lack of performance can be attributed to an inconsistent class distribution
between the training and validation sets, as described in Section 4.1. To verify this, we combined the
original training and validation data and performed a new stratified split. The results on the new
validation set confirm our assumptions, as we receive an F1-score of 96% via the RoBERTa baseline
(roberta-strat). Additionally, we can now observe a clear performance gain when using the
correlation signal as a feature in the RF classifier (stats-corsig-strat) compared to using baseline
features alone (stats-strat). Nonetheless, the RF classifier still underperforms relative to the LM
baseline, suggesting that our feature-based approach may be less effective for multi-class classification
tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we presented a lightweight and interpretable approach for distinguishing human-written
from AI-generated text. Our method leverages the statistical correlation between individual words
and class labels, encoding texts as correlation signals that can be processed efficiently and explained
both globally and locally. We demonstrated that this signal-based representation achieves strong
performance in the binary classification setting and adds complementary value when combined with
standard statistical features.</p>
      <p>In the multi-class classification setting, we observed that correlation signals alone may not capture
the full complexity of mixed-authorship scenarios. However, they still offer predictive gains when
incorporated into classical models, provided that the data distribution is properly balanced. While
language models like RoBERTa remain state-of-the-art in terms of raw accuracy, our findings show
that interpretable, transparent, and computationally efficient methods can provide competitive
alternatives, particularly when interpretability is a key concern.</p>
      <p>In future work, we plan to introduce a relevance weight (e.g. tf-idf) for each word to calculate a weighted
correlation signal, ensuring that more significant words impact the overall signal more. When removing
stopwords, we already saw a performance improvement, which indicates that less relevant terms
primarily add noise, hindering the prediction. Future work also includes extending correlation-based
features to more fine-grained signals over richer linguistic representations (e.g., syntactic or semantic
structures), and exploring hybrid models that combine the interpretability of correlation signals with
the expressiveness of neural networks.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Declaration on Generative AI</title>
      <p>During the preparation of this work, we used ChatGPT to paraphrase and reword. After using this
service, we reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Further Results</title>
      <p>The PAN Lab’s challenge organizers evaluated the submitted models from Task 1 on additional datasets.
The test set is a previously unknown part of the original dataset, held out for competition purposes, and the
Eloquent dataset comes from a related competition in which participants are asked to generate text such
that it is indistinguishable from human text. We present the results in Tables 5 and 6.</p>
      <p>[Tables 5 and 6: Results on the test set and the Eloquent dataset.]</p>
    </sec>
    <sec id="sec-8">
      <title>B. Significant Word-Correlations</title>
      <p>[Table 7: Tokens with the largest and smallest word-correlation scores.]</p>
    </sec>
    <sec id="sec-9">
      <title>C. Spectral Analysis of Correlation-Signals</title>
      <p>Since we are looking at texts in the form of signals (see Section 2.2), we hypothesize that there are
certain structural differences between human-written and AI-generated texts that can be uncovered by
analyzing their frequency components. Specifically, let c_t(j) denote the real-valued correlation signal
of the word at position j of text t. We interpret c_t(j) as a discrete-time process, which encodes some
sort of evidence towards AI- or human-authorship. Our goal is to examine the power spectral density
(PSD) for a text t via the periodogram I_t, which serves as a basic estimator for the PSD. I_t is defined
as the squared magnitude of the discrete Fourier transform of the signal,</p>
      <p>I_t(k) = (1/ℓ) | Σ_{j=0}^{ℓ−1} c_t(j) e^{−i 2π k j / ℓ} |²,</p>
      <p>where k ∈ {0, 1, . . . , ℓ − 1} is the discrete frequency index and ℓ is the length of document t.</p>
      <p>
        To conduct spectral analysis, we will use Welch’s method, which segments the signal into overlapping
windows, applies a tapering function, and finally averages the resulting periodograms. This method,
however, assumes stationarity of the signal, which means that the mean and variance do not change
over time; this is non-trivial for natural language. Similarly to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we applied the Augmented
Dickey-Fuller (ADF) test [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to examine this property. Our null hypothesis H_0 of the ADF test is non-stationarity,
meaning that p &lt; .05 test results would reject H_0 and hence accept the alternative hypothesis H_1 of
stationarity in the signals. For the training set of task 1, we see that 99.92% of texts accept H_1, which is
also why we assume that Welch’s method can be applied to this kind of correlation signal. An example
of a resulting PSD for an AI-generated text can be seen in Figure 3.
      </p>
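      <p>A minimal sketch of this analysis using SciPy and statsmodels; the window length nperseg is an illustrative choice, not necessarily the segmentation we used.</p>
      <preformat>
import numpy as np
from scipy.signal import welch
from statsmodels.tsa.stattools import adfuller

def psd_welch(signal, nperseg=64):
    """Welch PSD estimate of a 1-D correlation signal: overlapping,
    tapered segments whose periodograms are averaged."""
    x = np.asarray(signal, dtype=float)
    return welch(x, nperseg=min(nperseg, len(x)))  # (freqs, psd)

def is_stationary(signal, alpha=0.05):
    """ADF test: p &lt; alpha rejects the non-stationarity null H_0."""
    p_value = adfuller(np.asarray(signal, dtype=float))[1]
    return p_value &lt; alpha
      </preformat>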
      <p>After calculating I_t for all t ∈ T, we average the values of these periodograms within each individual
class; Figure 4 shows that there are indeed distinct differences in the mean power density spectra of the
correlation scores.</p>
      <p>For task 1, we can see that both classes have a peak in the low-frequency range, which means that
occurring patterns change slowly across the texts. In our context, this would indicate that the correlation
scores remain mostly positive or negative over many words. This aligns with our expectation that
human- and machine-authored segments typically span full sentences or paragraphs rather than just
single words.</p>
      <p>We see a similar trend in task 2. There are two large low-frequency peaks for human-initiated and
machine-continued text as well as deeply-mixed texts, suggesting that machine- and human-authored
parts are interleaved on the sentence- or paragraph-level. As expected, such a peak does not exist
for fully human-written texts. Interestingly, we can see a minor peak at higher frequencies for the
machine-written, then human-edited category. This could indicate that human editors made small local
changes, such as modifying individual words or short phrases, rather than rewriting entire segments.
Such finer-grained edits introduce higher-frequency peaks in the correlation signal.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. M. Sarvazyan, J. Ángel González, M. Franco-Salvador, F. Rangel, B. Chulvi, P. Rosso, Overview of AuTexTification at IberLEF 2023: Detection and attribution of machine-generated text in multiple domains, 2023. URL: https://arxiv.org/abs/2309.11285. arXiv:2309.11285.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. Cabanac, C. Labbé, Prevalence of nonsensical algorithmically generated papers in the scientific literature, Journal of the Association for Information Science and Technology 72 (2021) 1461-1476. doi:10.1002/asi.24495.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. D. Rodriguez, T. Hay, D. Gros, Z. Shamsi, R. Srinivasan, Cross-domain detection of GPT-2-generated technical text, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 1213-1233. URL: https://aclanthology.org/2022.naacl-main.88/. doi:10.18653/v1/2022.naacl-main.88.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or fiction: Verifying scientific claims, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7534-7550. URL: https://aclanthology.org/2020.emnlp-main.609/. doi:10.18653/v1/2020.emnlp-main.609.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. Gehrmann, H. Strobelt, A. Rush, GLTR: Statistical detection and visualization of generated text, in: M. R. Costa-jussà, E. Alfonseca (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Florence, Italy, 2019, pp. 111-116. URL: https://aclanthology.org/P19-3019/. doi:10.18653/v1/P19-3019.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, DetectGPT: Zero-shot machine-generated text detection using probability curvature, 2023. URL: https://arxiv.org/abs/2301.11305. arXiv:2301.11305.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] G. Bao, Y. Zhao, Z. Teng, L. Yang, Y. Zhang, Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature, 2024. URL: https://arxiv.org/abs/2310.05130. arXiv:2310.05130.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Y. Xu, Y. Wang, H. An, Z. Liu, Y. Li, Detecting subtle differences between human and model languages using spectrum of relative likelihood, 2024. URL: https://arxiv.org/abs/2406.19874. arXiv:2406.19874.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Z. Yang, Y. Yuan, Y. Xu, S. Zhan, H. Bai, K. Chen, FACE: Evaluating natural language generation with Fourier analysis of cross-entropy, 2023. URL: https://arxiv.org/abs/2305.10307. arXiv:2305.10307.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Bevendorff, D. Dementieva, M. Fröbe, B. Gipp, A. Greiner-Petter, J. Karlgren, M. Mayerl, P. Nakov, A. Panchenko, M. Potthast, A. Shelmanov, E. Stamatatos, B. Stein, Y. Wang, M. Wiegmann, E. Zangerle, Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection, in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2025.</mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Bevendorff, Y. Wang, J. Karlgren, M. Wiegmann, M. Fröbe, A. Tsivgun, J. Su, Z. Xie, M. Abassy, J. Mansurov, R. Xing, M. N. Ta, K. A. Elozeiri, T. Gu, R. V. Tomar, J. Geng, E. Artemova, A. Shelmanov, N. Habash, E. Stamatatos, I. Gurevych, P. Nakov, M. Potthast, B. Stein, Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2025, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.</mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236-241.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] B. W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (1975) 442-451.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. F. Porter, An algorithm for suffix stripping, Program 14 (1980) 130-137.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] D. Dickey, W. Fuller, Distribution of the estimators for autoregressive time series with a unit root, Journal of the American Statistical Association 74 (1979). doi:10.2307/2286348.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>