=Paper=
{{Paper
|id=Vol-3740/paper-262
|storemode=property
|title=BaselineAvengers at PAN 2024: Often-Forgotten Baselines for LLM-Generated Text Detection
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-262.pdf
|volume=Vol-3740
|authors=Ludwig Lorenz,Funda Zeynep Aygüler,Ferdinand Schlatt,Nailia Mirzakhmedova
|dblpUrl=https://dblp.org/rec/conf/clef/LorenzASM24
}}
==BaselineAvengers at PAN 2024: Often-Forgotten Baselines for LLM-Generated Text Detection==
Notebook for the PAN Lab at CLEF 2024
Ludwig Lorenz¹,†, Funda Zeynep Aygüler¹,†, Ferdinand Schlatt² and Nailia Mirzakhmedova¹,†
¹ Bauhaus-Universität Weimar, Germany
² Friedrich-Schiller-Universität Jena
Abstract
The rapid advancements of Large Language Models (LLMs) make it increasingly challenging to distinguish
between human-written and machine-generated texts, which raises concerns regarding their potential misuse.
This paper describes our submission to the PAN 2024 Generative AI Authorship Verification task, which involves
identifying the human-authored text from a pair of texts, one written by a human and the other by an LLM. Our
approach is based on the assumption that LLMs use a distinct vocabulary. We propose a simple and interpretable
method using non-neural machine learning classifiers with lexical features. We evaluate several classification
models and feature sets on a validation split and find logistic regression and SVM models using tf-idf feature
vectors to be highly effective. Our submissions offer a more effective alternative to all baseline approaches while
also being more efficient and interpretable.
Keywords
Authorship verification, Logistic Regression, Tf-Idf Vectorizer
1. Introduction
With the rapid advancements of Large Language Models (LLMs), distinguishing between human-written
and machine-generated texts becomes increasingly challenging, making the need for reliable
authorship verification methods ever more pressing. The ability to distinguish between
human-written and machine-generated texts is crucial for various applications, such as plagiarism
detection [1], forensic linguistics [2], and content moderation [3]. Multiple approaches have been
proposed to address this problem, including complex feature engineering and stylometric analysis,
linguistic analysis, and machine learning-based methods [4]. However, the increasing sophistication
of LLMs poses a significant challenge to existing authorship verification methods. In response to this
challenge, PAN [5] introduced the Voight-Kampff Generative AI Authorship Verification task to test the
feasibility of distinguishing between human-written and LLM-generated texts [6].
In this paper, we present our submission to the PAN shared task, where we address the generative
authorship verification problem using non-neural machine learning classifiers based on lexical features.
Our decision to employ non-neural models is motivated by the observation that simple models are often
overlooked in recent research, despite their proven effectiveness and their ability to serve as efficient
baselines for comparison with more complex models [7]. Moreover, our emphasis on lexical features is
based on the hypothesis that LLMs use a distinct vocabulary, which may be sufficient to differentiate
between human-authored and machine-generated texts.
In our work, we experimented with three classification models and two lexical feature sets. We
found that logistic regression and SVM models using tf-idf feature vectors are highly effective for the
task. Motivated by the performance of our approach, we conducted a qualitative analysis of the most
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† These authors contributed equally.
Email: ludwig.david.lorenz@uni-weimar.de (L. Lorenz); funda.zeynep.aygueler@uni-weimar.de (F. Z. Aygüler); ferdinand.schlatt@uni-jena.de (F. Schlatt); nailia.mirzakhmedova@uni-weimar.de (N. Mirzakhmedova)
ORCID: 0009-0005-2410-9005 (L. Lorenz); 0009-0009-6160-5074 (F. Z. Aygüler); 0000-0002-6032-909X (F. Schlatt); 0000-0002-8143-1405 (N. Mirzakhmedova)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
significant lexical features to test our hypothesis that LLMs employ a distinct vocabulary. Our analysis
revealed that there is a small set of words that can indicate whether a text is written by an LLM. Overall,
our approach offers a more effective alternative to all baseline approaches while also being more efficient
and interpretable.
The remainder of this paper is structured as follows. In Section 2, we provide background information
on the PAN: Generative AI Authorship Verification task and review the related work. In Section 3, we
describe our system and the components of our submission. In Section 4, we present the results of
our submission. Section 5 provides a qualitative analysis of the most important lexical features. We
conclude with a discussion of our results in Section 6.
2. Background
Task Description The PAN: Generative AI Authorship Verification task is organized in collaboration
with the Voight-Kampff Task at the ELOQUENT Lab in a builder-breaker style. PAN participants build
systems to tell human and machine-generated texts apart, while ELOQUENT participants investigate
novel text generation and obfuscation methods to avoid detection. The task is defined as follows:
Given two texts, one authored by a human, one by a machine: pick out the human.
More formally, given a pair of texts (𝑡1 , 𝑡2 ), one of which is written by a human and the other by
an LLM, the system must output a confidence score 𝑠 ∈ [0.0, 1.0]. A score 𝑠 > 0.5 indicates that
text 𝑡1 is believed to be human-authored, while a score 𝑠 < 0.5 indicates that text 𝑡2 is believed to be
human-authored. A score of exactly 0.5 means the case is undecidable.
Dataset The task participants were provided with a training dataset of 1,359 U.S. news articles. To
ensure that the articles were human-authored, the task organizers collected the articles from Google
News, focusing on the period before the release of GPT-3.5. The articles were summarized using
GPT-4-Turbo, and the summaries were used as input for 13 downstream LLMs to generate new articles.
The dataset consists of pairs of articles, one human-authored and one LLM-generated, and is split into
training, validation, and test sets.
To further test the robustness of submissions, the task organizers provided additional test datasets,
each applying a different obfuscation technique to the original test dataset. The obfuscation techniques
include switching the text encoding, prompting the LLMs to generate German instead of English, using
contrastive decoding, cropping the text to 35 words, etc. In total, 65 different test datasets were created
by obfuscation, with ELOQUENT providing another five.
3. System Overview
Scoring Function As follows from the task description (cf. Section 2), the generative authorship
verification task is formulated as a pairwise classification problem. Given a pair of texts (𝑡1 , 𝑡2 ), the
goal is to determine which text is human-authored. However, we approach this task as a pointwise
binary classification problem. That is, given a text 𝑡𝑖 , we aim to predict the probability 𝑃 (human|𝑡𝑖 )
that the text is human-authored.
By definition, the probability 𝑃 (human|𝑡𝑖 ) is equal to 1 − 𝑃 (LLM|𝑡𝑖 ). Given that we need to predict
the probability that 𝑡1 is human-authored while taking 𝑡2 into account, we average the probability of
the first text being written by a human and the probability of the second text not being written by an
LLM to obtain the final score 𝑠(human|𝑡1):
𝑠(human|𝑡1) = (𝑃(human|𝑡1) + 1 − 𝑃(LLM|𝑡2)) / 2    (1)
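This scoring rule can be sketched as follows, assuming a fitted scikit-learn classifier and vectorizer with a class labeled "human"; the function name, toy data, and estimator choice are illustrative, not the authors' code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def pair_score(clf, vectorizer, t1, t2):
    """Score a pair (t1, t2) as in Eq. (1): average P(human|t1) and 1 - P(LLM|t2)."""
    X = vectorizer.transform([t1, t2])
    human_col = list(clf.classes_).index("human")  # predict_proba column for "human"
    p_human = clf.predict_proba(X)[:, human_col]
    p_llm_t2 = 1.0 - p_human[1]  # with two classes, P(LLM|t2) = 1 - P(human|t2)
    return (p_human[0] + 1.0 - p_llm_t2) / 2.0

# Toy illustration (not the task data): fit on two tiny "documents".
vec = TfidfVectorizer()
X_train = vec.fit_transform(["she told reporters on monday",
                             "underscores the importance of resilience"])
clf = LogisticRegression().fit(X_train, ["human", "llm"])
score = pair_score(clf, vec, "he told us on friday",
                   "highlights the importance of commitment")
```

Note that for a binary classifier this reduces to averaging the "human" probabilities of both texts.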
Table 1
Overview of the different classifiers (rows) and features (columns) evaluated on the validation set.

Classifier                 tf-idf   Term Frequencies
Multinomial Naive Bayes    0.770    0.874
Logistic Regression        0.927    0.922
SVM                        0.932    0.925
Feature Extraction To capture the distinctive vocabulary of LLM-generated texts, we use a bag-of-
words model to represent the texts. We experiment with two feature sets: term frequencies and tf-idf
values for all tokens in the training dataset.
Classification Models We experiment with three classifiers: Multinomial Naive Bayes, logistic
regression, and a support vector machine (SVM) with a linear kernel. We test the classifiers with both
term frequencies and tf-idf values to identify the most effective model and feature combination.
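A minimal sketch of this model/feature grid, assuming scikit-learn estimators (the specific settings, such as `max_iter`, are our assumptions rather than the authors' configuration):

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Two lexical feature sets (raw term frequencies vs. tf-idf) crossed with
# three non-neural classifiers; LinearSVC is an SVM with a linear kernel.
VECTORIZERS = {"tf": CountVectorizer(), "tfidf": TfidfVectorizer()}
CLASSIFIERS = {
    "naive-bayes": MultinomialNB(),
    "logistic-regression": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
}

def build_grid():
    """Return one pipeline per (feature set, classifier) combination."""
    return {
        (f_name, c_name): make_pipeline(clone(vec), clone(clf))
        for f_name, vec in VECTORIZERS.items()
        for c_name, clf in CLASSIFIERS.items()
    }

grid = build_grid()  # 2 feature sets x 3 classifiers = 6 pipelines
```

Each pipeline can then be fit on the training texts and scored on the validation split.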
Model and Feature Selection To evaluate the performance of the different models and feature sets,
we use 100 samples from the training dataset as a validation split. The results of the validation are used
to select the most effective model and feature combination.
Table 1 shows the accuracy achieved on the validation split for each model. Overall, logistic regression
and SVM are more effective than multinomial Naive Bayes. The differences in effectiveness for different
feature sets for logistic regression and SVM are minimal. Interestingly, multinomial Naive Bayes
performs markedly better with raw term frequencies than with tf-idf values.
4. Results
Evaluation Setup The PAN: Generative AI Authorship Verification task employed the TIRA platform
[8] to ensure the reproducibility and comparability of submissions. The platform provides a standardized
environment for running submissions and evaluates the submissions using the following metrics:
• ROC-AUC: The area under the ROC (Receiver Operating Characteristic) curve
• Brier: The complement of the Brier score (mean squared loss)
• C@1: A modified accuracy score that assigns non-answers (score = 0.5) the average accuracy of
the remaining cases
• F1: The harmonic mean of precision and recall
• F0.5u: A modified F0.5 measure (precision-weighted F measure) that treats non-answers (score =
0.5) as false negatives
• The arithmetic mean of all the metrics above.
The arithmetic mean of all metrics is used to rank the submissions.
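For illustration, the C@1 measure can be sketched as follows, following its standard definition in PAN evaluations; the binary label convention used here is our assumption for the sketch:

```python
def c_at_1(scores, truths):
    """C@1: unanswered cases (score == 0.5) receive credit equal to the
    accuracy achieved on the answered cases.

    scores: predicted scores in [0, 1].
    truths: 1 if the correct score for the pair lies above 0.5, else 0
    (an illustrative convention, not the official evaluator's format).
    """
    n = len(scores)
    answered = [(s, t) for s, t in zip(scores, truths) if s != 0.5]
    n_correct = sum(1 for s, t in answered if (s > 0.5) == bool(t))
    n_unanswered = n - len(answered)
    # Unanswered cases contribute the average accuracy n_correct / n.
    return (n_correct + n_unanswered * n_correct / n) / n
```

With no non-answers, C@1 reduces to plain accuracy; abstaining on hard cases can raise it only if the answered cases are mostly correct.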
Baselines The task organizers provided official baselines for comparison, which are based on the
performance of various approaches to the task of authorship verification. The baselines include a
simple text length classifier, PPMd Compression-based Cosine [9, 10], Authorship Unmasking [11, 12],
Binoculars [13], DetectLLM LRR and NPR [14], and Fast-DetectGPT [15].
Evaluation Results Table 2 presents the evaluation results of our submissions to the task, along with
the official baselines and summary statistics of all submissions. Our best-performing submission (SVM)
outperforms all official baselines in terms of the arithmetic mean of all metrics, while the other two
submissions (Multinomial Naive Bayes and Logistic Regression) fall short of only the Binoculars baseline
on this measure (0.965 vs. 0.956 and 0.958, respectively).
Table 2
Overview of the performance of our approaches, baselines, and the summary statistics of the performance of all
submissions in the competition. We report ROC-AUC, Brier, C@1, F1, F0.5u, and their arithmetic mean.

Approach                           ROC-AUC   Brier   C@1     F1      F0.5u   Mean
naive-bayes                        0.998     0.859   0.975   0.975   0.974   0.956
logistic-regression                0.996     0.884   0.970   0.970   0.970   0.958
svm                                0.994     0.923   0.976   0.976   0.975   0.969
Baseline Binoculars                0.972     0.957   0.966   0.964   0.965   0.965
Baseline Fast-DetectGPT (Mistral)  0.876     0.800   0.886   0.883   0.883   0.866
Baseline PPMd                      0.795     0.798   0.754   0.753   0.749   0.770
Baseline Unmasking                 0.697     0.774   0.691   0.658   0.666   0.697
Baseline Fast-DetectGPT            0.668     0.776   0.695   0.690   0.691   0.704
95-th quantile                     0.995     0.986   0.988   0.988   0.989   0.989
75-th quantile                     0.971     0.925   0.954   0.935   0.942   0.945
Median                             0.911     0.889   0.887   0.869   0.867   0.889
25-th quantile                     0.714     0.771   0.683   0.657   0.670   0.697
Min                                0.131     0.265   0.005   0.006   0.007   0.224
Table 3
Overview of the performance of our approaches, baselines, and the summary statistics of the performance of all
submissions in the competition over 10 variants of the test set. We report the minimum, 25-th quantile, median,
75-th quantile, and maximum of the arithmetic mean of all metrics.

Approach                           Minimum   25-th Quantile   Median   75-th Quantile   Max
naive-bayes                        0.884     0.935            0.945    0.967            0.969
logistic-regression                0.837     0.941            0.957    0.963            0.989
svm                                0.832     0.949            0.969    0.974            0.999
Baseline Binoculars                0.342     0.818            0.844    0.965            0.996
Baseline Fast-DetectGPT (Mistral)  0.095     0.793            0.842    0.929            0.958
Baseline PPMd                      0.270     0.546            0.750    0.770            0.863
Baseline Unmasking                 0.250     0.653            0.673    0.697            0.762
Baseline Fast-DetectGPT            0.159     0.579            0.677    0.719            0.982
95-th quantile                     0.875     0.973            0.985    0.989            1.000
75-th quantile                     0.758     0.875            0.935    0.959            0.994
Median                             0.605     0.629            0.876    0.889            0.946
25-th quantile                     0.350     0.481            0.658    0.697            0.709
Min                                0.015     0.038            0.231    0.235            0.252
Table 3 summarizes the results over 10 obfuscated variants of the test dataset, reporting the distribution
of the arithmetic mean of all metrics across variants. Each variant applies one obfuscation technique to
measure the robustness of authorship verification approaches (cf. Section 2). The results show that all our
submissions are robust to the obfuscation techniques, as their performance degrades far less than that of
the baseline approaches. For example, the minimum score achieved by our best submission (SVM) is 0.832,
while the minimum score of the best baseline (Binoculars) is 0.342.
Overall, our approach demonstrates that simple and interpretable models can be highly effective
for the task of generative authorship verification. The results suggest that the distinctive vocabulary
used by LLMs can indeed be effectively captured using simple lexical features and machine learning
classifiers. Moreover, our submissions showed to be robust to obfuscation techniques, making them a
promising alternative to more complex and computationally expensive methods.
[Figure 1 omitted: bar chart of log-probability differences. LLM-leaning tokens (positive values) include: emphasizing, underscores, commitment, emphasized, emphasizes, implications, highlighting, importance, highlighted, challenges, conclusion, expressed, significant, testament, measures, highlights, resilience, reminder, headline, ensure, amidst, stating, legacy, stated, article. Human-leaning tokens (negative values) include: says, monday, tuesday, doesn, thursday, told, asked, sunday, didn, won, friday, say, wrote, really, ago, wasn, morning, afternoon, wednesday, earlier, got, things, sept, said, bit.]
Figure 1: Top 50 tokens with the largest differences in log probabilities for multinomial Naive Bayes. Positive
values indicate the probability is higher for LLM-generated texts, negative values indicate the probability is
higher for human-written texts.
5. Qualitative Analysis
In addition to the quantitative evaluation of our submissions, we conducted a qualitative analysis of the
most important lexical features identified by the models. This analysis aims to highlight key tokens
that contribute to distinguishing between human-written and LLM-generated texts.
The implementation of the multinomial Naive Bayes model allows us to extract the log probabili-
ties of each token belonging to the human-written and LLM-generated classes. By comparing these
probabilities, we can identify the tokens that contribute most to the classification decision. We use the
following equation to calculate the difference in log probabilities for each token 𝑤𝑖 in the feature set:
log_diff(𝑤𝑖 ) = log(𝑃 (𝑤𝑖 |LLM)) − log(𝑃 (𝑤𝑖 |human)) (2)
The log difference values are then sorted in descending order to identify the tokens with the largest
differences. The resulting values are interpreted as the importance of each token in distinguishing
between human-written and LLM-generated texts. Positive values indicate higher probabilities for LLM-
generated texts, while negative values indicate higher probabilities for human-written texts. Figure 1
presents the top 50 tokens with the largest differences in log probabilities for the multinomial Naive
Bayes model. Here, we observe that LLM-generated texts frequently use specific terms such as “article”,
“importance”, “emphasized”, “context”, and “despite”. These terms often relate to structured and formal
writing, which is often characteristic of LLM-generated content. On the other hand, human-written
texts show a higher probability of tokens related to everyday language and temporal expressions such
as “told”, “says”, “asked”, “wrote”, and “really”. These tokens indicate a more narrative and less formal
style typical of human writing. The frequent use of days of the week such as “Wednesday”, “Thursday”,
and “Friday” and terms like “afternoon” and “morning” in human-written texts can be attributed to their
common use in chronological events or planning. Humans often refer to specific days when recounting
events, discussing plans, or setting contexts within their narratives. This is particularly relevant in
our news articles dataset, where providing temporal context is essential for accurate and engaging
reporting. The word “told” is particularly prominent in human-written texts, as it is frequently used in
direct and indirect speech, which is also common in news articles. In contrast, LLM-generated texts
often prioritize structured content delivery and formal exposition over narrative elements, resulting in
frequent use of terms such as “emphasized”, “stating”, and “highlights”. The term “conclusion” is also
prevalent in LLM-generated texts, indicating a structured and formal writing style that often includes a
summary or final remarks, which is uncommon in human-written news articles.
[Figure 2 omitted: bar charts of coefficient magnitudes. Logistic Regression, in descending order: significant, article, importance, despite, safety, incident, expressed, emphasized, challenges, stating, actions, ensure, potential, conclusion, efforts, stated, concerns, continues, ongoing, impact. SVM, in descending order: significant, despite, importance, expressed, emphasized, stating, continues, actions, conclusion, ensure, incident, challenges, support, individuals, ongoing, stated, commitment, highlighted, article, expected.]
Figure 2: Top 20 tokens for identifying LLM-generated texts using Logistic Regression (left) and SVM (right).
The importance of each token is based on the size of the coefficients assigned to them by the trained models.
Figure 2 presents the top 20 most important tokens for identifying LLM-generated texts based on the
coefficients assigned to them by the trained logistic regression and SVM models. Tokens with larger
coefficients have a greater impact on the model’s decision function. Similarly to the Naive Bayes model,
some of the most notable tokens both in logistic regression and the SVM models include “significant”,
“article”, “importance”, “despite”, “stating” and “conclusion”. This suggests that LLM-generated texts
often contain terms that convey formality, which might be less prevalent in human-written texts. The
overlap in key tokens between the logistic regression and SVM models underlines the consistency of
these patterns in distinguishing LLM-generated texts. The frequent appearance of the word “significant”
in LLM-generated texts can be attributed to the tendency of language models to produce content that
is polished and systematic. Language models are typically trained on large datasets that include a
large amount of academic, technical, and professional writing. This extensive exposure to formal texts
influences the models to emulate this style.
Our qualitative analysis supports the hypothesis that LLMs use a distinctive vocabulary that can be
captured using lexical features. The presence of terms related to formality and structured discourse in
LLM-generated texts contrasts with the more narrative and less formal vocabulary found in human-
written texts.
6. Conclusion
In this paper, we presented our submission to the PAN: Generative AI Authorship Verification task. Our
approach is based on the assumption that LLMs use a particular vocabulary, which can be captured
using lexical features. We experiment with three classifiers and two feature sets to identify the most
effective model and feature combination. Our results show that logistic regression and SVM models
using tf-idf feature vectors are highly effective for the task. We find that our best submission outperforms
all official baselines, demonstrating that simple and interpretable models can be more effective than
complex and computationally expensive methods. Our qualitative analysis of the most important
lexical features confirms that LLM-generated texts often contain terms distinct from human-written
texts, which can be effectively captured using lexical features. The robustness of our submissions to
obfuscation techniques further highlights the effectiveness of our approach. Overall, our results offer a
more effective alternative to all baseline approaches while also being more efficient and interpretable.
Acknowledgments
This work originates from a programming assignment from the “Introduction to Natural Language
Processing” course at Bauhaus-Universität Weimar during the summer term of 2024. We would like to
thank the teaching staff who recognized the potential of our approach and encouraged us to participate
in the PAN task. Together we turned these ideas into writing.
References
[1] M. Potthast, B. Stein, A. Barrón-Cedeño, P. Rosso, An evaluation framework for plagiarism
detection, in: C.-R. Huang, D. Jurafsky (Eds.), Coling 2010: Posters, Coling 2010 Organizing
Committee, Beijing, China, 2010, pp. 997–1005. URL: https://aclanthology.org/C10-2115.
[2] V. Guillén-Nieto, D. Stein, Language as evidence: Doing forensic linguistics, Springer Nature, 2022.
[3] V. U. Gongane, M. V. Munot, A. D. Anuse, Detection and moderation of detrimental content on
social media platforms: current status and future directions, Social Network Analysis and Mining
12 (2022) 129.
[4] E. Stamatatos, M. Kestemont, K. Kredens, P. Pezik, A. Heini, J. Bevendorff, B. Stein, M. Potthast,
Overview of the Authorship Verification Task at PAN 2022, in: G. Faggioli, N. Ferro, A. Hanbury,
M. Potthast (Eds.), CLEF 2022 Labs and Workshops, Notebook Papers, volume 3180 of CEUR
Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3180/paper-184.pdf.
[5] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Ko-
renčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova,
E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024:
Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking
Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot,
D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro
(Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of
the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in
Computer Science, Springer, Berlin Heidelberg New York, 2024.
[6] J. Bevendorff, M. Wiegmann, J. Karlgren, L. Dürlich, E. Gogoulou, A. Talman, E. Stamatatos,
M. Potthast, B. Stein, Overview of the “Voight-Kampff” Generative AI Authorship Verification
Task at PAN and ELOQUENT 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera
(Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR
Workshop Proceedings, CEUR-WS.org, 2024.
[7] Y.-C. Lin, S.-A. Chen, J.-J. Liu, C.-J. Lin, Linear classifier: An often-forgotten baseline for text
classification, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association
for Computational Linguistics, Toronto, Canada, 2023, pp. 1876–1888. URL: https://aclanthology.
org/2023.acl-short.160. doi:10.18653/v1/2023.acl-short.160.
[8] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot,
F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances
in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes
in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/
978-3-031-28241-6_20.
[9] D. Sculley, C. Brodley, Compression and machine learning: a new perspective on feature space
vectors, in: Data Compression Conference (DCC’06), 2006, pp. 332–341. doi:10.1109/DCC.2006.
13.
[10] O. Halvani, C. Winter, L. Graner, On the usefulness of compression models for authorship
verification, in: Proceedings of the 12th International Conference on Availability, Reliability
and Security, ARES ’17, Association for Computing Machinery, New York, NY, USA, 2017. URL:
https://doi.org/10.1145/3098954.3104050. doi:10.1145/3098954.3104050.
[11] M. Koppel, J. Schler, Authorship verification as a one-class classification problem, in: Proceedings
of the Twenty-First International Conference on Machine Learning, ICML ’04, Association for
Computing Machinery, New York, NY, USA, 2004, p. 62. URL: https://doi.org/10.1145/1015330.
1015448. doi:10.1145/1015330.1015448.
[12] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing unmasking for short texts, in:
J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota,
2019, pp. 654–659. URL: https://aclanthology.org/N19-1068. doi:10.18653/v1/N19-1068.
[13] A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, T. Gold-
stein, Spotting llms with binoculars: Zero-shot detection of machine-generated text, 2024. URL:
https://arxiv.org/abs/2401.12070. arXiv:2401.12070.
[14] J. Su, T. Y. Zhuo, D. Wang, P. Nakov, Detectllm: Leveraging log rank information for zero-shot detec-
tion of machine-generated text, 2023. URL: https://arxiv.org/abs/2306.05540. arXiv:2306.05540.
[15] G. Bao, Y. Zhao, Z. Teng, L. Yang, Y. Zhang, Fast-detectgpt: Efficient zero-shot detection of machine-
generated text via conditional probability curvature, 2024. URL: https://arxiv.org/abs/2310.05130.
arXiv:2310.05130.