<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI Authorship: Analyzing Descriptive Features for AI Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christopher M. J. André</string-name>
          <email>christopher@andre.bz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helene F. L. Eriksen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emil J. Jakobsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca C. B. Mingolla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolai B. Thomsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Copenhagen Business School</institution>
          ,
          <addr-line>Frederiksberg</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>6</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>Motivated by the growing role of AI in text generation and the potential misuse of generative tools, this study investigates key features that differentiate AI-generated text from human-authored content. We produce a corpus of AI-generated counterparts to 2,100 research paper abstracts in order to compare formal linguistic and stylometric characteristics, such as perplexity, grammar, n-gram distributions, and function word frequencies, between human- and AI-generated texts. Key findings indicate that human-written abstracts tend to exhibit higher perplexity, more grammatical errors, and more diverse n-gram distributions. To distinguish between the two types of texts we employ various machine learning algorithms, with our Random Forest implementation achieving a precision of 0.986 on unseen data. Notably, feature importance analysis reveals that perplexity, grammar, and n-gram distributions are highly influential in AI-detection classification. Our research contributes a nuanced study of discriminating characteristics of AI-generated text to the increasingly important field of AI authorship attribution.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the ever-evolving landscape of artificial intelligence (AI), the advent of Large Language
Models (LLMs), exemplified by the instruction-tuned ChatGPT, has attracted significant attention
since late 2022. The remarkable capabilities of these pre-trained models enable them to generate
coherent and arguably insightful paragraphs on virtually any conceivable topic. Notably, the
recent successful completion of both a medical licensing exam and the uniform bar exam highlights
their remarkable performance at a high academic level [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Consequently, this extraordinary
performance raises important concerns regarding authorship attribution, particularly in the
education system, where students and researchers may be tempted not only to seek inspiration
from such generative tools but also to claim the generated work as their own.
      </p>
      <p>The motivation for this research stems from the growing role of AI in text generation and
its potential misuse. This paper extends current research on what defines AI-generated text
and what strategies could be useful in tackling the related challenge of authorship attribution.
In the following, we examine the differences between human-written and AI-generated text
and conduct a comparative study of text- and feature-based machine learning approaches for
detecting AI-generated text.</p>
      <p>
        Specifically, we analyze a corpus composed of 2,100 human-written abstracts and their
AI-generated counterparts, created using the titles of the original abstracts in the prompts1. To
evaluate the performance of the proposed modelling approaches, we favor precision – the
proportion of correctly predicted AI-generated abstracts among all predicted positive cases
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] – as, intuitively, the cost of a false positive is significantly higher than that of a false negative.
The choice to focus solely on abstracts of research papers allows for a focused scope and a
specialized context for the model to consider and for us to understand.
      </p>
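      <p>As a side note on the metric: the precision favored above can be sketched in a few lines of Python (a minimal illustration with toy labels, not the paper's evaluation code):</p>

```python
def precision(y_true, y_pred, positive=1):
    """Precision: correctly predicted AI abstracts / all predicted-positive cases."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

# 1 = AI-generated, 0 = human-written (toy labels, not the paper's data)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision(y_true, y_pred))  # 2 TP, 1 FP: prints 0.6666...
```

      <p>A false positive here corresponds to wrongly accusing a human author, which motivates optimizing this ratio rather than recall or accuracy.</p>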
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Gao et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] compared abstracts generated by ChatGPT to human-written
abstracts, using both blinded human reviews and a GPT-2 Output Detector from OpenAI2. The
output detector underscores the prospects of automated detection, correctly classifying 38% more AI-generated
abstracts than the human reviewers. Due to the human involvement, the study was limited to 50 abstracts of
each type, a restriction avoided in our case. Levin et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] also used ChatGPT to generate 50
abstracts, analyzing the differences with Grammarly, an online typing assistant service. They
found that ChatGPT made fewer grammatical errors and tended to use more unique tokens.
      </p>
      <p>
        Various researchers have also generated larger corpora in order to build more reliable AI
authorship attribution models. Guo et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] discovered that ChatGPT-written text tends to be
more formal than human writing, which is more colloquial (using, e.g., humor, slang,
metaphors, and antiphrasis), while Mitrovic et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] discuss the benefits of using an ML model
over a perplexity score-based classification approach.
      </p>
      <p>
        Uchendu et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explore different approaches to determining authorship attribution on
a corpus partitioned into eight subsets: one human-written and seven generated by different
AIs. Using a version of the transformer model RoBERTa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], fine-tuned on 20% of their data,
they were able to achieve near-perfect scores on GPT-2. Chen et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] also utilize a RoBERTa
model. Their research showcases a strong classification model, achieving high accuracy on
a test dataset consisting of human-written content and content rephrased by ChatGPT.
While the benefits of multi-headed attention in NLP tasks are evident, the black-box nature
of transformer models requires interpretability studies to extract and differentiate key features.
      </p>
      <p>
        Gehrmann et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] developed GLTR, a visualization tool for the identification of anomalies
in texts. The tool analyzes what GPT-2 would have predicted at each token position. Trained
on GPT-2, GLTR reports on any subject in its library, whereas our scope is narrowed to abstract
structures.
1The full list of prompts is available on GitHub, URL:
https://github.com/ChrisMJAndre/Detecting-AI-Authorship-Analyzing-Descriptive-Features-for-AI-Detection/blob/main/DatasetCreation.py
2HuggingFace: https://openai-openai-detector--w994j.hf.space/
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        All analyses and experiments described in this paper were conducted in English. The
linguistic features, including perplexity, grammar, n-gram distribution, type-token ratio, average
token length, and frequency of function words, were examined within the context of the
English language. The dataset3 crafted for this research is composed of two parts: one of
human-written abstracts, and the other of AI-generated abstracts produced with GPT-3.5-turbo. The
key challenge lies in finding distinguishing features between the two sets. We chose to
use abstracts from academic research papers for both the human and AI-generated datasets.
Abstracts generally conform to a standardized structure and length, and the comparable
datasets allow us to better understand distinguishing features. Based on
Kirchner et al. [
Kirchner et al.[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] we set a minimum character limit of 500 for the abstracts, trying to balance
the challenges of short texts against the computational load of processing larger documents.
      </p>
      <p>
        The corpus of academic research papers was obtained via Kaggle from the arXiv database, an
open-access archive managed by Cornell University [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For the human-written dataset, we
sampled 10,000 papers, excluding cases where the abstracts contained mathematical equations,
were under 500 characters, or where the paper was labeled as “no abstract”, “withdrawn”,
or “et al.” due to GPT-3.5-turbo's limitations with writing references. We also ensured that only
papers updated before 2021-01-01 were included, to eliminate potential GPT-3.5-turbo-generated
content. This left us with 2,100 human-written abstracts. Subsequently, when creating the
AI-generated abstracts, we employed OpenAI's ChatCompletion GPT-3.5-turbo v. 0.27.6 model4.
The model temperature was set to 0.7 to strike a balance between deterministic behavior and the
creativity of the generated content. For creative tasks, the most common temperatures lie within the
0.7 to 0.9 range [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and in the context of academic abstracts we wanted the AI to be coherent
and logical while still introducing some variability, which is why we chose the lower end of the
commonly chosen temperatures. ‘Top_p’ was set to 1, the recommended setting when the
temperature is adjusted. To simulate real-world queries and introduce further variation, we sampled
each query from a set of 60 slightly varying prompts. Ultimately, we had a dataset with 2,100
human-written abstracts and 2,100 AI-generated abstracts. However, due to GPT-3.5-turbo's
occasional omission of abstracts, we excluded 147 AI abstracts, leaving us with a final dataset of
2,100 human-written and 1,953 AI-generated abstracts.
      </p>
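      <p>The exclusion criteria above can be sketched as follows. This is a minimal illustration only: the function signature and the use of '$' or a backslash as a proxy for mathematical notation are our assumptions, not the actual logic of the paper's DatasetCreation.py:</p>

```python
from datetime import date

MIN_CHARS = 500
BAD_LABELS = ("no abstract", "withdrawn", "et al.")
CUTOFF = date(2021, 1, 1)  # keep only papers updated before this date

def keep_abstract(abstract: str, updated: date) -> bool:
    """Sketch of the filtering criteria described in Section 3."""
    text = abstract.strip().lower()
    if len(text) >= MIN_CHARS \
            and not any(label in text for label in BAD_LABELS) \
            and "$" not in text and "\\" not in text:  # crude proxy for equations
        return CUTOFF > updated  # pre-ChatGPT content only
    return False
```
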
    </sec>
    <sec id="sec-4">
      <title>4. Feature Understanding</title>
      <p>The challenge in detecting AI-generated text lies in its resemblance to human writing. Hence,
we aim to contribute to the current understanding of how specific linguistic characteristics
differentiate AI- and human-written text. Specifically, we conduct a comparative study
of how perplexity, certain grammatical structures, n-gram distributions, type-token ratios, as
well as the stylometric features average token length and frequency of function words, vary
between human- and AI-generated text.
3The dataset is available at Kaggle:
https://www.kaggle.com/datasets/heleneeriksen/gpt-vs-human-a-corpus-of-research-abstracts
4The code is available at GitHub:
https://github.com/ChrisMJAndre/Detecting-AI-Authorship-Analyzing-Descriptive-Features-for-AI-Detection</p>
      <sec id="sec-4-1">
        <title>4.1. Perplexity</title>
        <p>
          Perplexity quantifies the ability of a given probabilistic model to predict a given sample.
Effectively, it encapsulates the uncertainty of a model's prediction: the lower the perplexity score,
the higher the model's confidence in its outputs [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
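        <p>Concretely, for a sequence of N tokens with model-assigned probabilities p_i, perplexity is exp(-(1/N) Σ log p_i). A minimal sketch of the computation (illustrative token probabilities, not GPT-2 outputs):</p>

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

confident = perplexity([0.9, 0.8, 0.95])  # well-predicted tokens: low score
uncertain = perplexity([0.1, 0.2, 0.05])  # surprising tokens: high score
print(uncertain > confident)  # True
```
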
        <p>
          In comparing the AI-generated and human-written abstracts, we hypothesized that a
perplexity score might capture subtle differences in token-level predictability, and thereby the
randomness of tokens used in both types of abstracts. Given that state-of-the-art generative models like
ChatGPT are fine-tuned to minimize perplexity [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], we would expect this metric to be comparatively
lower for AI-generated abstracts.
        </p>
        <p>To examine our hypothesis, we used GPT-2, an autoregressive language model from
OpenAI, through the HuggingFace API5. This pre-trained model made it possible to compute
perplexity scores for each text sample within our corpus, offering a consistent and reliable
metric for the token-over-token predictability of each abstract.</p>
        <p>Our analysis revealed that human-written abstracts consistently show a higher perplexity
than their AI-generated counterparts, as shown in Figure 1.</p>
        <p>This implies that, based on the vocabulary of GPT-2, human writing tends to be significantly
more diverse at the token level. The AI-generated abstracts consistently exhibit higher conformity,
leading to comparatively higher levels of token-over-token predictability. This consistent gap
in perplexity scores between human-written abstracts and their AI-generated counterparts
suggests perplexity as a valuable discriminator in AI authorship attribution.</p>
        <p>While the vast vocabulary of generative models like GPT-3.5-turbo allows them to produce
coherent and contextually relevant text, they inherently conform to patterns of high statistical
likelihood, leading them to exhibit token-level regularity and predictability. In contrast, human
writing introduces novelty. This results in comparatively lower perplexity scores for AI-generated
text.
5https://huggingface.co/docs/transformers/perplexity</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Grammar</title>
        <p>Grammar refers to the set of structural rules that dictate the composition of sentences, phrases,
and words in any language. Since LLMs like GPT-3.5-turbo are trained on vast corpora of text,
they typically exhibit high grammatical accuracy [17]. Our hypothesis is that by incorporating
grammatical analysis as a feature, we might capture discrepancies between human- and
AI-generated abstracts.</p>
        <p>Using language_tool_python, an open-source library for the detection of grammatical and spelling
errors, we identified the number of errors in each abstract in our corpus. This enabled us to
compute a grammar score, calculated by dividing the number of errors detected by
language_tool_python by the number of tokens in each abstract.
Our comparison of grammatical correctness in human- and AI-generated texts reveals that
AI-generated abstracts consistently contain fewer grammatical errors, as shown in Figure 2.</p>
        <p>The grammatical aptitude of GPT-3.5-turbo may be attributed to the sheer volume of data on
which it is trained. Exposure to large amounts of textual data enables LLMs to understand, or
convincingly mimic, the rules and structures of languages.</p>
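        <p>The grammar score above reduces to a simple ratio. In a sketch (the error count would normally come from language_tool_python's LanguageTool('en-US').check(text); here it is passed in directly so the example stays self-contained, and the whitespace tokenization is our simplification):</p>

```python
def grammar_score(num_errors: int, text: str) -> float:
    """Grammar score = detected errors / number of tokens (whitespace split).
    In the paper's setup, num_errors would be
    len(language_tool_python.LanguageTool('en-US').check(text))."""
    tokens = text.split()
    return num_errors / len(tokens) if tokens else 0.0

print(grammar_score(2, "This are a abstract with some error in it"))  # 2 errors / 9 tokens
```
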
      </sec>
      <sec id="sec-4-3">
        <title>4.3. N-Gram Distributions</title>
        <p>
          Although eclipsed by emergent LLMs, n-grams[18] remain a reliable technique for the analysis
of linguistic patterns. We hypothesize that LLMs may favor specific token sequences due to
their training data, which will lead to a higher number of n-grams compared to in
humanwritten abstracts. To analyze the distribution of n-grams in our corpus we tokenize the texts
using NLTK, and log the prevalence of n-grams for each value of n in the range [
          <xref ref-type="bibr" rid="ref1 ref7">1,7</xref>
          ] for each
individual abstract. These scores are then aggregated across all texts to understand the overall
distribution. The intent behind the n-gram distribution is to discern token sequences that might be
comparatively more prevalent in either human-written or AI-generated texts. Our findings,
as visualized in Figures 3-5, show that AI-generated abstracts exhibit a higher frequency of the
same n-grams, especially in the higher n-gram ranges, compared to human-written abstracts. The
increasing frequency of identical n-grams in AI-generated abstracts, particularly in the higher
n-gram ranges, offers interesting insights into the behavior of generative models. Notably, the
disparity in n-gram distributions, most visible in the pronounced difference from the 3-gram
range onward, underscores the conformity of current AI-generated text. LLMs are trained and
evaluated on their ability to identify statistically significant patterns, which becomes particularly
evident in the dataset through the repetition of certain 5-grams (&gt;100 occurrences) and 7-grams (&gt;20
occurrences). Such frequent repetitions indicate that the model has identified these sequences
as ”safe bets” for language generation, resulting in their heavy use. Conversely, the more varied
distribution of n-grams in human-written abstracts reflects the relatively more nuanced ways
humans tend to express themselves, especially for higher values of n. Humans draw from
unique perspectives, which leads to a richer diversity in phrasing and structuring, as evident in
our findings. The relative scarcity of repetitive n-grams in human abstracts suggests a broader
array of stylistic choices. Within the field of AI authorship attribution, our findings suggest
that n-gram distributions can serve as a discriminant measure for detection, as high recurrence
of specific higher-order n-grams might be indicative of AI authorship.
        </p>
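        <p>The n-gram tally can be sketched as follows. The paper tokenizes with NLTK; plain whitespace splitting is used here to keep the sketch dependency-free:</p>

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count n-grams over a simple whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "the results show that the results are consistent"
print(ngram_counts(text, 2).most_common(1))  # [(('the', 'results'), 2)]
```

        <p>Aggregating such counters across all abstracts, per value of n, yields the distributions compared in Figures 3-5.</p>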
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Type-Token Ratio</title>
        <p>Type-Token Ratio (TTR) is used to measure the lexical diversity within a text [19]. In this
research, we focus on TTR with different n-gram sizes. This design choice allows us to analyze
lexical diversity at different scales, examining choices of individual tokens as well as the
complexity of three-token sequences. We chose to use the type-token ratio to
further unpack the n-gram distribution in our feature-based model, as TTR produces a
descriptive feature at the level of an individual abstract. As with the n-gram distributions, our hypothesis
was that LLMs would exhibit lower TTR compared to human-written texts. Our analysis
reveals a pronounced difference in the TTR between AI-generated and human-written abstracts,
particularly for higher-range n-grams, as shown in Figure 6, Figure 7, and Figure 8. Specifically,
AI-generated abstracts exhibited a lower TTR score, indicating less variety in both the tokens
and structures used. This concurs with the conclusions of the n-gram distributions.</p>
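        <p>TTR over n-grams is the number of unique n-grams divided by the total number of n-grams. A minimal sketch (whitespace tokenization is our simplification):</p>

```python
def ttr(text: str, n: int = 1) -> float:
    """Type-token ratio over n-grams: unique n-grams / total n-grams."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

repetitive = "we propose a method and we propose a method"
varied = "we propose a novel method for detecting generated text"
print(ttr(varied, 3) > ttr(repetitive, 3))  # True: repetition lowers the ratio
```
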
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Average Token Length and Frequency of Function Words</title>
        <p>The average token length offers insights into tendencies in writing styles. Upon analysis,
AI-generated abstracts display a consistent bias towards longer tokens relative to those used by
humans6, possibly a result of the LLM's comprehensive vocabulary, or its generally formal word
choice. Function words, primarily comprising prepositions, pronouns, and conjunctions,
serve as the backbone for grammatical relationships within sentences but bear minimal lexical
meaning. Our hypothesis aligned with the findings of Boukhaled &amp; Ganascia [20], which
illustrate that function word frequency can be used to discern stylistic nuances for authorship attribution.
Our results show a broader distribution of function words in human-written abstracts compared
to their AI-generated counterparts, hinting at differences in the diversity of writing styles7.</p>
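        <p>Both stylometric features reduce to per-token averages. A minimal sketch (the function-word set below is a small illustrative subset; the paper's full list is in its GitHub repository):</p>

```python
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "with"}

def stylometric_features(text: str):
    """Return (average token length, function-word frequency) for a text."""
    tokens = text.lower().split()
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    func_freq = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)
    return avg_len, func_freq

avg_len, func_freq = stylometric_features("the model is trained on a large corpus of text")
print(round(avg_len, 2), round(func_freq, 2))  # 3.7 0.4
```
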
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Modeling</title>
      <p>Based on our analyses, we construct two variants of machine learning approaches for AI
authorship attribution: a text-based approach and a feature-based approach. The former relies
solely on the text of the abstracts, while the latter uses the precomputed features described in
our analyses. The text-based approach is employed as a benchmark to assess the efficacy of the
feature-based approach, given the complexities of deriving a comprehensive view solely from a
textual analysis of abstracts. We evaluate the two approaches using three machine learning
algorithms for classification: Logistic Regression, Random Forest, and Multinomial Naïve Bayes.
A most-frequent-label classifier baseline, i.e., a zero-rule baseline strategy of predicting the
majority class, is included for comparison.
6Graphs available on GitHub: Visualizations/Average Token Lenght
7Graphs available on GitHub: Visualizations/Distribution of function words</p>
      <sec id="sec-5-1">
        <title>5.1. Feature-based Approach</title>
        <p>The feature-based approach utilizes the precomputed features to predict authorship attribution,
namely: perplexity, grammar, type-token ratio for 1-, 2-, and 3-grams, frequency of function
words, and average token length. Per our earlier analyses, these features hold
significant discriminatory value. Hyperparameter tuning was performed using grid search with
5-fold cross-validation, optimizing for precision scores8. Post-training, we extracted the most
influential features from the top-performing model.</p>
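        <p>The tuning setup can be sketched with scikit-learn as below. The data here is synthetic (random stand-ins for the seven feature columns), and the grid is an illustrative subset that includes the best-performing values reported in footnote 8; it is not the paper's exact search space:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))       # 7 features, as in the feature-based approach
y = (X[:, 0] > 0).astype(int)       # pretend the first feature separates the classes

grid = {"n_estimators": [20, 50], "max_depth": [10, 50], "min_samples_split": [2, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      scoring="precision", cv=5)  # 5-fold CV, optimize precision
search.fit(X, y)
print(search.best_params_)
```
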
        <sec id="sec-5-1-1">
          <title>5.1.1. Evaluation</title>
          <p>Our models report consistently high results with Random Forest achieving a precision score of
0.986 on the test data9, as shown in Table 1. While other classifiers like Logistic Regression and
MultinomialNB were competent, they were unable to outperform Random Forest.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Feature Importance</title>
          <p>Examining the feature importances calculated from the top-performing Random
Forest model, as shown in Table 2, we observe that perplexity dominates the decision function
with a weighted score of 0.71. Next, the grammar score and TTR_3ngram had significant
influence on the predictions. The grammar score's significant importance suggests that the
grammatical structure of the abstract is a distinguishing factor. It is worth noting that abstracts
might be comparatively less prone to grammatical errors as a result of the peer-review
process before publication. Arguably, the importance of the grammar score may be even higher
in the classification of texts from other domains. Our results indicate that perplexity,
grammatical structures, and type-token ratios could all be key features for distinguishing between
human- and AI-generated abstracts.
8Hyperparameters for the best performing model: max_depth: 50, min_samples_split: 10, n_estimators: 20.
9Distribution of dataset: Train (70%), Test (15%), Validation (15%)</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Text-based Approach</title>
        <p>For comparison, we also implemented a text-based approach. During this process the abstract
text underwent normalization, tokenization, stop word removal10, and punctuation elimination.
We employed the TF-IDF vectorizer to transform the text numerically, emphasizing n-grams to
capture token-sequence context. As with our primary approach, we optimized hyperparameters
through grid search with 5-fold cross-validation and based model selection on precision
scores11. The Logistic Regression model proved superior, exhibiting a test precision score of 0.988.</p>
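        <p>The text-based pipeline can be sketched with scikit-learn as below. The four-document corpus and labels are toy stand-ins, and the hyperparameters echo those reported in footnote 11 rather than being re-tuned here:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

docs = ["we present a study of detection", "this paper presents a novel study",
        "in this study we present results", "a novel detection method is presented"]
labels = [0, 1, 0, 1]   # 0 = human, 1 = AI (toy labels)

clf = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_df=0.9)),
                ("lr", LogisticRegression(C=0.1, penalty="l2", solver="liblinear"))])
clf.fit(docs, labels)
print(clf.predict(["we present a detection study"]))
```
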
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Overall Model Evaluation</title>
      <p>Overall, the evaluation of the text-based and feature-based approaches provides insights into
discriminative differences between human- and AI-generated abstracts. Both approaches led
to highly impressive results when evaluated on precision. However, when observing the log
probabilities of the predictions, the feature-based approach shows substantially more robust
discrimination between the two classes, as seen in Figure 9. This is demonstrated by
predicted probabilities that are primarily clustered near the extremes, either 0 or 1, suggesting
strong confidence in the classification. Conversely, the text-based approach's predicted
probabilities exhibit a more evenly spread distribution, mainly residing around 0.3 and 0.7,
indicating a lesser degree of certainty in its predictions.</p>
      <p>The feature-based approach, especially with the incorporation of the perplexity feature, offers
a robust and reliable indication of whether an abstract is AI-generated. In contrast,
the text-based model has multiple predictions with probabilities around 0.5, meaning the
classification is largely an arbitrary decision between the two classes. Samples within 0.4 to 0.6
could be considered likely “flukes”, where the model is closer to guessing than predicting.</p>
      <p>The high performance of the feature-based approach across training, test, and validation data
speaks to the generalizability of our approach. Overfitting, a common pitfall in
machine learning, typically manifests as high performance on the training data but a significant
dip in performance on unseen data. However, our model's ability to maintain consistently high
precision across all datasets counters this concern, indicating that it is not merely memorizing the
training data but learning underlying patterns that apply to new, unseen data. Conversely, the
attainment of exceptionally high performance across training, test, and validation data, such as
95%+ precision, raises considerations that warrant further scrutiny. The simplicity of
our task, or the informative nature of the chosen features, could explain the exceptionally high
performance.
10Using a custom list, GitHub file: DataUnderstanding &amp; Modeling.py
11Hyperparameters for best performing model: C: 0.1, penalty: l2, solver: liblinear, tfidf_max_df: 0.9,
tfidf_ngram_range: (5,6).
Figure 9: (a) Feature-based approach; (b) Text-based approach.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>
        While perplexity is the most decisive factor for the feature-based approach, as shown in Sections
4.1 and 5.1.2, the low perplexity scores for the AI-generated abstracts may be due to the narrow
scope of the research. This suggests the need for a multi-domain evaluation. Uchendu et al.
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] highlight a similar challenge, caused by using a single-topic corpus, and argue that basic
topical analysis cannot differentiate between human- and AI-generated text across domains. Yet,
as our domain, research paper abstracts, spans multiple topical areas, the domain influence
may be reduced. Similarly, the establishment of a reliable character minimum for AI authorship
attribution is still an open research question. Kirchner et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] recommend a 1,000-character
threshold, which conflicts with the structure of most abstracts, given their concise and
formulaic format.
      </p>
      <sec id="sec-7-1">
        <title>7.1. Limitations</title>
        <p>One notable limitation of our study is the use of perplexity to differentiate human-authored
abstracts from those generated by GPT-3.5-turbo. The potential limitation arises from the
shared lineage between GPT-2, used to calculate perplexity, and GPT-3.5-turbo. The similarities
in training data might introduce biases in our results, and different alternatives for calculating
perplexity could have yielded varying perplexity scores due to differences in training nuances.
A further limitation in our data generation and modeling process is the absence of a specified
random seed, which affects the replicability of our findings. The lack of a fixed seed for random
number generation introduces variability between runs, potentially influencing the consistency
of our results.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Future Research</title>
      <p>The selected temperature setting in LLMs holds considerable influence over both the variability
and coherence of the generated text. In this study, we chose a temperature value of 0.7 to strike
a balance between coherence and textual variation. It is important to note, however, that this
choice, falling at the lower end of the 0.7 to 0.9 range, may limit diversity and perhaps
induce slight repetition, as corroborated by the frequent n-gram repetitions discussed in Section
4.3. For future research, we recommend examining the effects of varying temperature settings on
text generation, particularly focusing on aspects like perplexity and repetition. A more nuanced
understanding of temperature's influence on text generation could offer valuable insights
into the patterns of AI-generated text and improve methods for authorship attribution. To
facilitate an in-depth exploration of our feature-based approach's performance, we have
highlighted the top 3 most challenging classifications for both AI and human abstracts, available
on GitHub12. This data would allow future research to examine which feature combinations
present challenges for detecting AI authorship. Furthermore, it is important to acknowledge
that text generated by LLMs can be altered to avoid detection. Techniques such as swapping
out key phrases, inserting unexpected vocabulary, or embedding grammatical errors into the
abstracts could all increase the probability of the abstracts being misidentified as human writing.
Considering the societal impact, it remains imperative that the field of AI authorship attribution
develops in parallel with the evolution of LLMs.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>In the rapidly progressing field of artificial intelligence, the emergence of sophisticated
generative models poses both opportunities and challenges. This research sheds light on potential key
differences between human-written and AI-generated academic abstracts. Through a comprehensive
analysis of various textual features, including perplexity, grammar, n-gram distributions, and
type-token ratios, the study reveals clear discriminatory patterns. The AI-generated texts are shown
to exhibit higher token-level predictability and grammatical accuracy compared to their
human-written counterparts. Our approaches, especially the feature-based one, demonstrate
remarkable precision and certainty in distinguishing between human- and AI-generated content.
Such precision emphasizes the significance of the chosen features, with perplexity standing out
as a particularly influential metric in tackling the challenge of AI authorship attribution.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>We would like to express our gratitude to Professor Daniel Hardt for his invaluable guidance,
support, and mentorship throughout the duration of the research project.
12GitHub file: Worst classified AI and Human Abstracts.xlsx
raj-k-kadiyala.medium.com/gptzero-vs-chatgpt-a-gray-story-901b825e0666, last accessed:
September 10, 2023.
[17] H. Wu, W. Wang, Y. Wan, W. Jiao, M. Lyu, ChatGPT or Grammarly? Evaluating ChatGPT on
grammatical error correction benchmark, 2023. Last accessed: September 10, 2023.
[18] S. Srinidhi, Understanding word n-grams and n-gram probability in natural language
processing, Medium,
https://towardsdatascience.com/understanding-word-n-grams-and-n-gram-probability-in-natural-language-processing-9d9eef0fa058, 2019. Last accessed:
September 10, 2023.
[19] D. Thomas, Type-token ratios in one teacher's classroom talk: An investigation of lexical
complexity, 2005. Last accessed: September 10, 2023.
[20] M. Boukhaled, J.-G. Ganascia, Authorship attribution, 2017. URL:
https://www.sciencedirect.com/topics/computer-science/authorship-attribution, last accessed:
September 10, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassignana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , Preface to the
          <source>Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI)</source>
          ,
          <source>in: Proceedings of the Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI 2023) co-located with the 22nd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] OpenAI, Gpt-4
          <source>technical report</source>
          ,
          <year>2023</year>
          . arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bamania</surname>
          </string-name>
          ,
          <article-title>When should you use accuracy, precision, recall &amp; f-1 score?</article-title>
          , https://levelup.gitconnected.com/4-important-metrics-for-classification-machinelearning-models-when-how-to-use-them-6aa7c85d7665,
          <year>2022</year>
          . Last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Pearson</surname>
          </string-name>
          ,
          <article-title>Comparing scientific abstracts generated by chatgpt to real abstracts with detectors and blinded human reviewers</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>6</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . URL: https://doi.org/10.1038/s41746-023-00819-6. doi:10.1038/s41746-023-00819-6.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] L. G., M. R., K. E., B. Y.,
          <article-title>Identifying chatgpt-written obgyn abstracts using a simple tool</article-title>
          .,
          <source>Am J Obstet Gynecol MFM</source>
          <volume>5</volume>
          (
          <year>2023</year>
          ). URL: https://pubmed.ncbi.nlm.nih.gov/36931435/. doi:10.1016/j.ajogmf.2023.100936.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>How close is chatgpt to human experts? comparison corpus, evaluation, and detection</article-title>
          ,
          <year>2023</year>
          . arXiv:2301.07597.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mitrović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Andreoletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ayoub</surname>
          </string-name>
          ,
          <article-title>Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text</article-title>
          ,
          <year>2023</year>
          . arXiv:2301.13852.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uchendu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Authorship attribution for neural text generation</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>8384</fpage>
          -
          <lpage>8395</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.673. doi:10.18653/v1/2020.emnlp-main.673.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <year>2019</year>
          . arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <article-title>Gpt-sentinel: Distinguishing human and chatgpt generated content</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.07969.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Strobelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Gltr:
          <article-title>Statistical detection and visualization of generated text</article-title>
          , CoRR abs/1906.04043 (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . URL: http://arxiv.org/abs/1906.04043. arXiv:1906.04043.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kirchner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aaronson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <article-title>New ai classifier for indicating ai-written text</article-title>
          ,
          <year>2023</year>
          . URL: https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text, last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] Cornell University, J. Tricot, devrishi,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maltzan</surname>
          </string-name>
          , S. Brinn, arxiv dataset,
          <year>2023</year>
          . URL: https://www.kaggle.com/datasets/Cornell-University/arxiv, last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Kolar</surname>
          </string-name>
          ,
          <article-title>A simple guide to setting the gpt-3 temperature,</article-title>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <article-title>Perplexity: a more intuitive measure of uncertainty than entropy</article-title>
          .,
          <year>2021</year>
          . URL: https://mbernste.github.io/posts/perplexity/, last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kadiyala</surname>
          </string-name>
          , Medium.com:
          <article-title>Gptzero vs chatgpt - a gray story</article-title>
          ,
          <year>2023</year>
          . URL: https://raj-k-kadiyala.medium.com/gptzero-vs-chatgpt-a-gray-story-901b825e0666, last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] H. Wu, W. Wang, Y. Wan, W. Jiao, M. Lyu, Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark, 2023. Last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] S. Srinidhi, Medium: Understanding word n-grams and n-gram probability in natural language processing, https://towardsdatascience.com/understanding-word-n-grams-and-n-gram-probability-in-natural-language-processing-9d9eef0fa058, 2019. Last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] D. Thomas, Type-token ratios in one teacher’s classroom talk: An investigation of lexical complexity, 2005. Last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] M. Boukhaled, J.-G. Ganascia, Authorship attribution, 2017. URL: https://www.sciencedirect.com/topics/computer-science/authorship-attribution, last accessed: September 10, 2023.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>