<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deceptive ChatGPT and Human Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore Giorgi</string-name>
          <email>sgiorgi@sas.upenn.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David M. Markowitz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikita Soni</string-name>
          <email>nisoni@cs.stonybrook.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasudha Varadarajan</string-name>
          <email>vvaradarajan@cs.stonybrook.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siddharth Mangalik</string-name>
          <email>smangalik@cs.stonybrook.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>H. Andrew Schwartz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>M. Litvak, I. Rabaev, R. Campos, A. Jorge, A. Jatowt (eds.): Proceedings of the IACT'23 Workshop</institution>,
          <addr-line>Taipei, Taiwan, 27-</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Liars and truth-tellers often communicate differently. When tasked with writing deceptive text, humans use prior knowledge and experiences to intentionally deceive their target. Text generated automatically via large language models (LLMs), on the other hand, mirrors training instances that were most likely written by humans. In the case of content like reviews, automatically generated language is inherently deceptive because the system is not grounded in material-world experiences. In this paper, we characterize differences between (a) truthful text written by humans, (b) intentionally deceptive text written by humans, and (c) inherently deceptive text written by a state-of-the-art language model (ChatGPT). We examined the expression of thirteen psychologically grounded and fundamental human traits (e.g., personality and empathy) across truthful and deceptive hotel reviews, finding that texts written by humans were more diverse (had more variation) in their expressions of personality than texts written by ChatGPT. Across all human traits, we found that truthful and deceptive human language was easier to distinguish from machine-generated language. Building on these differences, we trained a classifier using only the thirteen human traits to automatically discriminate between truthful and deceptive language, with a classification AUC of up to 0.966. Thus, despite the fact that large language models mirror text written by genuine (and truthful) humans, their lack of diversity in human traits makes them easier to identify. These results suggest that psychologically grounded human traits offer a robust feature set unaffected by the “human-ness” of LLM language, and further suggest that AI and humans are behaviorally different when communicating about experiences.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>https://sjgiorgi.github.io/ (S. Giorgi); https://www.davidmarkowitz.org/ (D. M. Markowitz);
https://smangalik.github.io/ (S. Mangalik); https://www3.cs.stonybrook.edu/~has/ (H. A. Schwartz)</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Recent progress in natural language processing has led to the development of advanced large
language models (LLMs) such as LLaMa, GPT-3, and GPT-4. These models can generate
high-quality linguistic outputs that are often indistinguishable from those written by humans [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 3</xref>
        ].
Even in their short time of mainstream popularity, Artificial Intelligence (AI) systems running LLMs have
had a considerable impact on content generation for everyday tasks [4], language translation [5],
and learning about human nature through the comparison of LLM outputs to human outputs.
For example, recent evidence suggests LLM moral judgment and decision-making are almost
perfectly aligned with humans [6] and classic psychology experiments (e.g., The Milgram
Experiment) have been replicated using LLMs as well [7]. Altogether, LLMs are remarkably
powerful at language generation and task completion, yet it is unclear how their linguistic
outputs compare to those of humans in ways that reveal important social and psychological
differences between human and non-human communicators.
      </p>
<p>To this end, we draw on decades of social science research to understand how individual
differences (e.g., age, gender, personality traits) are revealed in language, and how such linguistic
signals differ depending on the communicator (AI vs. humans). We contextualize our work
by examining how individual differences in language can reveal when one is lying or telling
the truth. Individual differences are identifiable in human language [8, 9, 10] and they are also
critical in deception research (e.g., younger people tend to lie more than older people; men tend
to lie more prolifically than women) [11, 12, 13, 14], suggesting it is important to consider how
trait-level linguistic inferences can reveal honest and dishonest AI and humans.</p>
      <p>This work contributes to our understanding of how AI language does and does not look like
human language. We examine the homogeneity of generated text on the basis of thirteen human
factors: two demographics (age and gender), empathy, five personality dimensions (openness,
conscientiousness, extraversion, agreeableness, and emotional stability), and five latent behavior
factors. These features are validated by demonstrating their viability for discriminating AI texts,
in the form of hotel reviews, from those written by humans. We further use these human factors
to automatically discriminate between truthful and deceptive reviews. To do this, we begin
by examining differences in the human-factor distributions across AI-generated reviews (via
ChatGPT), truthful human reviews, and deceptive human reviews. Next, we train a classifier
to predict, out-of-sample, the label of “deceptive machine”, “truthful human”, and “deceptive
human”, to understand which human traits are most discriminative of AI and human texts.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
<p>Researchers working to identify textual differences between LLMs and humans have developed
several promising detection techniques, often examining social bots [15, 16], or automated
computer algorithms that generate posts and content on social media platforms. Modern
techniques for differentiating human language from AI language often focus on fingerprinting
models via watermarked outputs [17], statistical modeling of word use [18], neural network
classifiers [19, 20], and zero-shot classifiers built on other LLMs [21]. Fine-tuned solutions in
particular have had great success in binary classification settings [22, 19]. However, these
methods are fragile to changes in training datasets [23, 24] and paraphrasing attacks [25, 26],
and performance can be specific to the target model [27]. Improvements seen in recent models
with ever-larger parameter spaces further threaten the capabilities of current detection
methodologies.</p>
<p>Among the existing paradigms described in a recent survey of AI language detectors [28], our
work fits into black-box modeling, where we are interested in the outputs of LLMs rather than a
specific model’s contents or design. This allows us to focus on language differences between
human and AI texts rather than on the particular implementation details of models or detectors.
For example, studies of chat environments have found that AI texts are on average less verbose
and less analytical [29], which can make differentiation more difficult, since longer and more
complex AI text is more easily identified by trained detectors [30].</p>
      <p>Here, we look to researchers who have proposed using emotional, personality, and
demographic cues to detect text generated by advanced language models. While LLMs can produce
grammatically correct and coherent text, they may struggle to capture the range of human
emotion and personality, which are key elements of natural language. Characterizing text
written by advanced language models via differences in emotional and personality cues is a
growing area of research. Sentiment analyses of bots have found that they demonstrate low
variance in emotionality, personality, and demographics, and skew toward positive emotional
sentiment when compared to human authors [31]. Giorgi et al. [32] examined the use of
personality traits to detect text generated by automated bots on Twitter, namely social
spambots. The researchers used the Big Five personality traits (openness, conscientiousness,
extraversion, agreeableness, and emotional stability) as a basis for analysis, and found that text
from social spambots exhibited higher levels of emotional stability and openness than text
written by humans. Consistent across all findings was, once again, a narrow range of language
traits compared to more varied human expression.</p>
<p>AI-generated texts prompted to produce positive language have been found to consistently
contain more positive emotional language, more adjectives, and more analytic writing, and to
be less readable than human language [33]. Unlike humans, large language models tend to be
neutral by default and lack emotional expression. Other research has shown that ChatGPT
produces significantly less negative emotion, hate speech, and punctuation implying subjective
feelings than human-authored texts; likewise, AI text also lacked purpose and readability [34, 35].</p>
<p>Generated language is additionally dissimilar to human language when viewed through the
lens of deception. Human authors tasked with writing fake hotel reviews intentionally deceive
their audience and draw on their prior knowledge and experiences in their texts. Models tasked
with generating hotel reviews, however, will mirror language typical of reviews while
fundamentally deceiving people when referring to a product they could not have engaged with.
For example, in one such generated hotel review, ChatGPT states “...The room was spacious,
clean, and had all the amenities I needed for a comfortable stay. The bed was comfortable and I
slept like a baby every night...”. Clearly, the AI and its language mislead the reader in a text
scenario that would naturally include mention of personal experiences; however, this system
is necessarily ungrounded in the material world. It therefore engages in inherent deception
because AI cannot have an experience like a human, but it can write as if it did.</p>
<p>Next, we briefly review the deception and language literature—with a focus on individual
differences—to further ground our examination. We draw on this literature to demonstrate
that our key individual differences of interest (e.g., personality, gender, age) can not only be
extracted from natural language but are also directly relevant to deception research.</p>
      <sec id="sec-3-1">
        <title>2.1. An Overview of Deception and Language Scholarship</title>
<p>A long history of deception research suggests liars communicate differently than truth-tellers
across a range of verbal dimensions [36]. For example, liars often use more negative emotion
terms than truth-tellers [37], liars tend to use fewer details than truth-tellers [38, 39], and liars
also communicate less about the self compared to truth-tellers [40, 41, 42]. To evaluate these
patterns, most deception research has used human language to determine how people lied
or told the truth, but recent work has also considered how texts from AI and LLMs can
reveal deception [33]. That is, Markowitz and colleagues [33] had ChatGPT write hotel
reviews and compared the text to human-written reviews that were intentionally deceptive or
honest [43]. ChatGPT outputs, which were deemed inherently deceptive (e.g., a chatbot cannot
have an experience like staying at a hotel, even if it wrote like it did), were more affective and
less descriptive than those of humans who were deceptive. This work suggests AI and humans
are behaviorally different at the language level when communicating about experiences, and it
served as a springboard for our investigation into how other characteristics linked to deception
(e.g., individual differences) might be revealed in the language of AI and humans.</p>
<p>Indeed, human deception research finds that liars and truth-tellers are distinct across a range
of individual-difference measures. Younger individuals and men tend to lie more than people
of other demographic groups [44, 13, 45], people who are high on aversive personality traits
(e.g., psychopathy, narcissism, Machiavellianism) tend to lie more than people who are low
on aversive personality traits [46, 12, 13], and Big 5 personality traits like extraversion predict
greater lying rates as well [47]. Therefore, individual differences matter in deception, making
it critical to understand how such traits can be approximated linguistically and the degree to
which such features can distinguish human from non-human communicators.</p>
<p>In sum, we used state-of-the-art NLP techniques to evaluate how individual differences
(e.g., age, gender, personality) can be revealed in language, and how such linguistic features
characterize inherently deceptive LLMs versus intentionally deceptive humans when writing
about experiences (e.g., staying at a hotel). This work has several nontrivial implications.
First, it is presently unclear how individual differences manifest in LLM outputs; thus,
we use a range of NLP techniques to identify the features that separate human and non-human
communicators at the language level. In parallel, we use extracted linguistic features
to provide one of the first studies to evaluate deception detection accuracy using human and
non-human language. The ability to detect deception is a natural question to consider following
feature extraction, and therefore we use the features from our previous empirical stages to
create a classifier that distinguishes inherent AI deception from intentional human deception
and truthful communication. The findings of this work can be immediately informative for
NLP and recommender systems research by offering feature types (e.g., individual-difference
characteristics) and baseline detection accuracies across a range of models. This work may also
help organizations think about how to use NLP to curb the use of LLMs in contexts where
they prize genuine, human content generation (e.g., reviews on Yelp, TripAdvisor, Amazon, etc.).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Data</title>
      <p>The data set consists of both truthful and deceptive hotel reviews. The truthful reviews were
written by humans; the deceptive reviews were written by both machines (i.e., AI-generated
text) and humans, neither of whom stayed at the hotels.</p>
<p>The truthful hotel reviews (TruthH) are from the data set collected by Ott et al. [43], in which
reviews were collected from TripAdvisor for 20 hotels in Chicago, IL, USA (see Appendix for the
list of 20 hotels). For each hotel, 20 reviews were collected (400 total truthful reviews). Ott et al.
[43] manually preprocessed the reviews to ensure they were truthful and genuine.</p>
<p>The AI-generated reviews (DeceptAI) were collected through the OpenAI API for ChatGPT-3.5,
for the same 20 hotels in Chicago, collecting 20 reviews for each hotel by independently
querying with the same prompt. The prompt sent to the API was “Write me a positive hotel
review for the &lt;HOTEL&gt; in Chicago. The review must be 120 words long.” The parameters
were: temperature: 1, frequency penalty: 2, and presence penalty: 1. These parameters gave
ChatGPT the best chance at producing diverse and creative responses. Although the prompt
specified 120 words, the AI-generated positive deceptive reviews contained 136.7 words on
average. We experimented with other settings, as well as GPT-4; these tests are included in the
Appendix. The AI-generated reviews were collected for the present study.</p>
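<p>The collection setup above can be sketched as follows. The prompt template and sampling parameters match those reported here, while the helper name, the model identifier string, and the example hotel name are illustrative stand-ins; the actual OpenAI client call is omitted.</p>

```python
# Sketch of assembling one review-collection query; the actual API call is omitted.
PROMPT_TEMPLATE = (
    "Write me a positive hotel review for the {hotel} in Chicago. "
    "The review must be 120 words long."
)

def build_request(hotel: str) -> dict:
    """Build one ChatGPT-3.5 query payload for a single hotel review."""
    return {
        "model": "gpt-3.5-turbo",  # illustrative model identifier
        "messages": [{"role": "user",
                      "content": PROMPT_TEMPLATE.format(hotel=hotel)}],
        "temperature": 1,          # sampling parameters reported above
        "frequency_penalty": 2,
        "presence_penalty": 1,
    }

# 20 independent queries per hotel, over 20 hotels -> 400 AI-generated reviews
requests = [build_request("Hotel Allegro") for _ in range(20)]
```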
<p>Human deceptive reviews (DeceptH) were collected by presenting the same prompts to
humans, who wrote reviews for each of the 20 hotels. These reviews were also collected by
Ott et al. [43]. Thus, we use 400 truthful human reviews, 400 deceptive reviews generated by
ChatGPT-3.5, and 400 deceptive human reviews, for a total of 1,200 reviews.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Estimating Human Traits</title>
      <p>We extracted 13 human traits from the reviews: two demographics (age, gender), five personality
traits (openness, conscientiousness, extraversion, agreeableness, emotional stability), empathy,
and five latent behaviors. All trait models were developed in prior work, with details below.</p>
      <sec id="sec-5-1">
        <title>4.1. Demographics</title>
        <p>Age and gender were estimated using a predictive model built by Sap et al. [48]. This model was
trained on Twitter, Facebook, and blog data from over 70,000 people who self-reported their
age (continuous) and gender (binary female/male gender; multi-class gender was not available).
Unigrams were extracted from the social media posts and used within a penalized Ridge
regression (age) and a support vector classifier (gender). Out-of-sample prediction accuracy for
the age model was Pearson r = 0.86, while the gender model had an accuracy of 0.90. The gender
model, while trained to predict a binary gender, produces a continuous score, with negative
values being more “male” and positive values being more “female.”</p>
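<p>The general recipe described above (unigram features fed to a penalized ridge regression) can be sketched with scikit-learn as follows. The toy corpus and age labels are invented for illustration and bear no relation to the actual model of Sap et al. [48], which was trained on data from over 70,000 people.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Toy corpus with invented age labels (the real model used social media posts)
texts = ["great stay loved the pool", "quiet room good service",
         "awesome party weekend trip", "comfortable bed polite staff"]
ages = [24.0, 41.0, 22.0, 45.0]

# Extract unigram counts and fit a penalized ridge regression for age
unigrams = CountVectorizer(ngram_range=(1, 1))
X = unigrams.fit_transform(texts)
model = Ridge(alpha=1.0).fit(X, ages)

# Predict a continuous age estimate for a new review
predicted_age = model.predict(unigrams.transform(["loved the pool party"]))[0]
```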
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Personality</title>
        <p>Big 5 personality was estimated using the model by Park et al. [49]. This model was trained on
Facebook statuses to predict personality from over 66,000 people who reported their personality
via the International Personality Item Pool [50]. All items were on a 5-point Likert scale and
the final personality dimensions were averages of all items within each trait (thus, a final
score between 1 and 5). A penalized ridge regression was trained on unigrams and Latent
Dirichlet Allocation (LDA) topics from the participants’ Facebook statuses. This model resulted
in out-of-sample prediction accuracies (Pearson r) of 0.43 (openness), 0.37 (conscientiousness),
0.42 (extraversion), 0.35 (agreeableness), and 0.35 (emotional stability).</p>
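<p>A minimal sketch of this feature pipeline (unigram counts plus LDA topic loadings feeding a penalized ridge regression) with scikit-learn; the texts and trait scores below are invented for illustration, not data from Park et al. [49].</p>

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Toy texts with invented extraversion scores on the 1-5 scale
texts = ["i love meeting new people at parties",
         "stayed home reading a quiet book",
         "organized every detail of the trip",
         "spontaneous road trip with friends"]
extraversion = [4.5, 2.0, 3.0, 4.0]

# Unigram counts, then LDA topic loadings derived from those counts
counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)

# Concatenate unigrams and topics, then fit the penalized ridge regression
X = hstack([counts, csr_matrix(topic_features)])
model = Ridge(alpha=1.0).fit(X, extraversion)
```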
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Empathy</title>
        <p>We used a model built to predict the Empathic Concern dimension (henceforth referred to
as empathy) [51] from the Interpersonal Reactivity Index [52] using a preexisting data set of
self-reported empathy and shared Facebook status updates [53, 54]. This model was trained
using a set of LDA topics (derived from the Facebook status updates) in a penalized ridge
regression model and resulted in an out-of-sample Pearson r of 0.26.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Behavioral Latent Dimensions</title>
<p>We used five behavioral latent dimensions (Behavioral Latent Traits; BLTs), originally built
by Kulkarni et al. [55] and estimated from approximately 50,000 users who shared their
Facebook status updates. Factors are extracted via a factor analysis over ngram frequencies
from Facebook statuses. These factors are used to model everyday human language and can be
viewed as a data-driven, open-vocabulary analog to the five personality dimensions. These
dimensions have been found to be more generalizable than personality (i.e., they predict
non-survey-based outcomes such as income) and are stable across time and populations.</p>
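<p>The factor-extraction step can be sketched with scikit-learn as follows. The frequency matrix here is random stand-in data; the actual BLTs come from the pretrained factors of Kulkarni et al. [55].</p>

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Stand-in ngram relative-frequency matrix: 100 users x 50 ngrams
freqs = rng.random((100, 50))

# Factor analysis reduces ngram frequencies to five latent dimensions per user
fa = FactorAnalysis(n_components=5, random_state=0)
factors = fa.fit_transform(freqs)
```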
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Methods</title>
<p>We first considered linguistic differences in human trait distributions, and then examined how
such traits discriminated between human and AI reviews via out-of-sample classification.</p>
      <sec id="sec-6-1">
        <title>5.1. Human Trait Distributions</title>
<p>We visualized the distributions of all reviews via a kernel density estimate to understand
differences in means and variances. To formalize these differences, we ran a two-sample
Kolmogorov–Smirnov (KS) test between all pairs of review types (DeceptAI vs. DeceptH,
DeceptAI vs. TruthH, and TruthH vs. DeceptH), which quantifies the distance between
distributions. To account for multiple comparisons, significance levels are Bonferroni
corrected [56].</p>
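<p>As a minimal sketch, the two-sample KS test and Bonferroni-corrected threshold could be computed with SciPy as follows; the trait scores here are synthetic stand-ins (a peaked "AI" distribution versus a wider "human" one), not the study's data.</p>

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic trait scores: AI reviews peaked, human reviews more variable
trait_ai = rng.normal(loc=0.2, scale=0.1, size=400)
trait_human = rng.normal(loc=0.0, scale=0.4, size=400)

# Two-sample KS test quantifies the distance between the two distributions
stat, p = ks_2samp(trait_ai, trait_human)

# Bonferroni correction: 13 traits x 3 pairwise comparisons = 39 tests,
# so each test is judged at alpha / 39
alpha = 0.05
n_tests = 13 * 3
significant = p < alpha / n_tests
```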
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Classification</title>
<p>Next, we built classifiers that distinguished AI-generated hotel reviews from deceptive
human-generated reviews (DeceptAI vs. DeceptH), AI-generated reviews from truthful
human-generated reviews (DeceptAI vs. TruthH), and truthful human reviews from deceptive
human reviews (TruthH vs. DeceptH). There are 800 observations for each model (e.g., 400
truthful human reviews and 400 deceptive human reviews in TruthH vs. DeceptH). Finally, we
evaluate a multiclass classifier, where we attempt to distinguish all three types: TruthH,
DeceptH, and DeceptAI. This model contains 1,200 observations (400 for each class).</p>
<p>All feature representations were fit using a logistic regression model, as implemented in
Scikit-learn [57], with an inverse regularization strength (C) of 100,000 in order to approximate
an unregularized model. Models were evaluated using 5-fold cross-validation, with classes
stratified across each fold. We report the Area under the ROC Curve (AUC) for each model,
which varies between 0 and 1, where a value of 0.5 corresponds to random guessing. The
multiclass classifier is evaluated using the micro-average AUC. Human trait extraction and
classification were all done with the open-source Python package DLATK [58].</p>
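<p>A minimal sketch of this evaluation setup with Scikit-learn, using synthetic stand-in features (with an injected class difference) rather than the actual trait estimates.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
# Stand-in feature matrix: 800 reviews x 13 human-trait estimates
X = rng.normal(size=(800, 13))
y = np.repeat([0, 1], 400)   # e.g., 400 truthful vs. 400 deceptive reviews
X[y == 1] += 0.5             # inject a separable signal for the demo

# Very large C approximates an unregularized logistic regression
clf = LogisticRegression(C=100_000, max_iter=1000)

# 5-fold cross-validation with classes stratified across folds, scored by AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
mean_auc = aucs.mean()
```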
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Results</title>
      <p>Figure 1 shows the density plots for each human trait. Across most traits, ChatGPT reviews
exhibited smaller variation (distributions were more peaked and less wide). ChatGPT reviews
were also higher in openness, emotional stability, empathy, and factors 1, 2 and 4 of the BLTs.
When comparing the human generated reviews (both deceptive and truthful, in dotted and
solid lines, respectively), both had a larger variation (wider distributions) and similar means,
suggesting that the human reviews, regardless of their intention, had similar human traits.</p>
<p>Table 1 shows the results of the KS test. Here, only empathy was significantly different
across TruthH vs. DeceptH, quantifying what was shown in the distributions: human reviews,
regardless of their intention, are similar. When comparing DeceptAI to the human text, results
show that DeceptAI was closer to DeceptH (i.e., smaller KS distances) than to TruthH across 11
out of 13 human traits (Table A2). Therefore, deceptive human text is closer to truthful human
text than to deceptive ChatGPT text (last two columns), but deceptive ChatGPT text is closer to
deceptive human text than to truthful human text (first two columns). Full results for the
behavioral latent traits are in Appendix Table A2.</p>
<p>Table 2 lists the classification accuracies (AUC) and shows similar results to the KS tests.
For each feature set (row), we see that the most difficult task is TruthH vs. DeceptH, which
distinguishes between humans. This has an average AUC of 0.596 (across all human traits).
Next, we see that DeceptAI vs. DeceptH has the second-highest AUC, with an average value
of 0.766 (again, across all human-trait models). Finally, DeceptAI vs. TruthH has the highest
average AUC of 0.814. This may indicate that DeceptAI is closer to DeceptH than to TruthH and
that deceptive text, regardless of who or what created it, is different from truthful text.</p>
      <p>The multiclass classifier attains the highest AUC for the Behavioral Latent Traits (AUC =
0.801) and all features combined (AUC = 0.798). This suggests that personality, empathy, and
demographics do not add additional predictive signal above the Behavioral Latent Traits. This is
not the case for the binary classifiers. Here the Behavioral Latent Traits are the most predictive
single class of features, but combining all feature sets results in a boost over the Behavioral
Latent Traits alone.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusions</title>
      <p>In this paper, we showed that human traits, represented in language, distinguish between
inherently deceptive ChatGPT texts, intentionally deceptive human texts, and truthful human
texts. ChatGPT was more limited in its expression of human traits, with personality in particular
showing very limited variation. In other words, recent versions of ChatGPT seem to produce
text equivalent to a median person for most traits, while for openness and emotional stability,
ChatGPT scored higher estimates in line with a positivity bias in LLMs [33]. Indeed, these results
also dovetail with previous research that found social spambots on Twitter to be extremely
limited in expressions of personality [32].</p>
      <p>The results from our predictive analyses suggest psychologically grounded human traits are
able to accurately distinguish between deceptive and truthful humans (AUC = 0.678) as well as
deceptive machines and truthful humans (AUC = 0.966). Considering the drastically different
dimensionality of features based on psychological theory (thirteen features) as compared to
more sophisticated feature sets (100s to 1000s of features; see Appendix), such psychologically
grounded features, with a rich history of empirical and qualitative work describing their
meaning, can offer an avenue of interpretability for distinguishing text from LLMs.</p>
      <p>While we evaluated large language models using human traits, we do not mean to imply
that these systems are human and note that there are many risks when doing so. Personifying
and relating to automated systems may create transparency issues, which can be exacerbated
by high-stakes tasks; see Abercrombie et al. [59] for a detailed discussion. Furthermore, such
personifications may further propagate stereotypes, e.g., defaulting to female-gendered versus
gender-ambiguous systems [60]. The present work’s use of human traits shows that while these
systems do express reasonable values (e.g., an average estimated age of 35), there is a lack of
variance. In other words, the systems may look human, on average, but fail to express a range
or diversity of traits, thus highlighting their limitations.</p>
      <p>Future work should continue to interrogate these approaches to compare how they might
perform in other tasks that involve human and non-human text classifications. It is important
for future work to consider different types of experiential text to identify how the results in
the current paper hold. Further, as LLMs continue to improve in their ability to approximate
human communication, future studies should examine classification accuracies against our
results to benchmark how models can discriminate between human and non-human texts that
are communicated honestly and dishonestly.</p>
      <p>[2] N. Köbis, L. D. Mossink, Artificial intelligence versus Maya Angelou: Experimental evidence
that people cannot differentiate AI-generated from human-written poetry, Computers in
Human Behavior 114 (2021) 106553. doi:10.1016/j.chb.2020.106553.
[3] S. Kreps, R. M. McCain, M. Brundage, All the news that’s fit to fabricate: AI-generated
text as a tool of media misinformation, Journal of Experimental Political Science 9 (2022)
104–117. doi:10.1017/XPS.2020.37.
[4] J. Zamora, I’m sorry, Dave, I’m afraid I can’t do that: Chatbot perception and expectations,
in: Proceedings of the 5th International Conference on Human Agent Interaction, 2017, pp.
253–260. doi:10.1145/3125739.3125766.
[5] W. Jiao, W. Wang, J. tse Huang, X. Wang, Z. Tu, Is ChatGPT a good translator? Yes with
GPT-4 as the engine, 2023. doi:10.48550/arXiv.2301.08745. arXiv:2301.08745.
[6] D. Dillion, N. Tandon, Y. Gu, K. Gray, Can AI language models replace human participants?,</p>
      <p>Trends in Cognitive Sciences (2023). doi:1 0 . 1 0 1 6 / j . t i c s . 2 0 2 3 . 0 4 . 0 0 8 .
[7] G. Aher, R. I. Arriaga, A. T. Kalai, Using large language models to simulate multiple humans
and replicate human subject studies, 2023. a r X i v : 2 2 0 8 . 1 0 2 6 4 .
[8] M. L. Kern, J. C. Eichstaedt, H. A. Schwartz, L. Dziurzynski, L. H. Ungar, D. J. Stillwell,
M. Kosinski, S. M. Ramones, M. E. Seligman, The online social self: An open vocabulary
approach to personality, Assessment 21 (2014) 158–169. doi:1 0 . 1 1 7 7 / 1 0 7 3 1 9 1 1 1 3 5 1 4 1 0 4 .
[9] M. L. Newman, C. J. Groom, L. D. Handelman, J. W. Pennebaker, Gender diferences in
language use: An analysis of 14,000 text samples, Discourse processes 45 (2008) 211–236.
doi:1 0 . 1 0 8 0 / 0 1 6 3 8 5 3 0 8 0 2 0 7 3 7 1 2 .
[10] J. W. Pennebaker, L. A. King, Linguistic styles: language use as an individual diference.,
Journal of personality and social psychology 77 (1999) 1296. doi:1 0 . 1 0 3 7 / 0 0 2 2 - 3 5 1 4 . 7 7 . 6 .
1 2 9 6 .
[11] M. C. Ashton, K. Lee, R. E. De Vries, The hexaco honesty-humility, agreeableness, and
emotionality factors: A review of research and theory, Personality and Social Psychology
Review 18 (2014) 139–152.
[12] D. N. Jones, D. L. Paulhus, Duplicity among the dark triad: Three faces of deceit., Journal
of Personality and Social Psychology 113 (2017) 329. doi:1 0 . 1 0 3 7 / p s p p 0 0 0 0 1 3 9 .
[13] D. M. Markowitz, Toward a deeper understanding of prolific lying: Building a profile of
situation-level and individual-level characteristics, Communication Research 50 (2023)
80–105. doi:1 0 . 1 1 7 7 / 0 0 9 3 6 5 0 2 2 2 1 0 9 7 0 4 1 .
[14] D. M. Markowitz, T. R. Levine, It’s the situation and your disposition: A test of two
honesty hypotheses, Social Psychological and Personality Science 12 (2021) 213–224.
doi:1 0 . 1 1 7 7 / 1 9 4 8 5 5 0 6 1 9 8 9 8 9 7 6 .
[15] J. Zhang, R. Zhang, Y. Zhang, G. Yan, The rise of social botnets: Attacks and
countermeasures, IEEE Transactions on Dependable and Secure Computing 15 (2016) 1068–1082.
[16] S. Cresci, A decade of social bot detection, Communications of the ACM 63 (2020) 72–83.
[17] S. Abdelnabi, M. Fritz, Adversarial watermarking transformer: Towards tracing text
provenance with data hiding, in: 2021 IEEE Symposium on Security and Privacy (SP),
2021, pp. 121–140. doi:1 0 . 1 1 0 9 / S P 4 0 0 0 1 . 2 0 2 1 . 0 0 0 8 3 .
[18] S. Gehrmann, H. Strobelt, A. Rush, GLTR: Statistical detection and visualization of
generated text, in: Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations, Association for Computational
Linguistics, Florence, Italy, 2019, pp. 111–116. URL: https://aclanthology.org/P19-3019.
doi:1 0 . 1 8 6 5 3 / v 1 / P 1 9 - 3 0 1 9 .
[19] A. Najee-Ullah, L. Landeros, Y. Balytskyi, S.-Y. Chang, Towards detection of ai-generated
texts and misinformation, in: Socio-Technical Aspects in Security: 11th International
Workshop, STAST 2021, Virtual Event, October 8, 2021, Revised Selected Papers, Springer,
2022, pp. 194–205.
[20] T. Fagni, F. Falchi, M. Gambini, A. Martella, M. Tesconi, TweepFake: About detecting
deepfake tweets, PLoS ONE 16 (2021) e0251415.
[21] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, DetectGPT: Zero-shot
machine-generated text detection using probability curvature, arXiv preprint arXiv:2301.11305
(2023).
[22] A. Garcia-Silva, C. Berrio, J. M. Gomez-Perez, Understanding transformers for bot detection
in Twitter, arXiv preprint arXiv:2104.06182 (2021).
[23] J. Tourille, B. Sow, A. Popescu, Automatic detection of bot-generated tweets, in:
Proceedings of the 1st International Workshop on Multimedia AI against Disinformation, 2022, pp.
44–51.
[24] A. Bakhtin, S. Gross, M. Ott, Y. Deng, M. Ranzato, A. Szlam, Real or fake? learning to
discriminate machine from human generated text, arXiv preprint arXiv:1906.03351 (2019).
[25] V. S. Sadasivan, A. Kumar, S. Balasubramanian, W. Wang, S. Feizi, Can AI-generated text
be reliably detected?, arXiv preprint arXiv:2303.11156 (2023).
[26] K. Krishna, Y. Song, M. Karpinska, J. Wieting, M. Iyyer, Paraphrasing evades detectors of
AI-generated text, but retrieval is an effective defense, arXiv preprint arXiv:2303.13408
(2023).
[27] M. Gambini, T. Fagni, F. Falchi, M. Tesconi, On pushing deepfake tweet detection
capabilities to the limits, in: 14th ACM Web Science Conference 2022, 2022, pp. 154–163.
[28] R. Tang, Y.-N. Chuang, X. Hu, The science of detecting llm-generated texts, arXiv preprint
arXiv:2303.07205 (2023).
[29] J. Hohenstein, M. Jung, AI as a moral crumple zone: The effects of AI-mediated
communication on attribution and trust, Computers in Human Behavior 106 (2020) 106190.
[30] D. Ippolito, D. Duckworth, C. Callison-Burch, D. Eck, Automatic detection of generated
text is easiest when humans are fooled, in: Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, Association for Computational Linguistics,
Online, 2020, pp. 1808–1822. URL: https://www.aclweb.org/anthology/2020.acl-main.164.
doi:10.18653/v1/2020.acl-main.164.
[31] J. P. Dickerson, V. Kagan, V. Subrahmanian, Using sentiment to detect bots on Twitter:
Are humans more opinionated than bots?, in: 2014 IEEE/ACM International Conference
on Advances in Social Networks Analysis and Mining (ASONAM 2014), IEEE, 2014, pp.
620–627.
[32] S. Giorgi, L. Ungar, H. A. Schwartz, Characterizing social spambots by their human
traits, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021,
Association for Computational Linguistics, Online, 2021, pp. 5148–5158. URL: https://
aclanthology.org/2021.findings-acl.457. doi:10.18653/v1/2021.findings-acl.457.
[33] D. M. Markowitz, J. Hancock, J. Bailenson, Linguistic markers of AI-generated text versus
human-generated text: Evidence from hotel reviews and news headlines, 2023. URL:
psyarxiv.com/mnyz8. doi:10.31234/osf.io/mnyz8.
[34] L. Fröhling, A. Zubiaga, Feature-based detection of automated language models: tackling
GPT-2, GPT-3 and Grover, PeerJ Computer Science 7 (2021) e443.
[35] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu, How close is ChatGPT to
human experts? Comparison corpus, evaluation, and detection, 2023. arXiv:2301.07597.
[36] V. Hauch, I. Blandón-Gitlin, J. Masip, S. L. Sporer, Are computers effective lie detectors? A
meta-analysis of linguistic cues to deception, Personality and Social Psychology Review
19 (2015) 307–342. doi:10.1177/1088868314556539.
[37] M. L. Newman, J. W. Pennebaker, D. S. Berry, J. M. Richards, Lying words: Predicting
deception from linguistic styles, Personality and Social Psychology Bulletin 29 (2003)
665–675. doi:10.1177/0146167203029005010.
[38] M. K. Johnson, C. L. Raye, Reality monitoring., Psychological review 88 (1981) 67.
[39] D. M. Markowitz, J. T. Hancock, Linguistic traces of a scientific fraud: The case of Diederik
Stapel, PLoS ONE 9 (2014) e105937. doi:10.1371/journal.pone.0105937.
[40] G. D. Bond, A. Y. Lee, Language of lies in prison: Linguistic classification of prisoners’
truthful and deceptive natural language, Applied Cognitive Psychology 19 (2005) 313–329.
[41] J. T. Hancock, L. E. Curry, S. Goorha, M. Woodworth, On lying and being lied to: A linguistic
analysis of deception in computer-mediated communication, Discourse Processes 45 (2007)
1–23. doi:10.1080/01638530701739181.
[42] D. M. Markowitz, D. J. Griffin, When context matters: How false, truthful, and
genre-related communication styles are revealed in language, Psychology, Crime &amp; Law 26 (2020)
287–310. doi:10.1080/1068316X.2019.1652751.
[43] M. Ott, Y. Choi, C. Cardie, J. T. Hancock, Finding deceptive opinion spam by any stretch
of the imagination, arXiv preprint arXiv:1107.4557 (2011).
[44] K. B. Serota, T. R. Levine, A few prolific liars: Variation in the prevalence of lying, Journal
of Language and Social Psychology 34 (2015) 138–157. doi:10.1177/0261927X14528804.
[45] T. R. Levine, K. B. Serota, F. Carey, D. Messer, Teenagers lie a lot: A further investigation
into the prevalence of lying, Communication Research Reports 30 (2013) 211–220.
doi:10.1080/08824096.2013.806254.
[46] Y. Daiku, K. B. Serota, T. R. Levine, A few prolific liars in Japan: Replication and the effects
of dark triad personality traits, PLoS ONE 16 (2021) e0249815.
[47] J. Sarzyńska, M. Falkiewicz, M. Riegel, J. Babula, D. S. Margulies, E. Nęcka, A. Grabowska,
I. Szatkowska, More intelligent extraverts are more likely to deceive, PLoS ONE 12 (2017)
e0176591. doi:10.1371/journal.pone.0176591.
[48] M. Sap, G. Park, J. Eichstaedt, M. Kern, D. Stillwell, M. Kosinski, L. Ungar, H. A. Schwartz,
Developing age and gender predictive lexica over social media, in: Proceedings of the
2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp.
1146–1151.
[49] G. Park, H. A. Schwartz, J. C. Eichstaedt, M. L. Kern, M. Kosinski, D. J. Stillwell, L. H. Ungar,
M. E. Seligman, Automatic personality assessment through social media language.,
Journal of Personality and Social Psychology 108 (2015) 934.
[50] L. R. Goldberg, J. A. Johnson, H. W. Eber, R. Hogan, M. C. Ashton, C. R. Cloninger, H. G. Gough,
The international personality item pool and the future of public-domain personality
measures, Journal of Research in Personality 40 (2006) 84–96.
[51] S. Giorgi, S. Havaldar, F. Ahmed, Z. Akhtar, S. Vaidya, G. Pan, L. H. Ungar, H. A. Schwartz,
J. Sedoc, Human-centered metrics for dialog system evaluation, arXiv preprint
arXiv:2305.14757 (2023).
[52] M. H. Davis, Measuring individual differences in empathy: Evidence for a multidimensional
approach., Journal of Personality and Social Psychology 44 (1983) 113.
[53] M. Abdul-Mageed, A. Buffone, H. Peng, S. Giorgi, J. Eichstaedt, L. Ungar, Recognizing
pathogenic empathy in social media, Proceedings of the International AAAI Conference
on Web and Social Media 11 (2017) 448–451. URL: https://ojs.aaai.org/index.php/ICWSM/
article/view/14942. doi:10.1609/icwsm.v11i1.14942.
[54] D. B. Yaden, S. Giorgi, M. Jordan, A. Buffone, J. C. Eichstaedt, H. A. Schwartz, L. H. Ungar,
P. Bloom, Characterizing empathy and compassion using computational linguistic analysis,
Emotion (2023). doi:10.1037/emo0001205.
[55] V. Kulkarni, M. L. Kern, D. Stillwell, M. Kosinski, S. Matz, L. Ungar, S. Skiena, H. A. Schwartz,
Latent human traits in the language of social media: An open-vocabulary approach, PLoS
ONE 13 (2018) e0201703.
[56] R. A. Armstrong, When to use the Bonferroni correction, Ophthalmic and Physiological
Optics 34 (2014) 502–508.
[57] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine
Learning Research 12 (2011) 2825–2830.
[58] H. A. Schwartz, S. Giorgi, M. Sap, P. Crutchley, L. Ungar, J. Eichstaedt, DLATK: Differential
language analysis toolkit, in: Proceedings of the 2017 conference on empirical methods in
natural language processing: System demonstrations, 2017, pp. 55–60.
[59] G. Abercrombie, A. C. Curry, T. Dinkar, Z. Talat, Mirages: On anthropomorphism in
dialogue systems, arXiv preprint arXiv:2305.09800 (2023).
[60] A. Danielescu, S. A. Horowit-Hendler, A. Pabst, K. M. Stewart, E. M. Gallo, M. P. Aylett,
Creating inclusive voices for the 21st century: A non-binary text-to-speech for
conversational assistants, in: Proceedings of the 2023 CHI Conference on Human Factors in
Computing Systems, 2023, pp. 1–17.
[61] J. W. Pennebaker, R. L. Boyd, K. Jordan, K. Blackburn, The development and psychometric
properties of LIWC2015, 2015.
[62] L. Zhuang, L. Wayne, S. Ya, Z. Jun, A robustly optimized BERT pre-training approach with
post-training, in: Proceedings of the 20th Chinese National Conference on Computational
Linguistics, Chinese Information Processing Society of China, Huhhot, China, 2021, pp.
1218–1227. URL: https://aclanthology.org/2021.ccl-1.108.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Hotel Data</title>
      <p>The data collected by Ott et al. [43] used a list of 20 hotels in Chicago, IL, USA. We use this
same list to query ChatGPT. The hotels are as follows: Affinia, Allegro, Amalfi, Ambassador,
Conrad, Fairmont, Hardrock, Hilton, Homewood, Hyatt, Intercontinental, James, Knickerbocker,
Monaco, Omni, Palmer, Sheraton, Sofitel, Swissotel, and Talbott.</p>
    </sec>
    <sec id="sec-10">
      <title>B. Additional AI Generated Reviews</title>
      <p>To test how ChatGPT generates hotel reviews with default API parameters, which are less
creative, we also prompted ChatGPT with a temperature of 1, a frequency penalty of 0.0, and a
presence penalty of 0.0, using the same prompts as in the main experiments; we call this data
set of 400 reviews GPT3.5default. We also prompted the ChatGPT3.5 and ChatGPT4 GUI interfaces
directly with “Write me 20 positive hotel reviews for the &lt;HOTEL&gt; in Chicago. Each review
must be around 120 words long.” The difference from the main setup is that we directly prompt
for the generation of 20 hotel reviews in a single query. The reviews in both cases were much
shorter than 120 words, with an average review length of 52.8 words for ChatGPT3.5 (GPT3.5chat)
and 62.7 words for ChatGPT4 (GPT4chat).</p>
      <p>Table A1 shows that the default parameters for ChatGPT resulted in larger KS distances (i.e.,
the distributions are less similar), which is reasonable since ChatGPT’s default parameters are
designed to be less creative and more deterministic than the settings used in the main
experiments. GPT-4 produced the smallest KS distances, often showing distances similar to
TruthH vs DeceptH, suggesting that GPT-4 expresses human traits more similarly to actual
human writing than ChatGPT3.5 does.
Table A1
KS test results comparing GPT3.5chat, GPT4chat, and GPT3.5default against truthful human
(TruthH) text and deceptive human (DeceptH) text. Bonferroni-corrected significance levels:
∗ p &lt; 0.05, † p &lt; 0.01, ‡ p &lt; 0.001.
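The KS distances in Table A1 are two-sample Kolmogorov–Smirnov statistics computed over trait-score distributions. A dependency-free sketch of that statistic (the trait scores below are made-up illustrative numbers, not values from the study):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum absolute difference between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(sorted_sample, x):
        # Fraction of the sample with values <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

# Hypothetical per-review trait scores: the human scores spread widely,
# the AI scores cluster tightly (less diverse), giving a larger distance.
human_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
ai_scores = [0.45, 0.5, 0.5, 0.5, 0.55]
distance = ks_statistic(human_scores, ai_scores)
```

A larger KS distance thus directly reflects the lower diversity of AI trait expression noted in the abstract.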
Since each model was fit using a small number of features, chosen a priori, it is natural to wonder
how a larger, open-vocabulary feature space would perform (e.g., contextual embeddings). Thus,
we considered several baselines to establish a predictive ceiling, which may reveal how difficult
this classification task is. These baselines included (a) n-grams: 1-, 2-, and 3-grams extracted from
the hotel reviews; (b) meta-linguistic features: the total number of unigrams per review and
the average unigram length; (c) Linguistic Inquiry and Word Count (LIWC) [61]: a manually
curated dictionary of 73 categories; and (d) contextual embeddings via RoBERTa-large [62]:
embeddings extracted from the penultimate hidden layer (1024 dimensions) of RoBERTa-large
for each review. The language model was not further fine-tuned on our data set; instead,
the classifier was trained on the 1024 dimensions obtained from RoBERTa-large out of the box.
Again, a logistic regression model was used within a 5-fold cross-validation framework. Table
A3 shows these results.
</p>
      <p>Table A3
Baseline classification results (AUC) for TruthH vs DeceptH: 0.981, 0.998, 0.998, and 0.999.</p>
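The baselines above all follow the same recipe: a logistic regression classifier scored by 5-fold cross-validated AUC with scikit-learn [57]. A minimal sketch on synthetic stand-in features (random vectors with shifted class means, not the actual n-gram, LIWC, or RoBERTa features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 200 "truthful" and 200 "deceptive" reviews, each a 13-dimensional
# feature vector; the class means are shifted so the classes separate.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 13)),
               rng.normal(1.5, 1.0, size=(200, 13))])
y = np.array([0] * 200 + [1] * 200)

# 5-fold cross-validated AUC, mirroring the evaluation described above.
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc").mean()
```

Swapping X for the actual n-gram counts, LIWC category scores, or RoBERTa-large embeddings reproduces the baseline setup; the trait-only models in the main text use the same classifier and evaluation.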
    </sec>
  </body>
  <back>
  </back>
</article>