The Impact of Prompts on Zero-Shot Detection of AI-Generated Text

Kaito Taguchi 1,2, Yujie Gu 1 and Kouichi Sakurai 1
1 Kyushu University, Fukuoka, Japan
2 Skydisc Inc., Fukuoka, Japan
k-taguchi@skydisc.jp (K. Taguchi); gu@inf.kyushu-u.ac.jp (Y. Gu); sakurai@inf.kyushu-u.ac.jp (K. Sakurai)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
In recent years, there have been significant advancements in the development of Large Language Models (LLMs). Meanwhile, their potential for misuse, such as generating fake news and committing plagiarism, has raised significant concerns. To address this issue, detectors have been developed to evaluate whether a given text is human-generated or AI-generated. Among others, zero-shot detectors stand out as effective approaches that do not require additional training data and are often likelihood-based. In chat-based applications, users commonly input prompts and utilize the AI-generated texts. However, zero-shot detectors typically analyze these texts in isolation, neglecting the impact of the original prompts. This approach may lead to a discrepancy in likelihood assessments between the text generation phase and the detection phase. So far, it has remained unverified how the presence or absence of prompts impacts detection accuracy for zero-shot detectors. In this paper, we introduce an evaluative framework to empirically analyze the impact of prompts on the detection accuracy of AI-generated text. We assess various zero-shot detectors using both white-box detection, which leverages the prompt, and black-box detection, which operates without prompt information. Our experiments reveal the significant influence of prompts on detection accuracy. Remarkably, compared with black-box detection without prompts, the white-box methods using prompts demonstrate a significant increase in AUC across all zero-shot detectors tested, which calls for attention to the impact of prompts on zero-shot detectors. Code is available: https://github.com/kaito25atugich/Detector.

Keywords
zero-shot detector, AI-generated text, prompt, LLM

1. Introduction

Recent years have seen significant advancements in the development of Large Language Models (LLMs) [1, 2, 3], and their practical applications have become widespread. Meanwhile, their potential misuse has raised significant concerns; in particular, the generation of fake news and plagiarism using LLMs are notable issues. Detectors that evaluate whether a given text is human-generated or AI-generated serve as a defense mechanism against such misuse.

Detectors for AI-generated text can be broadly classified into three categories: zero-shot detectors leveraging statistical properties [4, 5, 6, 7, 8, 9, 10, 11], detectors employing supervised learning [12, 13, 14, 15], and detectors utilizing watermarking [16, 17].

Zero-shot detectors, such as DetectGPT [5], do not require additional training and are in many cases designed around likelihood-based scores. In other words, zero-shot detection is carried out by replicating the likelihood computed at the generation phase. When using LLMs, we usually input a prompt and utilize the generated output. At the detection phase, however, reproducing the likelihood is expected to become challenging due to the absence of the contextual information provided by the prompt. This may result in differences in likelihood evaluations between the text generation and detection stages. A summary of zero-shot detectors is given in Table 1.

In this paper, we assess to what extent this phenomenon affects likelihood-based zero-shot detectors. First, we propose two settings for detecting AI-generated text using zero-shot detectors: white-box detection, which leverages the prompt used to generate the text, and black-box detection, which detects AI-generated text without relying on the prompt. Next, we conduct extensive experiments and demonstrate a decrease in detection accuracy for existing zero-shot detectors in black-box detection.

Our results show a significant difference in the performance of zero-shot detectors on AI-generated text with and without prompts, highlighting the need to consider the impact of prompts on these detectors. These results further point out that likelihood-based zero-shot detectors face challenges for practical use. Additionally, the experimental results demonstrate that the fast zero-shot detectors are more robust than the other detectors due to their higher sampling rate.

Table 1
Summary of Zero-shot Detectors

Method — Summary
Log-likelihood — Detect using the log-likelihood of the given text.
Rank — Calculate the likelihood of the given text, convert the likelihood of each token into a rank over the entire vocabulary, and use it for detection.
Log-Rank — Calculate the likelihood of the given text, convert the likelihood of each token into a rank over the entire vocabulary, and apply the logarithm to these ranks for detection.
Entropy — Detect by calculating entropy using the likelihood of the tokens in the vocabulary.
DetectGPT [5] — Using a masked language model, randomly replace words in the text; observe the likelihood of the replaced and original texts under a scoring model and use the change for detection.
FastDetectGPT [6] — Replace the mask model in DetectGPT with an auto-regressive model similar to the scoring model; sample replacement words randomly from the vocabulary and calculate scores in the same manner as DetectGPT.
LRR [7] — Detect using the ratio of log-likelihood to log-rank.
NPR [7] — Similar to DetectGPT, but utilize logarithmic ranks rather than logarithmic likelihood in the score calculation.
Binoculars [8] — Utilize two models trained with slightly different amounts of data, calculate the perplexity of each, and leverage the difference in perplexity for detection.

2. Related work

In the context of intentionally undermining detection accuracy using prompts, two main categories of studies can be identified. The first category involves the deliberate crafting of prompts with malicious intent to reduce detection accuracy. The second category encompasses research that employs tasks with benign prompts, devoid of malicious intent.

2.1. Malicious prompts

First, we review studies that concentrate on the deliberate creation of malicious prompts.

In [19], Koike et al. proposed OUTFOX, utilizing in-context learning with a problem statement 𝑃, human-generated text 𝐻, and AI-generated text 𝐴. By constructing prompts such as "𝑝𝑖 ∈ 𝑃 → ℎ𝑖 ∈ 𝐻 is the correct label by humans, and 𝑝𝑖 ∈ 𝑃 → 𝑎𝑖 ∈ 𝐴 is the correct label by AI," they aim to generate text for a given problem statement in such a way that the generated text aligns with human-authored content. This approach makes the detection of artificially generated content challenging.

Shi et al. conducted an attack on OpenAI's detector [22] by employing an instructional prompt, confirming a decrease in detection accuracy [18]. The instructional prompt adds to the original input 𝑋 a reference text 𝑋𝑟𝑒𝑓 and an instructional text 𝑋𝑖𝑛𝑠 with characteristics that reduce detection accuracy, thereby undermining the detector.

In [20], Lu et al. proposed SICO, a method that lowers detection accuracy by instructing the model within the prompt to mimic the writing style of human-authored text and iteratively updating the content of the instructions.

Kumarage et al. [21] proposed an attack named Soft Prompt, which uses reinforcement learning to generate a vector that induces misclassification by detectors. This soft-prompt vector is then used as input against DetectGPT and RoBERTa-based detectors [12], demonstrating a decrease in detection accuracy [21].

2.2. Benign prompts

We now review cases involving tasks with benign prompts.

Liu et al. conducted experiments using the CheckGPT model, an approach based on supervised learning. Their findings indicate that when different prompts are used, detection accuracy decreases by approximately 7%, although it remains above 90% in all cases [15].
Dou et al. [14] performed experiments envisioning the utilization of LLMs by students. In their study, they demonstrated a decrease in DetectGPT's detection accuracy when prompts were employed.

Hans et al. [8] pointed out the difficulty of reproducing likelihoods depending on the presence or absence of prompts, using unique prompts like "Write about a capybara astronomer." In response to this capybara problem, they proposed Binoculars.

We assume the performance of benign tasks such as summarization. Therefore, unlike malicious prompt attacks, there is no need to deliberately choose prompts that lower the detector's accuracy when constructing prompts, nor is there a requirement to collect pairs of data for in-context learning.

On the other hand, Dou et al. [14] experimentally demonstrated unintended decreases in detection accuracy. However, they did not delve into why the accuracy decreases, nor did they examine other likelihood-based zero-shot detectors. Additionally, Hans et al. [8] did not provide specific verification regarding the impact of a detector knowing or not knowing the prompt on detection accuracy. Therefore, the resilience of Binoculars to changes in likelihood due to prompts has not been adequately assessed. The supervised-learning-based approach [15] is excluded from our experiments in this context.

In this study, we demonstrate that even in ordinary tasks such as summarization, the presence or absence of prompts unintentionally leads to a decrease in accuracy when using likelihood-based zero-shot detectors.

3. Preliminary

3.1. Language model

A model that captures the probability of generating words or sentences is referred to as a language model. Let 𝑉 represent the vocabulary. The language model for a word sequence of length 𝑛, denoted as 𝑥1, 𝑥2, …, 𝑥𝑛 where 𝑥𝑖 ∈ 𝑉, is defined by (1):

  P(x_1, x_2, …, x_n) = ∏_{t=1}^{n} P(x_t | x_1, …, x_{t-1})    (1)

3.2. Existing zero-shot detectors

We provide a brief introduction to existing zero-shot detectors, summarized in Table 1. Here, P_{Tθ} refers to the language model utilized for detection. The vocabulary 𝑉 is composed of 𝐶 tokens. The input text 𝑆 is composed of 𝑁 tokens, represented as S = {S_1, S_2, …, S_N}, and the token sequence from S_1 to S_{i-1} is denoted as S_{<i}.

3.2.1. Log-Likelihood

The log-likelihood method utilizes the likelihood of the tokens composing a text for detection. The score, shown in (2), is the average of the log-likelihoods of the tokens constituting the given text:

  Log-likelihood = (1 / (N-1)) Σ_{i=2}^{N} log P_{Tθ}(S_i | S_{<i})    (2)

3.2.2. Entropy

Entropy is a method that utilizes the entropy of the vocabulary distribution for detection. The formula is shown in (3): the entropy is calculated from the likelihood of every token in the vocabulary and averaged across contexts:

  Entropy = -(1 / (N-1)) Σ_{i=2}^{N} Σ_{j=1}^{C} P_{Tθ}(j | S_{<i}) log P_{Tθ}(j | S_{<i})    (3)

3.2.3. Rank

Rank is a method that utilizes the position of each token's likelihood among the vocabulary when sorted. The formula is presented in (4): the score is the average position of the tokens constituting the given text. The function sort sorts the given array in descending order, and index, given an array and an element, returns the index of that element within the array:

  Rank = (1 / (N-1)) Σ_{i=2}^{N} index(sort(log P_{Tθ}(· | S_{<i})), S_i)    (4)

3.2.4. DetectGPT

The language model aims to maximize likelihood during text generation, whereas humans create text independently of likelihood. DetectGPT focuses on this phenomenon and posits the hypothesis that, by rewriting certain words, the likelihood of the text decreases for AI-generated content but can either increase or decrease for human-generated content [5].

The overview of DetectGPT is presented in Figure 1. The replacement process is achieved by applying a mask model P_M, such as T5 [24], to some of the words contained in the given text 𝑆. This operation is repeated for a total of 𝑘 iterations, and the average log-likelihood of the obtained 𝑘 replacement texts is then computed. The score (5) is the difference between the log-likelihood of the original text and the average log-likelihood of the acquired replacement texts. It is permissible to standardize by dividing by the standard deviation of the log-likelihood of the replacement texts. If the score is above a threshold ε, the text is deemed to be AI-generated:

  DetectGPT = (log P_{Tθ}(S) - m̃) / σ̃_S    (5)

where

  m̃ = (1/k) Σ_{i=1}^{k} log P_{Tθ}(S̃_i),
  σ̃_S² = (1/(k-1)) Σ_{i=1}^{k} (log P_{Tθ}(S̃_i) - m̃)²,

and S̃_i ∼ P_M(S) denote the mean, the sample variance, and a sample from the mask model applied to 𝑆, respectively.

Figure 1: DetectGPT overview. (Words in 𝑠 are replaced by the mask model P_M to produce s̃_1, …, s̃_k; the scoring model P_{Tθ} computes the likelihood of the original and each replacement, yielding the mean and variance used in (5).)

3.2.5. FastDetectGPT

In [6], Bao et al. highlighted challenges in DetectGPT's use of different models for substitution and score calculation, as well as the cost of requiring model access for each substitution iteration. In response, FastDetectGPT is a modified detector that reduces access to the model, addressing the cost issue while still enabling substitutions. Although the methodology involves hypotheses similar to DetectGPT's, there is no fundamental change: it still operates on the assumption that AI-generated text lies near the maximum likelihood, whereas human-generated text does not.

We present the overall architecture of FastDetectGPT in Figure 2. In FastDetectGPT, the substitution process is replaced with an alternative method that does not rely on a mask model. Like the detection model, it utilizes an autoregressive model, and P_{Tθ} and P_{Uθ} can be the same. The substitution for the 𝑖-th word involves randomly drawing a token from the next-word distribution conditioned on the context up to the (𝑖-1)-th word of the input text and replacing the word with the chosen one. In other words, performing this substitution 𝑁 times yields a substituted text S̃, and by sampling during word selection, the replacement process generates 𝑘 substitution texts in a single access. The subsequent score calculation follows the same procedure as DetectGPT and is therefore omitted.

Figure 2: FastDetectGPT and sampling overview. (For 𝑠 = "I like apples", each position is replaced by a token sampled from the vocabulary 𝑉 given the preceding context, e.g. yielding s̃_1 = "I am dogs"; the likelihoods P_{Tθ}(s̃_i) are then scored as in DetectGPT.)

3.2.6. LRR & NPR

LRR (Log-Likelihood Log-Rank Ratio) and NPR (Normalized Perturbed log-Rank) are classical log-rank enhancement techniques proposed by Su et al. [7]. Both methods have simple configurations. LRR literally takes the ratio of log-likelihood to log-rank, as expressed in (6), where r_θ represents the rank when using P_{Tθ}:

  LRR = - ( Σ_{i=1}^{t} log P_{Tθ}(S_i | S_{<i}) ) / ( Σ_{i=1}^{t} log r_θ(S_i | S_{<i}) )    (6)

NPR, like DetectGPT, performs the substitution of words in the text 𝑘 times. It takes the ratio of the average log-rank of the obtained substituted texts to the log-rank of the original text, as defined in (7):

  NPR = ( (1/k) Σ_{p=1}^{k} log r_θ(S̃_p) ) / log r_θ(S)    (7)

3.2.7. Binoculars

Hans et al. proposed Binoculars, a detection method utilizing two closely related language models, Falcon-7B [26] and Falcon-7B-Instruct, by employing a metric called cross-perplexity [8]. The overall framework is illustrated in Figure 3.

Let the first model be denoted as M_1 (such as Falcon-7B) and the second as M_2 (such as Falcon-7B-Instruct). Using M_1, we calculate the log perplexity as shown in (8):

  log PPL_{M_1}(S) = -(1/N) Σ_{i=1}^{N} log M_1(S_i | S_{<i})    (8)

Next, using M_1 and M_2, we calculate the cross-perplexity as shown in (9), where · represents the dot product:

  log X-PPL_{M_1,M_2}(S) = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} M_1(j | S_{<i}) · log M_2(j | S_{<i})    (9)

The Binoculars score is determined by (10):

  B_{M_1,M_2}(S) = log PPL_{M_1}(S) / log X-PPL_{M_1,M_2}(S)    (10)

Figure 3: Binoculars overview. (M_1 yields log PPL_{M_1}; M_1 and M_2 together yield log X-PPL_{M_1,M_2}; the score is their ratio.)

4. Proposal

In this study, we propose a detection flow to investigate the impact of prompts on likelihood. Before presenting the experimental setup, we introduce an additional detection method.

4.1. FastNPR

Word replacements in NPR are performed using a masked model. In this research, aiming for cost reduction, we employ FastNPR, a method that replaces masked-model word replacement with sampling, akin to FastDetectGPT.
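For concreteness, the per-position statistics behind scores (2)–(4) can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in our experiments: it assumes the predictive distributions P_{Tθ}(· | S_{<i}) have already been extracted from a scoring model, and the function names are ours.

```python
import math

def log_likelihood_score(token_ids, dists):
    # Eq. (2): average log-likelihood of the observed tokens.
    # dists[i] is the model's distribution P(. | S_<i) over the C-token
    # vocabulary; token_ids[i] is the observed token S_i at that position.
    return sum(math.log(d[t]) for t, d in zip(token_ids, dists)) / len(token_ids)

def rank_score(token_ids, dists):
    # Eq. (4): average rank of each observed token when the vocabulary
    # is sorted by descending likelihood (rank 0 = most likely token here).
    ranks = []
    for t, d in zip(token_ids, dists):
        order = sorted(range(len(d)), key=lambda j: d[j], reverse=True)
        ranks.append(order.index(t))
    return sum(ranks) / len(ranks)

def entropy_score(dists):
    # Eq. (3): entropy of the predictive distribution, averaged over positions.
    return -sum(sum(p * math.log(p) for p in d if p > 0) for d in dists) / len(dists)
```

Log-Rank, LRR, and NPR follow the same pattern, applying the logarithm to a (1-based) rank instead of, or in addition to, the probability.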
4.2. Detection methods

We explain the detection methodology. Let 𝑥 represent the text to be detected and, if 𝑥 is AI-generated, let 𝑝 denote the prompt used for its generation. Detection can be categorized into two patterns: black-box detection and white-box detection. An overview is presented in Figure 4.

Black-box detection occurs when the detector is unaware of prompt information, essentially mirroring existing detection methods. In this scenario, only the content of 𝑥 is provided to the detector.

White-box detection, on the other hand, involves the detector having knowledge of prompt information. For human-generated text, only 𝑥 is input. In the case of AI-generated text, the input consists of 𝑝 + 𝑥. It is important to note that, in white-box detection, the prompt is used solely for likelihood calculation and is not directly included in the score computation.

Figure 4: Proposed detection methods overview. (For an AI-generated text such as "What is 1+1? 1+1 equals 2.", black-box detection computes score("1+1 equals 2."), whereas white-box detection computes score("1+1 equals 2." | "What is 1+1?").)

5. Experiment

5.1. Configuration

To begin, we utilize GPT2-XL [23] as the detection model for all detectors excluding Binoculars. Due to GPU constraints, Binoculars employs the pre-trained and instruct-tuned Phi-1.5 [27] instead of Falcon. For DetectGPT and NPR, we generate five replacement sentences for 10% of the entire text, while the Fast series generates 10,000 replacement sentences. T5-Large [24] is used for word replacement in DetectGPT and NPR, while the Fast series employs GPT2-XL, the same model used for detection. We use the XSum dataset [28]: for human-generated text, we extract 200 samples from XSum, and for AI-generated text, we employ the Llama 2 7B Chat model [25], generating up to 200 tokens. The prompt used is "Would you summarize the following sentences, please? text".

5.2. Result

As evident from the results in Table 2, white-box detection exhibits higher accuracy, while black-box detection shows lower accuracy. As anticipated, modifying likelihood through prompts leads to a decrease in the detection accuracy of likelihood-based detectors. Notably, there is a consistent decrease of 0.1 or more across all methods, highlighting a significant observation.

Table 2
Detection of Generated Summaries: Discrepancies Between Cases with and Without Prompts (AUC)

Method          Black-box  White-box
DetectGPT       0.453      1.000
FastDetectGPT   0.819      0.958
LRR             0.532      0.995
NPR             0.560      0.934
FastNPR         0.768      0.993
Entropy         0.330      0.978
Log-likelihood  0.474      0.998
Rank            0.432      0.977
Log-Rank        0.485      0.999
Binoculars      0.877      0.999

Binoculars and the Fast series detectors demonstrate robustness compared to other methods. In particular, the Fast series maintains the same score calculation as the conventional methods, suggesting that the robustness stems from the sampling process. For further verification, we conduct additional experiments: we investigate the differences in detection accuracy when varying the replacement ratio, indicating the extent to which tokens in the text are replaced, and the sample size, representing the number of replacement sentences.

Table 3
Effect of Substitution Rate (SR) and Sample Size (SS) Variation on AUC (DetectGPT)

Method         SR    SS     AUC
FastDetectGPT  10%   5      0.640
FastDetectGPT  20%   5      0.697
FastDetectGPT  100%  5      0.779
FastDetectGPT  10%   10     0.704
FastDetectGPT  20%   10     0.739
FastDetectGPT  100%  10     0.821
FastDetectGPT  100%  10000  0.819
DetectGPT      10%   5      0.453
DetectGPT      20%   5      0.522
DetectGPT      30%   5      0.490
DetectGPT      10%   10     0.446
DetectGPT      30%   10     0.446

Particularly in recent years, there is a trend toward practical applications, emphasizing high true positive rates at low false positive rates, suggesting that at least an AUC in the late 0.9s would be necessary [30, 8]. Furthermore, the lack of improvement in detection accuracy with DetectGPT and NPR may be attributed to the limited number of substitutable tokens.
DetectGPT and NPR require the use of a masked language model to replace plausible tokens, making replacement not always feasible, especially at higher replacement percentages. Therefore, we primarily vary the replacement ratio in the Fast series to conduct the investigation.

The results for DetectGPT are presented in Table 3, and the results for NPR are shown in Table 4. From these results, it is evident that increasing the replacement ratio and sample size helps mitigate the decrease in detection accuracy. This observation is similar to Chakraborty et al.'s assertion that increasing the sample size can enable detection if the distributions differ slightly [29]. However, in our validation, the improvement in accuracy plateaus at around 10 samples, reaching a maximum AUC of approximately 0.8, which is not considered high.

Table 4
Effect of Substitution Rate (SR) and Sample Size (SS) Variation on AUC (NPR)

Method   SR    SS     AUC
FastNPR  10%   5      0.628
FastNPR  20%   5      0.661
FastNPR  100%  5      0.747
FastNPR  10%   10     0.647
FastNPR  20%   10     0.715
FastNPR  100%  10     0.750
FastNPR  100%  10000  0.763
NPR      10%   5      0.560
NPR      20%   5      0.590
NPR      30%   5      0.577
NPR      10%   10     0.589
NPR      30%   10     0.588

6. Limitation and future work

6.1. Hypotheses for zero-shot detectors

While our investigation has focused solely on prompts, similar phenomena could potentially be observed with other factors. For instance, variations in temperature or repetition penalty between the generation and detection stages might introduce differences in the selected tokens, making likelihood-based detection challenging. Generalizing these observations, we hypothesize that any act that fails to replicate the likelihood during language generation could undermine the detection accuracy of zero-shot detectors relying on likelihood from next-word prediction.

6.2. Tasks

While our investigation has focused on summary text generation, there are several other potential tasks to consider, such as paraphrase generation, story generation, and translation. It is plausible that detection accuracy could also decrease in these common tasks. Since these tasks may be utilized without malicious intent, it is crucial to conduct similar evaluations for them.

6.3. Number of parameters

In this study, each detection method utilized a language model of approximately 1 billion parameters. It would be of interest to investigate whether increased robustness can be observed when experimenting with larger language models. Conversely, there are experimental studies that have demonstrated the ability of smaller language models to achieve a higher likelihood for AI-generated texts across a broader range of language models [31]. Considering these findings, conducting experiments with smaller language models and verifying whether there are differences in robustness could also provide valuable insights.

6.4. Relationship with supervised learning detectors

Even when using supervised learning, it has been noted that generated text from prompt-based tasks may exhibit decreased detection accuracy [15]. However, there is a possibility that these models could be more robust than zero-shot detectors. For instance, RADAR [13] achieved an AUC of 0.939 in the task used in this experiment, whereas the RoBERTa-large detector [12] had an AUC of 0.767. This suggests that detectors robust against paraphrase attacks might demonstrate similarly robust results in other tasks.

6.5. Relationship with watermarking

Watermarking techniques utilize statistical methods for verification [16]. Since these methods are based on likelihood during both generation and verification, a failure to reproduce likelihood during the verification stage may lead to a decrease in accuracy. On the other hand, watermarking techniques robust against paraphrase attacks have emerged [17]. These methods may exhibit robustness against prompts as well.

6.6. Towards resilient zero-shot detectors

Currently, many methods perform likelihood-based detection. Combining these methods with other sophisticated techniques may lead to more robust detection. One such approach is intrinsic dimension [11]: the minimum dimension needed to represent a given text. Tulchinskii et al. propose a detector based on persistent homology to estimate the intrinsic dimension and use it as a score. However, this method requires a constant length of text and was not applicable in our experiment. It would be interesting to explore the application of this method in experiments involving longer texts.

Approaches utilizing representations obtained with masked language models, including intrinsic dimension, calculate likelihood in a different way from the detectors used in our experiment, which are based on autoregressive language models. Combining these elements may lead to the development of a more robust zero-shot detector.

7. Conclusion

In this paper, we experimentally demonstrated a significant gap in the detection of AI-generated text with and without prompts for likelihood-based zero-shot detectors. These findings call for attention to the impact of prompts on enhancing zero-shot detectors in practical applications.

References

[1] OpenAI. (2023). GPT-4 technical report. arXiv.
[2] Microsoft. Microsoft Copilot. Retrieved October 31, 2023, from https://adoption.microsoft.com/ja-jp/copilot/.
[3] Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J. B., Yu, J., ... & Ahn, J. (2023). Gemini: A family of highly capable multimodal models. arXiv:2312.11805.
[4] Gehrmann, S., Strobelt, H., & Rush, A. (2019). GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 111–116).
[5] Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., & Finn, C. (2023). DetectGPT: Zero-shot machine-generated text detection using probability curvature. In ICML 2023.
[6] Bao, G., Zhao, Y., Teng, Z., Yang, L., & Zhang, Y. (2023). Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. arXiv:2310.05130.
[7] Su, J., Zhuo, T. Y., Wang, D., & Nakov, P. (2023). DetectLLM: Leveraging log rank information for zero-shot detection of machine-generated text. arXiv:2306.05540.
[8] Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., ... & Goldstein, T. (2024). Spotting LLMs with Binoculars: Zero-shot detection of machine-generated text. arXiv:2401.12070.
[9] Liu, S., Liu, X., Wang, Y., Cheng, Z., Li, C., Zhang, Z., ... & Shen, C. (2024). Does DetectGPT fully utilize perturbation? Selective perturbation on model-based contrastive learning detector would be better. arXiv:2402.00263.
[10] Sasse, K., Barham, S., Kayi, E. S., & Staley, E. W. (2024). To burst or not to burst: Generating and quantifying improbable text. arXiv:2401.15476.
[11] Tulchinskii, E., Kuznetsov, K., Kushnareva, L., Cherniavskii, D., Barannikov, S., Piontkovskaya, I., ... & Burnaev, E. (2023). Intrinsic dimension estimation for robust detection of AI-generated texts. arXiv:2306.04723.
[12] Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., ... & Wang, J. (2019). Release strategies and the social impacts of language models. arXiv:1908.09203.
[13] Hu, X., Chen, P. Y., & Ho, T. Y. (2023). RADAR: Robust AI-text detection via adversarial learning. arXiv:2307.03838.
[14] Dou, Z., Guo, Y., Chang, C. C., Nguyen, H. H., & Echizen, I. (2024). Enhancing robustness of LLM-synthetic text detectors for academic writing: A comprehensive analysis. arXiv:2401.08046.
[15] Liu, Z., Yao, Z., Li, F., & Luo, B. (2023). Check me if you can: Detecting ChatGPT-generated academic writing using CheckGPT. arXiv:2306.05524.
[16] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. In ICML 2023.
[17] Ren, J., Xu, H., Liu, Y., Cui, Y., Wang, S., Yin, D., & Tang, J. (2023). A robust semantics-based watermark for large language model against paraphrasing. arXiv:2311.08721.
[18] Shi, Z., Wang, Y., Yin, F., Chen, X., Chang, K. W., & Hsieh, C. J. (2023). Red teaming language model detectors with language models. arXiv:2305.19713.
[19] Koike, R., Kaneko, M., & Okazaki, N. (2023). OUTFOX: LLM-generated essay detection through in-context learning with adversarially generated examples. arXiv:2307.11729.
[20] Lu, N., Liu, S., He, R., & Tang, K. (2023). Large language models can be guided to evade AI-generated text detection. arXiv:2305.10847.
[21] Kumarage, T., Sheth, P., Moraffah, R., Garland, J., & Liu, H. (2023). How reliable are AI-generated-text detectors? An assessment framework using evasive soft prompts. arXiv:2310.05095.
[22] OpenAI. (2023). New AI classifier for indicating AI-written text. Retrieved November 30, 2023.
[23] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
[24] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
[25] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., ... & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
[26] Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., ... & Penedo, G. (2023). The Falcon series of open language models. arXiv:2311.16867.
[27] Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., & Lee, Y. T. (2023). Textbooks are all you need II: phi-1.5 technical report. arXiv:2309.05463.
[28] Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 1797–1807).
[29] Chakraborty, S., Bedi, A. S., Zhu, S., An, B., Manocha, D., & Huang, F. (2023). On the possibilities of AI-generated text detection. arXiv:2304.04736.
[30] Krishna, K., Song, Y., Karpinska, M., Wieting, J., & Iyyer, M. (2023). Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. arXiv:2303.13408.
[31] Mireshghallah, F., Mattern, J., Gao, S., Shokri, R., & Berg-Kirkpatrick, T. (2023). Smaller language models are better black-box machine-generated text detectors. arXiv:2305.09859.