The Impact of Prompts on Zero-Shot Detection of AI-Generated Text

Kaito Taguchi 1,2, Yujie Gu 1 and Kouichi Sakurai 1
1 Kyushu University, Fukuoka, Japan
2 Skydisc Inc., Fukuoka, Japan
k-taguchi@skydisc.jp (K. Taguchi); gu@inf.kyushu-u.ac.jp (Y. Gu); sakurai@inf.kyushu-u.ac.jp (K. Sakurai)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
In recent years, there have been significant advancements in the development of Large Language Models (LLMs). Meanwhile, their potential for misuse, such as generating fake news and committing plagiarism, has raised significant concerns. To address this issue, detectors have been developed to evaluate whether a given text is human-generated or AI-generated. Among others, zero-shot detectors stand out as effective approaches that do not require additional training data and are often likelihood-based. In chat-based applications, users commonly input prompts and utilize the AI-generated texts. However, zero-shot detectors typically analyze these texts in isolation, neglecting the impact of the original prompts. This approach may lead to a discrepancy in likelihood assessments between the text generation phase and the detection phase. So far, it has remained unverified how the presence or absence of prompts impacts detection accuracy for zero-shot detectors. In this paper, we introduce an evaluative framework to empirically analyze the impact of prompts on the detection accuracy of AI-generated text. We assess various zero-shot detectors using both white-box detection, which leverages the prompt, and black-box detection, which operates without prompt information. Our experiments reveal the significant influence of prompts on detection accuracy. Remarkably, compared with black-box detection without prompts, the white-box methods using prompts demonstrate a significant increase in AUC across all zero-shot detectors tested, which calls for attention to the impact of prompts on zero-shot detectors. Code is available: https://github.com/kaito25atugich/Detector.

Keywords
zero-shot detector, AI-generated text, prompt, LLM

1. Introduction

Recent years have seen significant advancements in the development of Large Language Models (LLMs) [1, 2, 3], and their practical applications have become widespread. Meanwhile, their potential misuse has raised significant concerns; in particular, the generation of fake news and plagiarism using LLMs are notable issues. Detectors that evaluate whether a given text is human-generated or AI-generated serve as a defense mechanism against such misuse.

Detectors for AI-generated text can be broadly classified into three categories: zero-shot detectors leveraging statistical properties [4, 5, 6, 7, 8, 9, 10, 11], detectors employing supervised learning [12, 13, 14, 15], and detectors utilizing watermarking [16, 17].

Zero-shot detectors, such as DetectGPT [5], do not require additional training and are in many cases designed around likelihood-based scores. In other words, zero-shot detection is carried out by replicating the likelihood computed at the generation phase. When using LLMs, we usually input a prompt and utilize the generated output. At the detection phase, however, reproducing the likelihood is expected to become challenging due to the absence of the contextual information provided by the prompt. This may result in differences in likelihood evaluations between the text generation and detection stages. A summary of zero-shot detectors is given in Table 1.

In this paper, we assess to what extent this phenomenon affects likelihood-based zero-shot detectors. First, we propose two settings for detecting AI-generated text using zero-shot detectors: white-box detection, which leverages the prompt used to generate the text, and black-box detection, which detects AI-generated text without relying on the prompt. Next, we conduct extensive experiments and demonstrate a decrease in detection accuracy for existing zero-shot detectors in black-box detection.

Our results show a significant difference in the performance of zero-shot detectors on AI-generated text with and without prompts, highlighting the need to consider the impact of prompts on these detectors. These results further point out that likelihood-based zero-shot detectors face challenges for practical use. Additionally, the experimental results demonstrate that the fast zero-shot detectors are more robust than the other detectors due to their higher sampling rate.

Table 1
Summary of Zero-shot Detectors

Method — Summary
Log-likelihood — Detect using the log-likelihood of the given text.
Rank — Calculate the likelihood of the given text, convert the likelihood of each token into a rank over the entire vocabulary, and use it for detection.
Log-Rank — Calculate the likelihood of the given text, convert the likelihood of each token into a rank over the entire vocabulary, and apply the logarithm to these ranks for detection.
Entropy — Detect by calculating entropy using the likelihood of the tokens in the vocabulary.
DetectGPT [5] — Using a masked language model, randomly replace words in the text; observe the likelihood of the replaced and original texts under a scoring model and use the change for detection.
FastDetectGPT [6] — Replace the mask model in DetectGPT with an auto-regressive model similar to the scoring model; sample replacement words randomly from the vocabulary and calculate scores in the same manner as DetectGPT.
LRR [7] — Detect using the ratio of log-likelihood to log-rank.
NPR [7] — Similar to DetectGPT, but utilize logarithmic ranks rather than logarithmic likelihood in the score calculation.
Binoculars [8] — Utilize two models trained with slightly different amounts of data, calculate the perplexity of each, and leverage the difference in perplexity for detection.

2. Related work

In the context of intentionally undermining detection accuracy using prompts, two main categories of studies can be identified. The first category involves the deliberate crafting of prompts with malicious intent to reduce detection accuracy. The second category encompasses research that employs tasks with benign prompts, devoid of malicious intent.

2.1. Malicious prompts

First, we review studies that concentrate on the deliberate creation of malicious prompts.

In [19], Koike et al. proposed OUTFOX, utilizing in-context learning with a problem statement 𝑃, human-generated text 𝐻, and AI-generated text 𝐴. By constructing prompts such as "𝑝𝑖 ∈ 𝑃 → ℎ𝑖 ∈ 𝐻 is the correct label by humans, and 𝑝𝑖 ∈ 𝑃 → 𝑎𝑖 ∈ 𝐴 is the correct label by AI," they aim to generate text for a given problem statement in such a way that the generated text aligns with human-authored content. This approach makes the detection of artificially generated content challenging.

Shi et al. conducted an attack on OpenAI's detector [22] by employing an instructional prompt, confirming a decrease in detection accuracy [18]. The instructional prompt adds to the original input 𝑋 a reference text 𝑋𝑟𝑒𝑓 and an instructional text 𝑋𝑖𝑛𝑠 with characteristics that reduce detection accuracy, thereby undermining the detector.

In [20], Lu et al. proposed SICO, a method that lowers detection accuracy by instructing the model within the prompt to mimic the writing style of human-authored text and iteratively updating the content of the instructions.

Kumarage et al. [21] proposed an attack named Soft Prompt, which uses reinforcement learning to generate a vector that induces misclassification by detectors. This soft-prompt vector is then used as input against DetectGPT and RoBERTa-based detectors [12], demonstrating a decrease in detection accuracy [21].

2.2. Benign prompts

We now review cases involving tasks with benign prompts.

Liu et al. conducted experiments using the CheckGPT model, an approach based on supervised learning. Their findings indicate that when different prompts are used, detection accuracy decreases by approximately 7%, although it remains above 90% in all cases [15].
Dou et al. [14] performed experiments envisioning the utilization of LLMs by students. In their study, they demonstrated a decrease in DetectGPT's detection accuracy when prompts were employed.

Hans et al. [8] pointed out the difficulty of reproducing likelihoods depending on the presence or absence of prompts, using unique prompts like "Write about a capybara astronomer." In response to this capybara problem, they proposed Binoculars.

We assume the performance of benign tasks such as summarization. Therefore, unlike malicious prompt attacks, there is no need to deliberately choose prompts that lower the detector's accuracy when constructing prompts, nor is there a requirement to collect pairs of data for in-context learning.

On the other hand, Dou et al. [14] experimentally demonstrated unintended decreases in detection accuracy. However, they did not delve into why the accuracy decreases, nor did they examine other likelihood-based zero-shot detectors. Additionally, Hans et al. [8] did not provide specific verification regarding the impact of a detector knowing or not knowing the prompt on detection accuracy. Therefore, the resilience of Binoculars to changes in likelihood due to prompts has not been adequately assessed. The supervised-learning-based approach [15] is excluded from our experiments in this context.

In this study, we demonstrate that even in ordinary tasks such as summarization, the presence or absence of prompts unintentionally leads to a decrease in accuracy when using likelihood-based zero-shot detectors.

3. Preliminary

3.1. Language model

A model that captures the probability of generating words or sentences is referred to as a language model. Let 𝑉 represent the vocabulary. The language model for a word sequence of length 𝑛, denoted as 𝑥1, 𝑥2, …, 𝑥𝑛 where 𝑥𝑖 ∈ 𝑉, is defined by (1):

  P(x_1, x_2, …, x_n) = ∏_{t=1}^{n} P(x_t | x_1, …, x_{t-1})    (1)

3.2. Existing zero-shot detectors

We provide a brief introduction to existing zero-shot detectors, summarized in Table 1. Here, P_{Tθ} refers to the language model utilized for detection. The vocabulary 𝑉 is composed of 𝐶 tokens. The input text 𝑆 is composed of 𝑁 tokens, represented as S = {S_1, S_2, …, S_N}, and the token sequence from S_1 to S_{i-1} is denoted as S_{<i}.

3.2.1. Log-Likelihood

The log-likelihood method utilizes the likelihood of the tokens composing a text for detection. The score, shown in (2), is the average of the log-likelihoods of the tokens constituting the given text:

  Log-likelihood = (1 / (N-1)) Σ_{i=2}^{N} log P_{Tθ}(S_i | S_{<i})    (2)

3.2.2. Entropy

Entropy is a method that utilizes the entropy of the vocabulary distribution for detection. The formula is shown in (3): the entropy is calculated from the likelihood of every token in the vocabulary and averaged across contexts:

  Entropy = -(1 / (N-1)) Σ_{i=2}^{N} Σ_{j=1}^{C} P_{Tθ}(j | S_{<i}) log P_{Tθ}(j | S_{<i})    (3)

3.2.3. Rank

Rank is a method that utilizes the position of each token's likelihood among the vocabulary when sorted. The formula is presented in (4): the score is the average position of the tokens constituting the given text. The function sort sorts the given array in descending order, and index, given an array and an element, returns the index of that element within the array:

  Rank = (1 / (N-1)) Σ_{i=2}^{N} index(sort(log P_{Tθ}(· | S_{<i})), S_i)    (4)

3.2.4. DetectGPT

The language model aims to maximize likelihood during text generation, whereas humans create text independently of likelihood. DetectGPT focuses on this phenomenon and posits the hypothesis that, by rewriting certain words, the likelihood of the text decreases for AI-generated content but can either increase or decrease for human-generated content [5].

The overview of DetectGPT is presented in Figure 1. The replacement process is achieved by applying a mask model P_M, such as T5 [24], to some of the words contained in the given text 𝑆. This operation is repeated for a total of 𝑘 iterations, and the average log-likelihood of the obtained 𝑘 replacement texts is then computed. The score (5) is the difference between the log-likelihood of the original text and the average log-likelihood of the acquired replacement texts. It is permissible to standardize by dividing by the standard deviation of the log-likelihood of the replacement texts. If the score is above a threshold ε, the text is deemed to be AI-generated:

  DetectGPT = (log P_{Tθ}(S) - m̃) / σ̃_S    (5)

where

  m̃ = (1/k) Σ_{i=1}^{k} log P_{Tθ}(S̃_i),
  σ̃_S² = (1/(k-1)) Σ_{i=1}^{k} (log P_{Tθ}(S̃_i) - m̃)²,

and S̃_i ∼ P_M(S) denote the mean, the sample variance, and a sample from the mask model applied to 𝑆, respectively.

Figure 1: DetectGPT overview. (Words in 𝑠 are replaced by the mask model P_M to produce s̃_1, …, s̃_k; the scoring model P_{Tθ} computes the likelihood of the original and each replacement, yielding the mean and variance used in (5).)

3.2.5. FastDetectGPT

In [6], Bao et al. highlighted challenges in DetectGPT's use of different models for substitution and score calculation, as well as the cost of requiring model access for each substitution iteration. In response, FastDetectGPT is a modified detector that reduces access to the model, addressing the cost issue while still enabling substitutions. Although the methodology involves hypotheses similar to DetectGPT's, there is no fundamental change: it still operates on the assumption that AI-generated text lies near the maximum likelihood, whereas human-generated text does not.

We present the overall architecture of FastDetectGPT in Figure 2. In FastDetectGPT, the substitution process is replaced with an alternative method that does not rely on a mask model. Like the detection model, it utilizes an autoregressive model, and P_{Tθ} and P_{Uθ} can be the same. The substitution for the 𝑖-th word involves randomly drawing a token from the next-word distribution conditioned on the context up to the (𝑖-1)-th word of the input text and replacing the word with the chosen one. In other words, performing this substitution 𝑁 times yields a substituted text S̃, and by sampling during word selection, the replacement process generates 𝑘 substitution texts in a single access. The subsequent score calculation follows the same procedure as DetectGPT and is therefore omitted.

Figure 2: FastDetectGPT and sampling overview. (For 𝑠 = "I like apples", each position is replaced by a token sampled from the vocabulary 𝑉 given the preceding context, e.g. yielding s̃_1 = "I am dogs"; the likelihoods P_{Tθ}(s̃_i) are then scored as in DetectGPT.)

3.2.6. LRR & NPR

LRR (Log-Likelihood Log-Rank Ratio) and NPR (Normalized Perturbed log-Rank) are classical log-rank enhancement techniques proposed by Su et al. [7]. Both methods have simple configurations. LRR literally takes the ratio of log-likelihood to log-rank, as expressed in (6), where r_θ represents the rank when using P_{Tθ}:

  LRR = - ( Σ_{i=1}^{t} log P_{Tθ}(S_i | S_{<i}) ) / ( Σ_{i=1}^{t} log r_θ(S_i | S_{<i}) )    (6)

NPR, like DetectGPT, performs the substitution of words in the text 𝑘 times. It takes the ratio of the average log-rank of the obtained substituted texts to the log-rank of the original text, as defined in (7):

  NPR = ( (1/k) Σ_{p=1}^{k} log r_θ(S̃_p) ) / log r_θ(S)    (7)

3.2.7. Binoculars

Hans et al. proposed Binoculars, a detection method utilizing two closely related language models, Falcon-7B [26] and Falcon-7B-Instruct, by employing a metric called cross-perplexity [8]. The overall framework is illustrated in Figure 3.

Let the first model be denoted as M_1 (such as Falcon-7B) and the second as M_2 (such as Falcon-7B-Instruct). Using M_1, we calculate the log perplexity as shown in (8):

  log PPL_{M_1}(S) = -(1/N) Σ_{i=1}^{N} log M_1(S_i | S_{<i})    (8)

Next, using M_1 and M_2, we calculate the cross-perplexity as shown in (9), where · represents the dot product:

  log X-PPL_{M_1,M_2}(S) = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} M_1(j | S_{<i}) · log M_2(j | S_{<i})    (9)

The Binoculars score is determined by (10):

  B_{M_1,M_2}(S) = log PPL_{M_1}(S) / log X-PPL_{M_1,M_2}(S)    (10)

Figure 3: Binoculars overview. (M_1 yields log PPL_{M_1}; M_1 and M_2 together yield log X-PPL_{M_1,M_2}; the score is their ratio.)

4. Proposal

In this study, we propose a detection flow to investigate the impact of prompts on likelihood. Before presenting the experimental setup, we introduce an additional detection method.

4.1. FastNPR

Word replacements in NPR are performed using a masked model. In this research, aiming for cost reduction, we employ FastNPR, a method that replaces masked-model word replacement with sampling, akin to FastDetectGPT.
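For concreteness, the per-position statistics behind scores (2)–(4) can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in our experiments: it assumes the predictive distributions P_{Tθ}(· | S_{<i}) have already been extracted from a scoring model, and the function names are ours.

```python
import math

def log_likelihood_score(token_ids, dists):
    # Eq. (2): average log-likelihood of the observed tokens.
    # dists[i] is the model's distribution P(. | S_<i) over the C-token
    # vocabulary; token_ids[i] is the observed token S_i at that position.
    return sum(math.log(d[t]) for t, d in zip(token_ids, dists)) / len(token_ids)

def rank_score(token_ids, dists):
    # Eq. (4): average rank of each observed token when the vocabulary
    # is sorted by descending likelihood (rank 0 = most likely token here).
    ranks = []
    for t, d in zip(token_ids, dists):
        order = sorted(range(len(d)), key=lambda j: d[j], reverse=True)
        ranks.append(order.index(t))
    return sum(ranks) / len(ranks)

def entropy_score(dists):
    # Eq. (3): entropy of the predictive distribution, averaged over positions.
    return -sum(sum(p * math.log(p) for p in d if p > 0) for d in dists) / len(dists)
```

Log-Rank, LRR, and NPR follow the same pattern, applying the logarithm to a (1-based) rank instead of, or in addition to, the probability.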
4.2. Detection methods

We explain the detection methodology. Let 𝑥 represent the text to be detected and, if 𝑥 is AI-generated, let 𝑝 denote the prompt used for its generation. Detection can be categorized into two patterns: black-box detection and white-box detection. An overview is presented in Figure 4.

Black-box detection occurs when the detector is unaware of prompt information, essentially mirroring existing detection methods. In this scenario, only the content of 𝑥 is provided to the detector.

White-box detection, on the other hand, involves the detector having knowledge of prompt information. For human-generated text, only 𝑥 is input. In the case of AI-generated text, the input consists of 𝑝 + 𝑥. It is important to note that, in white-box detection, the prompt is used solely for likelihood calculation and is not directly included in the score computation.

Figure 4: Proposed detection methods overview. (For an AI-generated text such as "What is 1+1? 1+1 equals 2.", black-box detection computes score("1+1 equals 2."), whereas white-box detection computes score("1+1 equals 2." | "What is 1+1?").)

5. Experiment

5.1. Configuration

To begin, we utilize GPT2-XL [23] as the detection model for all detectors excluding Binoculars. Due to GPU constraints, Binoculars employs the pre-trained and instruct-tuned Phi-1.5 [27] instead of Falcon. For DetectGPT and NPR, we generate five replacement sentences for 10% of the entire text, while the Fast series generates 10,000 replacement sentences. T5-Large [24] is used for word replacement in DetectGPT and NPR, while the Fast series employs GPT2-XL, the same model used for detection. We use the XSum dataset [28]: for human-generated text, we extract 200 samples from XSum, and for AI-generated text, we employ the Llama 2 7B Chat model [25], generating up to 200 tokens. The prompt used is "Would you summarize the following sentences, please? text".

5.2. Result

As evident from the results in Table 2, white-box detection exhibits higher accuracy, while black-box detection shows lower accuracy. As anticipated, modifying likelihood through prompts leads to a decrease in the detection accuracy of likelihood-based detectors. Notably, there is a consistent decrease of 0.1 or more across all methods, highlighting a significant observation.

Table 2
Detection of Generated Summaries: Discrepancies Between Cases with and Without Prompts (AUC)

Method          Black-box  White-box
DetectGPT       0.453      1.000
FastDetectGPT   0.819      0.958
LRR             0.532      0.995
NPR             0.560      0.934
FastNPR         0.768      0.993
Entropy         0.330      0.978
Log-likelihood  0.474      0.998
Rank            0.432      0.977
Log-Rank        0.485      0.999
Binoculars      0.877      0.999

Binoculars and the Fast series detectors demonstrate robustness compared to other methods. In particular, the Fast series maintains the same score calculation as the conventional methods, suggesting that the robustness stems from the sampling process. For further verification, we conduct additional experiments: we investigate the differences in detection accuracy when varying the replacement ratio, indicating the extent to which tokens in the text are replaced, and the sample size, representing the number of replacement sentences.

Table 3
Effect of Substitution Rate (SR) and Sample Size (SS) Variation on AUC (DetectGPT)

Method         SR    SS     AUC
FastDetectGPT  10%   5      0.640
FastDetectGPT  20%   5      0.697
FastDetectGPT  100%  5      0.779
FastDetectGPT  10%   10     0.704
FastDetectGPT  20%   10     0.739
FastDetectGPT  100%  10     0.821
FastDetectGPT  100%  10000  0.819
DetectGPT      10%   5      0.453
DetectGPT      20%   5      0.522
DetectGPT      30%   5      0.490
DetectGPT      10%   10     0.446
DetectGPT      30%   10     0.446

Particularly in recent years, there is a trend toward practical applications, emphasizing high true positive rates at low false positive rates, suggesting that at least an AUC in the late 0.9s would be necessary [30, 8]. Furthermore, the lack of improvement in detection accuracy with DetectGPT and NPR may be attributed to the limited number of substitutable tokens.
DetectGPT and NPR require the use of a masked language model to replace plausible tokens, making replacement not always feasible, especially at higher replacement percentages. Therefore, we primarily vary the replacement ratio in the Fast series to conduct the investigation.

The results for DetectGPT are presented in Table 3, and the results for NPR are shown in Table 4. From these results, it is evident that increasing the replacement ratio and sample size helps mitigate the decrease in detection accuracy. This observation is similar to Chakraborty et al.'s assertion that increasing the sample size can enable detection if the distributions differ slightly [29]. However, in our validation, the improvement in accuracy plateaus at around 10 samples, reaching a maximum AUC of approximately 0.8, which is not considered high.

Table 4
Effect of Substitution Rate (SR) and Sample Size (SS) Variation on AUC (NPR)

Method   SR    SS     AUC
FastNPR  10%   5      0.628
FastNPR  20%   5      0.661
FastNPR  100%  5      0.747
FastNPR  10%   10     0.647
FastNPR  20%   10     0.715
FastNPR  100%  10     0.750
FastNPR  100%  10000  0.763
NPR      10%   5      0.560
NPR      20%   5      0.590
NPR      30%   5      0.577
NPR      10%   10     0.589
NPR      30%   10     0.588

6. Limitation and future work

6.1. Hypotheses for zero-shot detectors

While our investigation has focused solely on prompts, similar phenomena could potentially be observed with other factors. For instance, variations in temperature or repetition penalty between the generation and detection stages might introduce differences in the selected tokens, making likelihood-based detection challenging. Generalizing these observations, we hypothesize that any act that fails to replicate the likelihood during language generation could undermine the detection accuracy of zero-shot detectors relying on likelihood from next-word prediction.

6.2. Tasks

While our investigation has focused on summary text generation, there are several other potential tasks to consider, such as paraphrase generation, story generation, and translation. It is plausible that detection accuracy could also decrease in these common tasks. Since these tasks may be utilized without malicious intent, it is crucial to conduct similar evaluations for them.

6.3. Number of parameters

In this study, each detection method utilized a language model of approximately 1 billion parameters. It would be of interest to investigate whether increased robustness can be observed when experimenting with larger language models. Conversely, there are experimental studies that have demonstrated the ability of smaller language models to achieve a higher likelihood for AI-generated texts across a broader range of language models [31]. Considering these findings, conducting experiments with smaller language models and verifying whether there are differences in robustness could also provide valuable insights.

6.4. Relationship with supervised learning detectors

Even when using supervised learning, it has been noted that generated text from prompt-based tasks may exhibit decreased detection accuracy [15]. However, there is a possibility that these models could be more robust than zero-shot detectors. For instance, RADAR [13] achieved an AUC of 0.939 in the task used in this experiment, whereas the RoBERTa-large detector [12] had an AUC of 0.767. This suggests that detectors robust against paraphrase attacks might demonstrate similarly robust results in other tasks.

6.5. Relationship with watermarking

Watermarking techniques utilize statistical methods for verification [16]. Since these methods are based on likelihood during both generation and verification, a failure to reproduce likelihood during the verification stage may lead to a decrease in accuracy. On the other hand, watermarking techniques robust against paraphrase attacks have emerged [17]. These methods may exhibit robustness against prompts as well.

6.6. Towards resilient zero-shot detectors

Currently, many methods perform likelihood-based detection. Combining these methods with other sophisticated techniques may lead to more robust detection. One such approach is intrinsic dimension [11]: the minimum dimension needed to represent a given text. Tulchinskii et al. propose a detector based on persistent homology to estimate the intrinsic dimension and use it as a score. However, this method requires a constant length of text and was not applicable in our experiment. It would be interesting to explore the application of this method in experiments involving longer texts.

Approaches utilizing representations obtained with masked language models, including intrinsic dimension, calculate likelihood in a different way from the detectors used in our experiment, which are based on autoregressive language models. Combining these elements may lead to the development of a more robust zero-shot detector.

7. Conclusion

In this paper, we experimentally demonstrated a significant gap in the detection of AI-generated text with and without prompts for likelihood-based zero-shot detectors. These findings call for attention to the impact of prompts on enhancing zero-shot detectors in practical applications.

References

[1] OpenAI. (2023). GPT-4 technical report. arXiv.
[2] Microsoft. Microsoft Copilot. Retrieved October 31, 2023, from https://adoption.microsoft.com/ja-jp/copilot/.
[3] Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J. B., Yu, J., ... & Ahn, J. (2023). Gemini: A family of highly capable multimodal models. arXiv:2312.11805.
[4] Gehrmann, S., Strobelt, H., & Rush, A. (2019). GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 111–116).
[5] Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., & Finn, C. (2023). DetectGPT: Zero-shot machine-generated text detection using probability curvature. In ICML 2023.
[6] Bao, G., Zhao, Y., Teng, Z., Yang, L., & Zhang, Y. (2023). Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. arXiv:2310.05130.
[7] Su, J., Zhuo, T. Y., Wang, D., & Nakov, P. (2023). DetectLLM: Leveraging log rank information for zero-shot detection of machine-generated text. arXiv:2306.05540.
[8] Hans, A., Schwarzschild, A., Cherepanova, V., Kazemi, H., Saha, A., Goldblum, M., ... & Goldstein, T. (2024). Spotting LLMs with Binoculars: Zero-shot detection of machine-generated text. arXiv:2401.12070.
[9] Liu, S., Liu, X., Wang, Y., Cheng, Z., Li, C., Zhang, Z., ... & Shen, C. (2024). Does DetectGPT fully utilize perturbation? Selective perturbation on model-based contrastive learning detector would be better. arXiv:2402.00263.
[10] Sasse, K., Barham, S., Kayi, E. S., & Staley, E. W. (2024). To burst or not to burst: Generating and quantifying improbable text. arXiv:2401.15476.
[11] Tulchinskii, E., Kuznetsov, K., Kushnareva, L., Cherniavskii, D., Barannikov, S., Piontkovskaya, I., ... & Burnaev, E. (2023). Intrinsic dimension estimation for robust detection of AI-generated texts. arXiv:2306.04723.
[12] Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., ... & Wang, J. (2019). Release strategies and the social impacts of language models. arXiv:1908.09203.
[13] Hu, X., Chen, P. Y., & Ho, T. Y. (2023). RADAR: Robust AI-text detection via adversarial learning. arXiv:2307.03838.
[14] Dou, Z., Guo, Y., Chang, C. C., Nguyen, H. H., & Echizen, I. (2024). Enhancing robustness of LLM-synthetic text detectors for academic writing: A comprehensive analysis. arXiv:2401.08046.
[15] Liu, Z., Yao, Z., Li, F., & Luo, B. (2023). Check me if you can: Detecting ChatGPT-generated academic writing using CheckGPT. arXiv:2306.05524.
[16] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. In ICML 2023.
[17] Ren, J., Xu, H., Liu, Y., Cui, Y., Wang, S., Yin, D., & Tang, J. (2023). A robust semantics-based watermark for large language model against paraphrasing. arXiv:2311.08721.
[18] Shi, Z., Wang, Y., Yin, F., Chen, X., Chang, K. W., & Hsieh, C. J. (2023). Red teaming language model detectors with language models. arXiv:2305.19713.
[19] Koike, R., Kaneko, M., & Okazaki, N. (2023). OUTFOX: LLM-generated essay detection through in-context learning with adversarially generated examples. arXiv:2307.11729.
[20] Lu, N., Liu, S., He, R., & Tang, K. (2023). Large language models can be guided to evade AI-generated text detection. arXiv:2305.10847.
[21] Kumarage, T., Sheth, P., Moraffah, R., Garland, J., & Liu, H. (2023). How reliable are AI-generated-text detectors? An assessment framework using evasive soft prompts. arXiv:2310.05095.
[22] OpenAI. (2023). New AI classifier for indicating AI-written text. Retrieved November 30, 2023.
[23] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
[24] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1), 5485–5551.
[25] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., ... & Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
[26] Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., ... & Penedo, G. (2023). The Falcon series of open language models. arXiv:2311.16867.
[27] Li, Y., Bubeck, S., Eldan, R., Del Giorno, A., Gunasekar, S., & Lee, Y. T. (2023). Textbooks are all you need II: phi-1.5 technical report. arXiv:2309.05463.
[28] Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 1797–1807).
[29] Chakraborty, S., Bedi, A. S., Zhu, S., An, B., Manocha, D., & Huang, F. (2023). On the possibilities of AI-generated text detection. arXiv:2304.04736.
[30] Krishna, K., Song, Y., Karpinska, M., Wieting, J., & Iyyer, M. (2023). Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. arXiv:2303.13408.
[31] Mireshghallah, F., Mattern, J., Gao, S., Shokri, R., & Berg-Kirkpatrick, T. (2023). Smaller language models are better black-box machine-generated text detectors. arXiv:2305.09859.