Team lm-detector at PAN: Can NLI be an Appropriate Approach to Machine-Generated Text Detection
Notebook for the PAN Lab at CLEF 2024

Guojun Wu1,†, Qinghao Guan1,†
1 University of Zurich, Zurich, 8050, Switzerland

Abstract
The ability to accurately detect machine-generated text is becoming increasingly important in various fields, including academia, journalism, and online security. In this study, we propose a novel method for detecting machine-generated text, predicated on the hypothesis that the probability of reasoning from human-written text to machine-generated text is inherently higher than in the reverse direction. Our approach is inspired by the principles of Natural Language Inference (NLI) and leverages differences in logical consistency and contextual coherence between human-written and machine-generated texts. However, our experimental results indicate that this method may not be as effective as anticipated. Despite the theoretical foundation, the practical application of our method revealed significant limitations, suggesting that it might not be a reliable solution for detecting machine-generated text. Further research and refinement are necessary to enhance the efficacy of detection techniques.

Keywords
Machine-Generated Text Detection, Natural Language Inference, Probability

1. Introduction
The rapid advancement of artificial intelligence has led to the widespread use of machine-generated text in various domains [1]. Recent Large Language Models, such as ChatGPT [2] and LLaMA 2 [3], can generate human-like texts for various downstream tasks, and their performance has been shown to surpass that of humans on some specific tasks. From automated news articles to customer service chatbots, these texts are becoming indistinguishable from those written by humans [4]. While this technological progress brings many benefits, it also poses significant challenges, particularly in the realm of text authenticity and content verification.
Detecting machine-generated text is crucial for maintaining the integrity of information. In academia, it helps prevent plagiarism and ensures the originality of scholarly work. In journalism, it safeguards against the dissemination of fake news and misinformation. On online platforms, it enhances security by identifying automated accounts and reducing the spread of malicious content. Despite the growing need for effective detection methods, current techniques often fall short. Traditional approaches typically focus on stylistic and linguistic features, which can be easily manipulated by advanced language models. As a result, there is a pressing need for more robust and reliable detection methods.
In this study, we propose a novel approach inspired by Natural Language Inference (NLI). Our method is based on the hypothesis that we can judge which of two texts was written by a human by comparing the probabilities of reasoning from one text to the other in each direction (see Section 4). By leveraging differences in logical consistency and contextual coherence between human-written and machine-generated texts, we aim to develop a more accurate detection model. However, our experimental results suggest that this method may not be as effective as initially anticipated. Despite its theoretical promise, its practical application revealed significant limitations, highlighting the complexity of the detection problem. This paper presents our findings and discusses the implications for future research in this area.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† Authors contributed equally
guojun.wu@uzh.ch (G. Wu); qinghao.guan@uzh.ch (Q. Guan)
ORCID: 0000-0003-0062-4502 (Q. Guan)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Background
This work was developed for the PAN task Generative AI Authorship Verification [5, 6, 7, 8], in which we are given two texts, one authored by a human and the other by a machine, and the target is to pick out the one written by the human. The dataset was provided by the PAN organizers and was generated in another PAN task, in which participants were asked to build models that produce texts as similar as possible to human-written ones. The bootstrap dataset covers multiple text genres, including news articles, Wikipedia texts, and fiction.

Figure 1: An example of the bootstrap dataset

3. Previous Work
Several studies have approached the detection of AI-generated text as a binary classification problem using neural network-based detectors [9]. For instance, OpenAI has fine-tuned RoBERTa-based GPT-2 detector models to differentiate between human-written and GPT-2-generated texts [10]. Other researchers have explored zero-shot detection of AI-generated text: [11] noted that AI-generated passages typically exhibit negative curvature in the log probability of texts and proposed DetectGPT, a zero-shot detection method that capitalizes on this observation.
However, relying on neural networks for detection can expose these methods to adversarial and poisoning attacks [12, 13]. To address this, some researchers have explored watermarking AI-generated texts to facilitate detection [14, 15]. Watermarking involves embedding specific patterns in the text, making detection easier. This method provides consistent detection across various contexts and model updates, maintaining its effectiveness without the need for frequent re-training. Watermarking is computationally efficient, requiring minimal additional resources during text generation and enabling quick verification. Furthermore, it enhances security by complicating adversarial attempts to alter the text undetected, and it supports traceability by linking generated content back to a specific model instance, aiding accountability and auditing. Overall, watermarking presents a low-overhead, resilient, and scalable approach to managing the challenges of AI-generated text detection.

4. System Overview
NLI is an NLP task that involves determining the relationship between two sentences: whether one sentence (the hypothesis) can be inferred from another (the premise). It has been shown that NLI can be used for inconsistency detection in summarization, where the source document acts as the premise and the generated summary acts as the hypothesis [16]. The NLI model evaluates whether the information in the summary can logically be inferred from the source document. Inspired by this use of NLI in summarization, we detect machine-generated text by examining the logical relationship between the premise and the hypothesis. The model checks for three possible relationships between the premise and the hypothesis:
Entailment: The hypothesis (summary) logically follows from the premise (source text).
Contradiction: The hypothesis contradicts the premise.
Neutral: There is no clear logical relationship, meaning the hypothesis might add information not present in the premise or omit critical information.
Given two texts, text_a and text_b, one authored by a human and the other generated by a machine, we calculate the probability of reasoning for each direction independently. Let $P(T_{\text{text\_a}} \rightarrow T_{\text{text\_b}})$ denote the probability of reasoning from text_a to text_b, and $P(T_{\text{text\_b}} \rightarrow T_{\text{text\_a}})$ the probability of reasoning from text_b to text_a. If $P(T_{\text{text\_a}} \rightarrow T_{\text{text\_b}})$ is larger than $P(T_{\text{text\_b}} \rightarrow T_{\text{text\_a}})$, we assume that text_a was written by a human. It is worth noting that we did not conduct any pre-processing (e.g., segmentation), in order to provide sufficient context for reasoning by our model.
Our hypothesis is as follows. The premise provides the basis or groundwork for a conclusion, while the hypothesis, in a logical structure, is a statement whose validity is supported by the premise. On the one hand, the machine-generated text in our task was generated based on the human-written text, which means that the human-written text provides the foundation and should therefore serve as the premise. On the other hand, text generated by AI may not match human authors in terms of semantic coherence and logical depth [17]. Accordingly, it should be harder to derive the human-written text on the basis of the machine-generated one. Besides, we define that if the difference between $P(T_{\text{text\_a}} \rightarrow T_{\text{text\_b}})$ and $P(T_{\text{text\_b}} \rightarrow T_{\text{text\_a}})$ is lower than 0.05, the relation between the two texts is neutral, meaning there is no clear logical relationship between them.
The language model for NLI was DeBERTa-v3-large-mnli-fever-anli-ling-wanli, a model fine-tuned specifically for NLI tasks [18]. We chose it because it was fine-tuned on several distinct datasets, including FEVER (Fact Extraction and VERification), ANLI (Adversarial NLI), and WANLI (Worker-and-AI Collaboration for NLI).

Figure 2: Pipeline of our detector model
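To make the decision rule above concrete, the following is a minimal sketch of how the bidirectional comparison could be implemented with the Hugging Face Transformers library. The checkpoint namespace ("MoritzLaurer/..."), the helper-function names, and the use of truncation are illustrative assumptions rather than our exact submission code; the method itself only prescribes the NLI model, the two directional entailment probabilities, and the 0.05 neutral margin.

```python
# Minimal sketch (assumptions noted above): score a text pair with an NLI model
# in both directions and pick the text presumed to be human-written.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed Hugging Face checkpoint for the DeBERTa-v3 NLI model named in the paper.
MODEL_NAME = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def entailment_prob(premise: str, hypothesis: str) -> float:
    """Return P(premise -> hypothesis), i.e. the softmax probability of 'entailment'."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Look up the entailment index from the model config rather than hard-coding it.
    label2id = {label.lower(): idx for idx, label in model.config.id2label.items()}
    return probs[label2id["entailment"]].item()


def pick_human(text_a: str, text_b: str, neutral_margin: float = 0.05) -> str:
    """Apply the decision rule: the better premise is assumed to be the human text."""
    p_ab = entailment_prob(text_a, text_b)  # P(T_text_a -> T_text_b)
    p_ba = entailment_prob(text_b, text_a)  # P(T_text_b -> T_text_a)
    if abs(p_ab - p_ba) < neutral_margin:
        return "neutral"  # no clear logical relationship between the two texts
    return "text_a" if p_ab > p_ba else "text_b"
```

In this sketch, each pair is scored once in each direction, and the text on the side of the larger entailment probability is returned as the presumed human-written one; pairs whose probabilities differ by less than the margin are labelled neutral.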
5. Results
We compared our model, detector, with the baseline models. The performance metrics in Table 1 indicate that the detector model significantly underperforms all baseline approaches. Specifically, the detector (our model) achieved a ROC-AUC of 0.548, the lowest among all models, indicating poor discriminative ability. Its Brier score is 0.622, suggesting less accurate probabilistic predictions, while its C@1 score of 0.489 is the lowest, reflecting suboptimal performance. The detector's F1 score of 0.442 and F0.5u score of 0.461 are also the lowest, indicating weak balanced and precision-weighted performance, respectively. In contrast, Baseline Binoculars exhibits the highest performance across most metrics, with a ROC-AUC of 0.972, a Brier score of 0.957, and C@1, F1, and F0.5u scores all around 0.965. The overall mean score of Baseline Binoculars is 0.965, compared to the detector's mean of 0.512. The Fast-DetectGPT (Mistral) baseline also performed well, with a ROC-AUC of 0.876 and a mean score of 0.866. The quantile rows show the 95-th quantile achieving the highest scores, with a ROC-AUC of 0.994 and a mean score of 0.990, reflecting the performance of the top 5% of submissions.

Table 1
Overview of the accuracy in detecting if a text is written by a human in task 4 on PAN 2024 (Voight-Kampff Generative AI Authorship Verification). We report ROC-AUC, Brier, C@1, F1, F0.5u, and their mean.

Approach                            ROC-AUC  Brier  C@1    F1     F0.5u  Mean
detector                            0.548    0.622  0.489  0.442  0.461  0.512
Baseline Binoculars                 0.972    0.957  0.966  0.964  0.965  0.965
Baseline Fast-DetectGPT (Mistral)   0.876    0.8    0.886  0.883  0.883  0.866
Baseline PPMd                       0.795    0.798  0.754  0.753  0.749  0.77
Baseline Unmasking                  0.697    0.774  0.691  0.658  0.666  0.697
Baseline Fast-DetectGPT             0.668    0.776  0.695  0.69   0.691  0.704
95-th quantile                      0.994    0.987  0.989  0.989  0.989  0.990
75-th quantile                      0.969    0.925  0.950  0.933  0.939  0.941
Median                              0.909    0.890  0.887  0.871  0.867  0.889
25-th quantile                      0.701    0.768  0.683  0.657  0.670  0.689
Min                                 0.131    0.265  0.005  0.006  0.007  0.224

Table 2 reports the mean accuracy over nine variants of the test set, together with the official baselines provided by the PAN organizers and summary statistics over all submissions to the task (i.e., the maximum, median, minimum, and the 95-th, 75-th, and 25-th percentiles).
We analyzed the reasons for our model's poor performance. Firstly, our method relies on a single feature, logical inference, which might be insufficient for a comprehensive detection mechanism. Successful detection methods typically incorporate multiple features, including linguistic, syntactic, and semantic analysis, to capture the multifaceted nature of human versus machine-generated text. This suggests that we should construct a more comprehensive set of classification features.
Besides, modern AI models like GPT-3 and GPT-4 are designed to generate text that closely mimics human writing, including its coherence and detail. Consequently, the distinction between detailed AI-generated text and detailed human text becomes blurred. Human writers can also produce highly detailed and coherent text, especially in structured or formal contexts. This overlap reduces the effectiveness of using coherence and detail as discriminative features.
Human-written text can also exhibit inferential relationships, especially in informative or explanatory writing. For instance, when humans explain concepts or provide detailed descriptions, their sentences can logically entail one another. As mentioned, the dataset involves multiple genres. Our method might therefore frequently misclassify detailed and coherent human text (such as news articles) as AI-generated, leading to a high rate of false positives.
From the NLI model's perspective, our method is zero-shot, which means the model has not been specifically trained or fine-tuned on a dataset of human versus AI-generated texts, so it may not be optimized to distinguish the subtle differences between the two types of text. Moreover, DeBERTa's strength in recognizing logical relationships might lead it to detect coherent inferences in both human and AI texts, making it difficult to distinguish between them based solely on coherence.

Table 2
Overview of the mean accuracy over 9 variants of the test set. We report the minimum, the 25-th quantile, the median, the 75-th quantile, and the maximum of the mean over the 9 datasets.

Approach                            Minimum  25-th Quantile  Median  75-th Quantile  Max
detector                            0.405    0.505           0.521   0.571           0.622
Baseline Binoculars                 0.342    0.818           0.844   0.965           0.996
Baseline Fast-DetectGPT (Mistral)   0.095    0.793           0.842   0.931           0.958
Baseline PPMd                       0.270    0.546           0.750   0.770           0.863
Baseline Unmasking                  0.250    0.662           0.696   0.697           0.762
Baseline Fast-DetectGPT             0.159    0.579           0.704   0.719           0.982
95-th quantile                      0.863    0.971           0.978   0.990           1.000
75-th quantile                      0.758    0.865           0.933   0.959           0.991
Median                              0.605    0.645           0.875   0.889           0.936
25-th quantile                      0.353    0.496           0.658   0.675           0.711
Min                                 0.015    0.038           0.231   0.244           0.252
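Consistent with the caption of Table 1 ("... and their mean"), the Mean column is the unweighted arithmetic mean of the five reported metrics. For the detector row, for example:

$\text{Mean} = \frac{0.548 + 0.622 + 0.489 + 0.442 + 0.461}{5} \approx 0.512.$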
6. Further Direction
To enhance the performance of our AI-generated text detection method, it is crucial to fine-tune the DeBERTa model on a dataset tailored specifically to distinguishing human-written and AI-generated text. This specialized training will help the model learn the unique patterns and nuances of the task. Additionally, incorporating a broader feature set, including stylistic markers, syntactic complexity, and lexical diversity, can provide a more robust classification framework. Employing ensemble methods that combine zero-shot NLI models with supervised models trained on the detection task can further improve performance by leveraging the strengths of different approaches. Regular evaluation and refinement using diverse and updated datasets will ensure the model adapts to new patterns in text generation. Lastly, utilizing contextual embedding techniques can capture richer text representations, enabling deeper contextual analysis beyond simple logical inference.

7. Conclusion
In this study, we explored the potential of using Natural Language Inference (NLI) to detect machine-generated text by examining the logical relationship between premises and hypotheses. Our hypothesis was that machine-generated text, being more detailed and coherent due to probabilistic generation, would differ significantly from human text in its inferential relationships. However, our experimental results revealed significant limitations of this approach. Specifically, our zero-shot method using the DeBERTa-v3-large-mnli-fever-anli-ling-wanli model underperformed the baseline models across all metrics, including ROC-AUC, Brier score, C@1, F1, and F0.5u. The primary reasons for this underperformance include the overlap in coherence and detail between human-written and AI-generated texts, the limitations of a single-feature approach based solely on logical inference, and the model's lack of fine-tuning on a task-specific dataset. Our findings suggest that successful detection of AI-generated text requires a multifaceted approach that incorporates diverse linguistic features and specialized training. Future work should focus on fine-tuning models on relevant datasets and integrating additional classification features to improve the robustness and accuracy of detection methods.

Acknowledgments
We appreciate the help of Simon Clematide and Andrianos Michail, who provided suggestions to improve our work. We would also like to extend our sincere gratitude to the anonymous reviewer whose insightful comments and suggestions significantly contributed to the improvement of this manuscript.

References
[1] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
[2] OpenAI, ChatGPT: Optimizing language models for dialogue (2022).
[3] H. Touvron, L. Martin, K. R. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. M. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. S. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. M. Kloumann, A. V. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan,
M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[4] L. Dugan, D. Ippolito, A. Kirubarajan, S. Shi, C. Callison-Burch, Real or fake text?: Investigating human ability to detect boundaries between human-written and machine-generated text, AAAI (2022). arXiv:2212.12672.
[5] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[6] J. Bevendorff, M. Wiegmann, E. Stamatatos, M. Potthast, B. Stein, Overview of the Voight-Kampff Generative AI Authorship Verification Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[7] A. A. Ayele, N. Babakov, J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, E. Stakovskii, B. Stein, M. Taulé, D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[8] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.
[9] G. Jawahar, M. Abdul-Mageed, L. V. Lakshmanan, Automatic detection of machine generated text: A critical survey, arXiv preprint arXiv:2011.01314 (2020).
[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[11] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, C. Finn, DetectGPT: Zero-shot machine-generated text detection using probability curvature, arXiv preprint arXiv:2301.11305 (2023).
[12] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014).
[13] V. S. Sadasivan, M. Soltanolkotabi, S. Feizi, CUDA: Convolution-based unlearnable datasets, arXiv preprint arXiv:2303.04278 (2023).
[14] J. Kirchenbauer, J. Geiping, Y. Wen, M. Shu, K. Saifullah, K. Kong, K. Fernando, A. Saha, M. Goldblum, T. Goldstein, On the reliability of watermarks for large language models, arXiv preprint arXiv:2306.04634 (2023).
[15] X. Zhao, Y.-X. Wang, L. Li, Protecting language generation models via invisible watermarking, arXiv preprint arXiv:2302.03162 (2023).
[16] P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, SummaC: Re-visiting NLI-based models for inconsistency detection in summarization, Transactions of the Association for Computational Linguistics 10 (2022) 163–177.
[17] O. Marchenko, O. Radyvonenko, T. Ignatova, P. Titarchuk, D. Zhelezniakov, Improving text generation through introducing coherence metrics, Cybernetics and Systems Analysis 56 (2020) 13–21.
[18] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).