<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>The Distinctive Data Centric Approach for the Voight Kampf Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rohit R. Gunti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Tennessee, School of Information Sciences</institution>
          ,
          <addr-line>Knoxville, 1345 Circle Park Drive, Suite 412</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Data plays a crucial role in finetuning. Training a pretrained model on a specific dataset adapts it to a particular task or domain, improving relevance and its command of stylistic features. The study employs a distinctive data-centric approach, collecting and preprocessing several renowned datasets to improve text quality and generate human-like sentences for the given test questions in Voight Kampf 2025. The findings indicate that the data preprocessing resulted in fruitful text generations, and that future work calls for an even more data-centric approach.</p>
      </abstract>
      <kwd-group>
<kwd>Voight Kampf</kwd>
        <kwd>Llama 3</kwd>
        <kwd>GPT 2</kwd>
        <kwd>Data Preprocessing</kwd>
        <kwd>Quantization</kwd>
        <kwd>Finetune</kwd>
        <kwd>Perplexity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
The study begins with the crucial step of reviewing the detectors used in the prior shared tasks
(PAN) before developing an LLM that can fool an LLM detector for Voight Kampf 2025 [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
]. In
the pre-LLM era, some authors working on authorship verification attempted a simple compression-based approach (the PPM algorithm) that achieved competitive accuracy without any text
preprocessing or training [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. Other authors used classifiers (e.g., SVM) that were highly
effective (accuracy over 95 percent) for authorship verification [
        <xref ref-type="bibr" rid="ref4">4</xref>
]. There, the training process and text
preprocessing (chunking long texts from books) captured key stylistic differences that a single classifier
might miss. The SVM classifier again proved its potential when a novel approach of applying a linear SVM
repeatedly to word-frequency features (from the Bootstrap dataset) achieved approximately 80 percent
accuracy [
        <xref ref-type="bibr" rid="ref5">5</xref>
]. In any case, the classifier (SVM) depended on data preprocessing before training
to decide whether a given text is from the same author or a different one. In the LLM era, tasks and
methods have evolved to detect stylistic features such as consistency of tone, robustness, and pronoun
use. Since AI-generated text raised misinformation concerns, finetuning an LLM as a stylometric
detector to tackle misinformation gave high accuracy (approx. 94 percent F1) for detecting AI-generated
text even when only one percent of an article was AI-generated [
        <xref ref-type="bibr" rid="ref6">6</xref>
]. The key to that finetuning is a mixed
dataset of journalist-written articles and LLM (Grover) generated text. However, the
findings for the misinformation detection task indicated that, because LLM style does not change with intent
(whether truthful or false content), relying solely on stylometric features cannot fully detect
AI-generated misinformation. In a similar study, authors also demonstrated that using an LLM-based model
greatly improved detection accuracy (approx. 98 percent F1) [
        <xref ref-type="bibr" rid="ref7">7</xref>
]. Although some zero-shot
LLM detectors require no training data and achieve notable accuracy (approx. 90 percent),
SVM- and LLM-based detectors or evaluators that distinguish text generally require data for better performance [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
Hence, it is clear that careful data preprocessing can yield promising, high performance for
a detector. Conversely, this implies that an LLM text generator whose output is to be evaluated
requires more data, or richer features, to avoid identification. In other words, for a task such as Voight Kampf, the developed
LLM requires more training data in order to fool an LLM-driven detector/evaluator. More evidence
comes from another detection study whose findings indicated that strong detectors can be fooled
by an attacker applying a slight rephrasing technique to benchmark datasets [
        <xref ref-type="bibr" rid="ref10">10</xref>
]. This data-and-technique
requirement is observed for tasks in which LLMs serve as evaluators that
distinguish LLM output from human writing. Human evaluators or judges, in contrast,
performed poorly at analyzing style and linguistic features [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
]. Therefore,
for Voight Kampf 2025, strategic data collection and preprocessing are attempted to generate
output that fools an LLM evaluator.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Data Collection</title>
<p>The study reports the methodology in the following three sections: (1) Data Collection, (2) Data
Preprocessing, and (3) Finetune. Lastly, the findings section includes the preliminary results and the shared-task
results evaluated by the Voight Kampf (2025) committee.</p>
<p>Four open-access datasets are collected from sources such as Zenodo and Kaggle. All the datasets are
selected to capture human linguistic features and address the misinformation concerns mentioned in
the literature.</p>
        <p>
1. First, as the use of the Bootstrap dataset has been encouraged by the PAN CLEF 2024 committee,
we downloaded it from Zenodo after acquiring the permissions [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
]. The dataset consists,
for each of 1,087 topics (paragraphs of text), of 13 AI-generated articles and one human-written article.
However, only the human-written text is used and advanced to data preprocessing to capture
the stylistic features.
2. Second, the FakeNewsNet dataset is downloaded from a GitHub repo mirrored on Kaggle,
which contains articles and large-scale social media data linked to each article [
          <xref ref-type="bibr" rid="ref15">15</xref>
]. The dataset
contains articles from two platforms, PolitiFact and GossipCop, covering politics and entertainment,
respectively. Out of the four files in the dataset, only the “politifact_real.csv” file (917 entries of
URLs to news articles labelled as true claims) is used, to capture human responses
(linguistic features) and tackle misinformation.
3. Third, the ISOT fake news dataset is split into two sub-datasets, of which only one
is selected, as it contains 21,417 real news articles from Reuters [16, 17]. In the sub-dataset,
each article has a title, text, and publication date in CSV format. The intention is to capture the
writing style, structure, and vocabulary from each real news article’s text entry.
4. Fourth, the LIAR dataset is extracted from Kaggle and contains 10,269 political statements
labeled by PolitiFact’s fact-checkers. Each statement is judged on truthfulness on
a six-point scale (true, mostly true, half-true, barely true, false, pants on fire). These statements
correlate with features such as language style and credibility, which the finetuned model can
capture from only the true statements in the “train.tsv” file.
        </p>
<p>Overall, all four datasets, as shown in Table 1, are chosen to capture human linguistic styles and
address misinformation, which has been a major concern in prior studies. All the datasets were primarily
collected/used for detection tasks. However, in line with Voight Kampf 2025, this study focuses on capturing
the human linguistic styles that the Voight Kampf LLM evaluator cannot detect.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Preprocessing</title>
<p>The final datasets selected in the data collection process are subject to data preprocessing, which follows three
stages: data retrieval, text extraction, and classification. However, not all of the selected datasets go
through every preprocessing step.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Data Retrieval</title>
<p>Data retrieval is attempted only on “politifact_real.csv,” which has three columns (id,
news URL, title), where each row is associated with a PolitiFact ID. After inspection, however,
the ID is missing for some URLs and titles. To ensure that we collect data from PolitiFact, we
removed all URLs and titles that lack a PolitiFact ID, leaving 797 rows with a PolitiFact
ID, associated URL, and title. Next, using the URLs and the required Python packages, the text (at least
500 words) of each retrievable (open-access) article is fetched; we were able to retrieve text from
195 articles. To ensure that the finetuned model captures meaningful patterns in text, the 195 retrieved
articles are further preprocessed by removing noise characters, normalizing whitespace,
and handling missing values. The final cleaned version therefore includes 193 cleaned
articles in CSV format. The text from the other datasets is plain and needs no additional cleaning.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Text Extraction</title>
<p>Now that text is available, the next stage involves extracting it from all four datasets. Since two
files (the cleaned PolitiFact file and ISOT true) share the same CSV format, their plain text is simply
extracted to an output file. Since LIAR (train.tsv) is labelled with six classes, where “true” marks legitimate
statements, only the statements from the true class are extracted and appended to the output file.
Lastly, the human-written text from the Bootstrap dataset is extracted and appended to the output
file. Overall, the output file contains 24,284 text entries, with approximately 88 percent from ISOT, 6
percent from LIAR, 4.5 percent from Bootstrap, and 0.8 percent from FakeNewsNet. The output file is
then filtered by removing text entries of fewer than 100 words, leaving 19,401 entries in the final file.
This truncation is performed in the belief that longer paragraphs will give the
finetuning model more clues.</p>
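The length filter on the combined output file can be sketched minimally, assuming a simple whitespace word count (the paper does not specify how words are counted):

```python
def filter_short_entries(entries, min_words=100):
    """Keep only text entries of at least min_words words, mirroring the
    filtering step that reduced 24,284 entries to 19,401."""
    return [t for t in entries if len(t.split()) >= min_words]

entries = ["short note", "word " * 150]  # 2 words vs. 150 words
kept = filter_short_entries(entries)
print(len(kept))  # 1: only the 150-word entry survives
```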
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. Classification</title>
<p>Since the output file contains raw text with 19,401 entries, understanding the genre and
style of each entry during finetuning is essential. Therefore, the BART Large MNLI model provided by Facebook is
used to identify the genre and style of each entry. In other words, zero-shot classification
is run across all the entries, which allows the BART model to assign a genre and style even though
the model has not been explicitly trained on these labels. To let the model capture different styles and genres, various
categories such as science fiction, fantasy, romance, narrative style, persuasive style, and so forth, are
predefined. Next, to match the Voight Kampf 2024 data sample format, keywords are also
extracted (using the YAKE extractor) for each entry, alongside genre and style, for better finetuning. The final result is
matched to the task requirement and converted to JSON format before finetuning; this final dataset
is referred to as the finetune dataset. The genres, styles, and keywords from the result are depicted in Figures
1-3.</p>
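Assembling each classified entry into a JSON record might look like the following sketch. The field names are an assumed approximation, since the exact Voight Kampf 2024 sample schema is not reproduced in the paper, and the genre/style/keyword values would come from the BART and YAKE steps described above:

```python
import json

# Candidate labels of the kind the paper predefines for zero-shot classification.
GENRES = ["science fiction", "fantasy", "romance", "news report"]
STYLES = ["narrative style", "persuasive style", "formal style"]

def to_finetune_record(text, genre, style, keywords):
    """Assemble one finetune-dataset entry. Field names are an assumed
    approximation of the Voight Kampf 2024 sample format, not the exact schema."""
    return {
        "text": text,
        "genre": genre,       # label picked by BART Large MNLI (zero-shot)
        "style": style,       # label picked by BART Large MNLI (zero-shot)
        "keywords": keywords, # extracted by YAKE
    }

record = to_finetune_record(
    "Officials confirmed the vote count on Tuesday.",
    genre="news report",
    style="narrative style",
    keywords=["vote count", "officials"],
)
print(json.dumps(record, indent=2))
```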
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Finetune</title>
<p>Llama 3 is finetuned using a Low-Rank Adaptation (LoRA) approach for efficient memory usage. The
finetuning involves customizing the model to generate nuanced responses based on the finetune dataset.
The generated responses are expected to read as more human-written when evaluated by the Voight Kampf
2025 evaluator. Before finetuning, the dataset is prepared using the Llama 3 specific prompt template:
each entry in the finetuning dataset contains an example with an instruction, input, and output, so that
Llama 3 follows structured guidance and generates relevant responses.</p>
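A sketch of how each finetuning example can be rendered from its instruction, input, and output fields. The actual run used a Llama 3 specific prompt template; the Alpaca-style layout below is an assumed stand-in carrying the same three fields, not the template the paper used:

```python
# Alpaca-style layout with the instruction/input/output fields the paper
# describes; the Llama 3 template's special tokens are deliberately omitted.
ALPACA_STYLE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_example(instruction, input_text, output_text):
    """Render one finetuning example as a single training string."""
    return ALPACA_STYLE.format(
        instruction=instruction, input=input_text, output=output_text
    )

example = format_example(
    "Write a text of about 500 words which covers the following items:",
    "Genre: scam letter. Style: formal and persuasive.",
    "Received via email on 26 January 2013. ...",
)
print("### Response:" in example)  # True
```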
        <sec id="sec-2-3-1">
          <title>2.3.1. LoRa Configuration</title>
<p>The Llama 3 model, loaded in 4-bit quantization for efficiency, is set up with a specific training
configuration. Several experiments were conducted to track the training loss and keep it minimal.
To supervise the finetuning, the SFT trainer is enabled. The SFT trainer configuration, together with the LoRA
setup under which the training loss is minimal, is referred to as the optimal training configuration, as shown
in Table 2.</p>
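As a hypothetical illustration of such a configuration, in the spirit of a peft LoraConfig plus an SFT trainer setup: the paper's optimal values are those in its Table 2, so every number below is an illustrative placeholder rather than the configuration actually used.

```python
# Hypothetical LoRA hyperparameters (placeholders, not the paper's Table 2).
lora_config = {
    "r": 16,                        # LoRA rank
    "lora_alpha": 16,               # scaling factor
    "lora_dropout": 0.0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

# Hypothetical SFT training settings (placeholders).
sft_config = {
    "load_in_4bit": True,           # 4-bit quantization for memory efficiency
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "num_train_epochs": 1,
}
print(sorted(lora_config))
```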
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Findings</title>
      <sec id="sec-3-1">
        <title>3.1. Preliminary Findings</title>
<p>There is evidence that GPT-2 is a suitable large language model (LLM) for text quality evaluation [18,
19, 20]. Since the Voight Kampf task evaluation was ongoing, the study employs GPT-2 to estimate the
finetuned model’s perplexity as a preliminary result. Out of 22 Voight Kampf test questions,
seven files showed a perplexity score above forty. In other words, approximately thirty-two percent
read as more human-written, and the GPT-2 evaluator failed to detect the AI-generated content.</p>
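Perplexity here is the exponential of the average negative log-likelihood that the evaluator assigns to the generated tokens. A minimal sketch, with the per-token log-probabilities supplied directly rather than computed by GPT-2:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities:
    ppl = exp(-(1/N) * sum(log p_i))."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A uniform per-token probability of 1/40 yields a perplexity of 40,
# the threshold the study uses for "more human-like" generations.
lp = [math.log(1 / 40)] * 12
print(round(perplexity(lp), 6))  # 40.0
```

Higher perplexity means the evaluator found the text less predictable, which the study reads as more human-like.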
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Voight Kampf Task Findings</title>
        <p>
The Voight Kampf Task in Eloquent Lab 2025 is designed to encourage participants to
build a system that fools the detectors developed by the Voight Kampf AI detection task at PAN 2025. The
PAN detectors evaluate the submissions (text generations) of the finetuned model using five measures:
Brier, c@1, F1, F0.5u, and the arithmetic mean of all the metrics [
          <xref ref-type="bibr" rid="ref5">5</xref>
]. After the PAN detectors’ evaluation,
the finetuned Llama 3’s c@1 (0.74) and F0.5u (0.88) scores indicate that the model handled uncertain
cases very cautiously, similar to humans. The supporting F1 evaluation (0.80) suggests that the
finetuned Llama 3’s generated responses are well balanced in both precision and recall, implying
that the model generates responses that are both accurate and safe. However, when evaluating
calibration (Brier score), the moderately high score (0.78) typically suggests that the model’s predicted
probabilities deviate from actual outcomes. Though there is uncertainty about the model being
overconfident or underconfident (Brier score), the arithmetic mean (0.64) of all metrics demonstrates
the finetuned model’s moderate to strong overall performance.
</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Finetuned Llama 3 Sample Output</title>
          <p>voight-kampff test topics: language: en; date: 2024; type: example; source: eloquent organisers; prompt: “Write a text of about 500 words which covers the following items:”; topic id: 001.</p>
          <p>Genre: Advance-Fee Fraud / Scam Letter (commonly known as a 419 scam). Style: Formal and persuasive, but suspiciously flattering and manipulative. It mimics official language to appear legitimate, yet it contains telltale signs of deception and illegitimacy.</p>
          <p>Content: The letter is from someone claiming to be Prince Joe Eboh, Chairman of the Contract Award Committee of the Niger Delta Development Commission (NDDC). The sender explains that a surplus of $25 million USD from petroleum contracts needs to be discreetly transferred out of Nigeria. Due to local laws prohibiting civil servants from holding foreign accounts, they seek a foreign partner to temporarily receive the funds. The recipient is promised 20% of the amount for their cooperation, while 75% will go to committee members and 5% for expenses. The sender requests personal and banking details from the recipient to initiate the transfer. The letter emphasizes secrecy and urgency, aiming to complete the transaction in 21 working days. The sender thanks the recipient for their cooperation and assures that God will reward their kindness and patience.</p>
          <p>LLM response: Received via email on 26 January 2013. Original subject: Transfer of NDDC Fund. The text of the letter is a body of text within an email. There is no signature or introduction. The letter is directed at a generic recipient, as it uses a common greeting of “Dear Sir/Madam”. There is no return address. The language and style of the letter are formal. The tone is persuasive, polite, and courteous, but also suspiciously flattering and manipulative. The tone is urgent and secretive. The overall appearance of the letter indicates that it is likely a scam. The text of the letter indicates that the scam is likely an advance-fee scam and a phishing attempt. There is an intention to part the recipient from their money, likely in exchange for nothing. There is a risk of a penalty for attempting to transfer money out of the country without a proper license. There is a risk of legal prosecution for attempting to do as the sender requests because the sender’s requests are likely illegal. There is a likelihood of being added to a spam filter for a short time because of the word JOE in the subject. There is a high risk of the message being phishing for personal or banking information. The message could contain malware. The message is likely a response to a job inquiry. The message is likely a response to a request for money. The message could be part of a pyramid scheme. The message is likely a scam ....</p>
        </sec>
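Some of the PAN measures reported above can be sketched minimally. Treating a prediction of exactly 0.5 as unanswered (for c@1) and as a negative (for F1) is an assumption about the exact PAN implementation, and PAN reports a complement of the Brier loss so that higher is better:

```python
def c_at_1(y_true, y_pred):
    """c@1: accuracy that rewards leaving hard cases unanswered.
    A prediction of exactly 0.5 counts as unanswered (assumption)."""
    n = len(y_true)
    n_correct = sum(1 for t, p in zip(y_true, y_pred)
                    if p != 0.5 and (p > 0.5) == bool(t))
    n_unanswered = sum(1 for p in y_pred if p == 0.5)
    return (n_correct + n_unanswered * n_correct / n) / n

def f1_score(y_true, y_pred):
    """Standard F1 over the positive class; 0.5 counts as negative here."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p > 0.5)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p > 0.5)
    fn = sum(1 for t in y_true if t) - tp
    return 2 * tp / (2 * tp + fp + fn)

def brier_loss(y_true, y_pred):
    """Mean squared error between predicted probability and truth."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 1, 0, 0]
y_pred = [0.9, 0.5, 0.2, 0.1]
print(c_at_1(y_true, y_pred))            # 0.9375
print(round(f1_score(y_true, y_pred), 3))
print(round(brier_loss(y_true, y_pred), 4))  # 0.0775
```

Note how the unanswered case (0.5) raises c@1 relative to plain accuracy, matching the "handled uncertainty cautiously" reading of the scores above.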
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
<p>The moderately high overall performance of the finetuned Llama 3 shows that the data-centric approach
is fruitful in breaking AI detectors. However, the approach’s limitations include repeating the given
input and compromised reasoning in the responses when attempting to generate human-like text,
as shown in the sample output above. Despite these limitations, the findings show that the study’s
approach is competitive on the Eloquent leaderboard. The findings suggest that companies
that rely solely on detectors must evolve, as existing detectors are insufficient. The insights from
this data-centric approach should help detectors toward accurate labeling that reduces the risk of deception.
For instance, the finetuned model, available on HuggingFace, can be used to generate more samples
that can be added to a detector’s training set labeled as AI-generated, so that detectors learn from
their own blind spots. Additional inspection includes finding which features helped fool the detectors
and updating the detectors with new feature weights to detect these subtle human-like generations.
Therefore, the study’s insights can benefit companies labeling AI-generated content in developing detectors,
in support of the EU’s suggestion to tackle misinformation in sensitive domains such as news and education.</p>
<p>Table 1 dataset descriptions: 1,087 paragraphs of text authored by a human (Bootstrap); 917 text samples related to real news collected from PolitiFact (FakeNewsNet); 21,417 true articles obtained from different legitimate news sites (ISOT); 10,269 statements from online platforms (LIAR).</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
<p>The author would like to acknowledge the university’s High Performance &amp; Scientific Computing Team
for providing access to an Nvidia H100 GPU for finetuning and evaluating the Llama 3 and GPT-2
LLMs.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the author(s) used Grammarly for grammar and spelling
checking. After using this tool/service, the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
<p>content, social context, and spatiotemporal information for studying fake news on social media,
Big Data 8 (2020) 171–188.
[16] H. Ahmed, I. Traore, S. Saad, Detecting opinion spams and fake news using text classification,
Security and Privacy 1 (2018) e9.
[17] H. Ahmed, I. Traore, S. Saad, Detection of online fake news using n-gram analysis and machine
learning techniques, in: Intelligent, Secure, and Dependable Systems in Distributed and Cloud
Environments: First International Conference, ISDDC 2017, Vancouver, BC, Canada, October 26-28,
2017, Proceedings 1, Springer, 2017, pp. 127–138.
[18] Y. Liu, et al., Assessing text readability and quality with language models (2020).
[19] S. Dudy, S. Bedrick, Are some words worth more than others?, in: Proceedings of the Conference
on Empirical Methods in Natural Language Processing, volume 2020, 2020, p. 131.
[20] A. Das, R. M. Verma, Can machines tell stories? A comparative study of deep neural language
models and metrics, IEEE Access 8 (2020) 181258–181292.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khazatsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
,
<article-title>Detectgpt: Zero-shot machine-generated text detection using probability curvature</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>24950</fpage>
          -
          <lpage>24962</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dürlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guillou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <article-title>Eloquent clef shared tasks for evaluation of generative language model quality</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>459</fpage>
          -
          <lpage>465</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Halvani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Graner</surname>
          </string-name>
          ,
          <article-title>On the usefulness of compression models for authorship verification</article-title>
          ,
          <source>in: Proceedings of the 12th international conference on availability, reliability and security</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          ,
          <article-title>Authorship verification as a one-class classification problem</article-title>
          ,
          <source>in: Proceedings of the twenty-first international conference on Machine learning</source>
          ,
          <year>2004</year>
          , p.
          <fpage>62</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
<surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Generalizing unmasking for short texts</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>654</fpage>
          -
          <lpage>659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Barzilay</surname>
          </string-name>
          ,
          <article-title>The limitations of stylometry for detecting machine-generated fake news</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>46</volume>
          (
          <year>2020</year>
          )
          <fpage>499</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uchendu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
,
<string-name>
<given-names>T.</given-names>
<surname>Le</surname>
</string-name>
,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Turingbench: A benchmark environment for turing test in the age of neural text generation</article-title>
          ,
          <source>arXiv preprint arXiv:2109.13296</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwarzschild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cherepanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goldblum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geiping</surname>
          </string-name>
,
<string-name>
<given-names>T.</given-names>
<surname>Goldstein</surname>
</string-name>
,
          <article-title>Spotting llms with binoculars: Zero-shot detection of machine-generated text</article-title>
          ,
          <source>arXiv preprint arXiv:2401.12070</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Y.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
,
<article-title>Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text</article-title>
          ,
          <source>arXiv preprint arXiv:2306.05540</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sadasivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balasubramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feizi</surname>
          </string-name>
          ,
          <article-title>Can AI-generated text be reliably detected?</article-title>
          ,
          <source>arXiv preprint arXiv:2303.11156</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Duckworth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <article-title>Automatic detection of generated text is easiest when humans are fooled</article-title>
          ,
          <source>arXiv preprint arXiv:1911.00650</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Hancock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naaman</surname>
          </string-name>
          ,
          <article-title>Human heuristics for AI-generated language are flawed</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>120</volume>
          (
          <year>2023</year>
          )
          <fpage>e2208839120</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dürlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the “Voight-Kampff” generative AI authorship verification task at PAN and ELOQUENT 2024</article-title>
          , in:
          <source>Working Notes of the 25th Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September 2024</source>
          , volume
          <volume>3740</volume>
          , CEUR-WS,
          <year>2024</year>
          , pp.
          <fpage>2486</fpage>
          -
          <lpage>2506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Voight-Kampff generative AI authorship verification based on T5</article-title>
          ,
          <source>Working Notes of CLEF</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mahudeswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media</article-title>
          ,
          <source>Big Data</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>171</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>