<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Binoculars, BART, and Adversaries: Multi-Faceted AI Text Detection for PAN 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Ostrower</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Poonam Dongare</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mahitha Thekkinkattuvalappil Unnikrishnan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>225 North Avenue, Atlanta, 30332</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With each passing year, AI text generation methods continue to improve, blurring the line between human and machine. Our submissions recreate existing methods on the competition data to achieve high scores on subtask 1 of the Voight-Kampff AI detection sensitivity task. The PAN competition is part of the CLEF conference program [1], allowing contestants from across the globe to compete on an array of natural language processing tasks. This year the PAN workshop offered a generative AI text detection task [2]; this paper focuses on subtask 1, in which, given a (potentially obfuscated) text, the goal is to decide whether it was written by a human or an AI, with submissions evaluated through the TIRA platform [3]. The committee provides a dataset that participants can use to train their submissions. It is constructed from 13 LLMs that rephrased human-authored texts, so that each human-authored text has at least one LLM counterpart. This year the topics for the training set spanned essays, news, and fiction. As a twist, the evaluation set also includes new models and other obfuscations not present in the provided dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2025</kwd>
        <kwd>Adversarial fine tuning</kwd>
        <kwd>AI text detection</kwd>
        <kwd>BART</kwd>
        <kwd>Binoculars</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. EDA</title>
        <p>We began by exploring the dataset to look for patterns that might help distinguish human-written text
from machine-generated text. First, we examined the word count distribution per author. Most texts
ranged from 2,000 to 4,000 words, except for the Alpaca models, which were outliers (see Figure 1).
However, this didn’t reveal any clear signal for identifying human authorship.</p>
      <p>Next, we looked at the usage of stop words using NLTK’s stop word list (see Figure 3). The different
authors, both human and machine, used stop words at very similar rates, again providing little insight
for distinguishing authorship.</p>
      <p>
        Finally, we applied a technique from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] involving the "rank" of each word using GPT-2. For each
text, we used a sliding window approach: starting with the first two words, then the first three, and so
on, up to the model’s maximum context window. At each step, we fed the current window into GPT-2
and checked how likely it thought the actual next word was. If the true next word was GPT-2’s top
prediction, we counted it as a match.
      </p>
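      <p>The sketch below illustrates this check with the Hugging Face transformers API. For a causal model such as GPT-2, a single forward pass over the full text is equivalent to the incremental-window procedure, since each position only attends to earlier tokens; the helper name top1_match_rate is illustrative rather than our exact implementation.</p>
      <preformat>
# Sketch of the GLTR-style predictability check [4]: at each position, is the true
# next token the model's top-1 prediction? Assumes the standard transformers API.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def top1_match_rate(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    logits = model(ids).logits[:, :-1]    # predictions for positions 1..L-1
    predicted = logits.argmax(dim=-1)     # top-1 next-token guess at each position
    actual = ids[:, 1:]                   # the tokens that actually follow
    return (predicted == actual).float().mean().item()
      </preformat>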
      <p>We found that human-written text had the lowest rate of top-ranked predictions by GPT-2, while
texts from the GPT-2-based instruct models had the highest overlap with GPT-2’s predictions. This
suggested that humans are less predictable by GPT-2 and that this kind of "next-word predictability"
could be a useful signal for detecting machine-generated text. This insight helped guide our approach
to the competition: we opted to incorporate fine-tuned language models and even used the method in
the token cohesiveness solution (Section 2.2).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Binocular Score based XGBoost model</title>
        <p>In the landscape of AI-generated text detection, text generated by modern LLMs is thought to be difficult
to detect, as both humans and language models exhibit a wide range of complex behaviors.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Binocular Score</title>
          <p>
            This section provides background on the definition of the Binoculars score and its calculation. The Binoculars
score [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] method introduces an innovative idea for identifying AI-generated content by leveraging the
difference between two nearly identical language models: an observer model and a performer model.
Perplexity measures how confused a language model is when trying to predict the next word, and the
Binoculars score is built on this concept. As stated by Hans et al. (2024), it is defined as the ratio between
the perplexity and the cross-perplexity, where the latter refers to how perplexed one model is when
evaluated against the predictions of the other. Because one language model can understand and predict
the output of another language model more easily than it can human-written text, this pattern helps us
identify whether the content is human-written or AI-generated.
          </p>
          <p>Based on the approach discussed above, below are the calculations for the perplexity, cross-perplexity,
and Binoculars score. As described by Hans et al. (2024), a character string $s$ can be divided into a
sequence of tokens and mapped to a list of token indices $\vec{x}$ through a tokenizer $T$. Each token index $x_i$
denotes the ID of the $i$-th token, where $x_i \in V = \{1, 2, \ldots, n\}$ and $V$ is the vocabulary of the language
model. Given this input sequence, a language model $\mathcal{M}$ produces a probability distribution over the
vocabulary to predict the next token:</p>
          <p>$$\mathcal{M}(T(s)) = \mathcal{M}(\vec{x}) = Y, \qquad Y_{ij} = P(v_j \mid x_{0:i-1}) \;\;\text{for all } j \in V.$$</p>
          <p>For simplicity, we write $\mathcal{M}(s)$ instead of $\mathcal{M}(T(s))$, assuming the tokenizer applied to the text $s$
is the same as the one used to train the language model $\mathcal{M}$. Log-perplexity (log PPL) is a metric used to
evaluate how well a language model predicts a given sequence of text. It is calculated by averaging the
negative log-likelihood of each token in the input:
$$\log \mathrm{PPL}_{\mathcal{M}}(s) = -\frac{1}{L}\sum_{i=1}^{L} \log\!\left(Y_{i\,x_i}\right),$$
where $\vec{x} = T(s)$, $Y = \mathcal{M}(\vec{x})$, and $L$ is the total number of tokens in the string $s$.</p>
          <p>
            This approach [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] also evaluates how unexpected the output of one model is when interpreted by
another model. To measure the relationship between two models’ predictions on the same input, we
adopt the concept of cross-perplexity, as described by Hans et al. (2024). This metric measures how
different the outputs of two language models, $\mathcal{M}_1$ and $\mathcal{M}_2$, are by calculating the average cross-entropy
over each token of the tokenized input $s$:
          </p>
          <p>$$\log \text{X-PPL}_{\mathcal{M}_1,\mathcal{M}_2}(s) = -\frac{1}{L}\sum_{i=1}^{L} \mathcal{M}_1(s)_i \cdot \log\!\left(\mathcal{M}_2(s)_i\right),$$
where the symbol $\cdot$ indicates the dot product between the two vector-valued outputs corresponding to
the $i$-th token prediction.</p>
          <p>The Binoculars score $B$ acts as a refined version of perplexity, offering a normalized perspective on
text predictability. Specifically, it is defined as the ratio between standard perplexity and cross-perplexity.
Given two models, $\mathcal{M}_1$ and $\mathcal{M}_2$, and an input string $s$, the Binoculars score is given by:
$$B_{\mathcal{M}_1,\mathcal{M}_2}(s) = \frac{\log \mathrm{PPL}_{\mathcal{M}_1}(s)}{\log \text{X-PPL}_{\mathcal{M}_1,\mathcal{M}_2}(s)}.$$
In this formulation, the numerator indicates how surprising the input text is to model $\mathcal{M}_1$, based on
its perplexity score. The denominator shows how unexpected model $\mathcal{M}_1$ finds the predictions made
by model $\mathcal{M}_2$. If $\mathcal{M}_1$ and $\mathcal{M}_2$ are more alike than either is to a human, we would expect a human’s
predictions to differ more from $\mathcal{M}_1$’s than $\mathcal{M}_2$’s predictions do.</p>
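          <p>The sketch below shows how this score can be computed with Hugging Face causal language models. It is a minimal illustration rather than our exact pipeline: the helper name binoculars_score is ours, and "gpt2" is used for both roles to mirror the memory-constrained setup described in the next subsection.</p>
          <preformat>
# Minimal sketch of the Binoculars score of Hans et al. (2024).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
observer = AutoModelForCausalLM.from_pretrained("gpt2").eval()
performer = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def binoculars_score(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    obs_logits = observer(ids).logits[:, :-1]    # next-token predictions for positions 1..L-1
    perf_logits = performer(ids).logits[:, :-1]
    targets = ids[:, 1:]

    # log-perplexity on the true tokens (numerator)
    log_ppl = F.cross_entropy(obs_logits.transpose(1, 2), targets)

    # cross-perplexity: one model's probabilities dotted with the other's log-probabilities
    probs = F.softmax(obs_logits, dim=-1)
    logprobs = F.log_softmax(perf_logits, dim=-1)
    log_xppl = -(probs * logprobs).sum(dim=-1).mean()

    return (log_ppl / log_xppl).item()
          </preformat>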
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Model architecture and feature integration</title>
          <p>We employ an ensemble learning approach based on the XGBoost algorithm to enhance predictive
performance for the AI-generated text detection task. The ensemble integrates a rich and heterogeneous
set of features designed to capture lexical, semantic, syntactic, and stylistic aspects of the text.
Specifically, we include the Binoculars score, a normalized metric that quantifies the divergence between
language models by comparing perplexity with cross-perplexity, thereby capturing differences in how
the models interpret the text.</p>
          <p>
            To calculate the Binoculars score we utilized the GPT-2 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] model from the Hugging Face library as both
the observer and the performer model. We explored other models during our preliminary experimentation,
such as "falcon-7b" [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] and "TinyDeepSeek-1.5b" [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], but due to limitations in memory resources, the GPT-2
[
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] model was chosen as the final architecture for calculating the Binoculars score.
          </p>
          <p>In addition, Term Frequency–Inverse Document Frequency (TF-IDF) is utilized to measure term
significance across the entire corpus, producing a sparse vector format that highlights informative and
distinguishing terms. To complement this with richer semantic understanding, we extract sentence-level
embeddings from BERT, which provides dense, pre-trained representations encoding both syntactic
and semantic information.</p>
          <p>
            We used the "bert-base-uncased" [
            <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
            ] checkpoint from Hugging Face to extract the BERT
embeddings. The BERT base uncased model is not specifically optimized for sentence level tasks. In this
architecture, the [CLS] token is mainly pre-trained to support classification task. Hence,to capture more
richer sentence semantics we decided to use mean pooling over last hidden layer’s token embeddings.
          </p>
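          <p>A minimal sketch of the mean-pooling step is given below, assuming the standard transformers API; padding tokens are masked out so they do not dilute the average. The function name sentence_embedding is illustrative.</p>
          <preformat>
# Mean-pooled sentence embeddings from bert-base-uncased.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def sentence_embedding(texts) -> torch.Tensor:
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state               # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()   # 1 for real tokens, 0 for padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)
          </preformat>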
          <p>We integrated stylometric features such as the average length of sentences and the cumulative
length of the text to reflect patterns inherent in the author’s writing style. Furthermore, linguistic
features—most notably various readability metrics (e.g., Flesch-Kincaid)—are computed to assess the
complexity and accessibility of the language used.</p>
          <p>By combining these diverse feature categories, the XGBoost ensemble utilizes both lexical-level
attributes and high-level linguistic representations, thereby improving the precision and contextual
relevance of its predictions.</p>
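          <p>The sketch below illustrates how these feature groups can be concatenated and passed to XGBoost. It reuses the binoculars_score and sentence_embedding helpers sketched above; the textstat readability call, the TF-IDF vocabulary size, and the XGBoost hyperparameters are illustrative assumptions rather than our tuned configuration.</p>
          <preformat>
# Feature integration sketch: TF-IDF + BERT embeddings + Binoculars + stylometric/readability
# features, concatenated into one matrix for an XGBoost classifier.
import numpy as np
import textstat
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

def stylometric_features(text: str) -> list:
    sentences = [s for s in text.split(".") if s.strip()]
    avg_sentence_len = np.mean([len(s.split()) for s in sentences]) if sentences else 0.0
    return [avg_sentence_len, len(text), textstat.flesch_kincaid_grade(text)]

def build_features(texts, tfidf: TfidfVectorizer, fit: bool = False):
    tfidf_mat = tfidf.fit_transform(texts) if fit else tfidf.transform(texts)
    dense = np.array([[binoculars_score(t)] + stylometric_features(t) for t in texts])
    bert_emb = sentence_embedding(list(texts)).numpy()
    return hstack([tfidf_mat, csr_matrix(np.hstack([dense, bert_emb]))]).tocsr()

tfidf = TfidfVectorizer(max_features=20000)
# X_train = build_features(train_texts, tfidf, fit=True)
# clf = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
# clf.fit(X_train, train_labels)
          </preformat>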
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Token Cohesiveness</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] presents token cohesiveness as a novel feature, defined as the expected semantic variation between
a given input text $x$ and its modified version $\tilde{x}$, generated by randomly removing a small subset
of tokens. Due to the causal self-attention mechanism used by large language models (LLMs) in text
generation, each token maintains a strong contextual relationship with preceding tokens. As a result,
AI-generated text exhibits greater token cohesiveness compared to human-written content. The study
incorporates token cohesiveness alongside several established zero-shot detection methods to enhance
text classification and AI-authorship verification.
        </p>
        <p>
          Based on the approach described in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], we computed token cohesiveness as a feature in our analysis.
For a given input text, we generated multiple perturbed versions by systematically removing a fixed
proportion of tokens. The token cohesiveness score is determined by calculating the average negative
BARTScore [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] between the original text and its modified copies. BARTScore [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] measures the sum
of weighted log-probabilities of the tokens in $y$, conditioned on the preceding tokens in $y$ and another
reference text $x$. In our framework, $y$ represents the perturbed text, while $x$ corresponds to the
original input text. Unlike the original weighting scheme, we assign equal weight to all tokens in the
evaluation process.
        </p>
        <p>
          $$\mathrm{BARTScore}(y \mid x, \theta) = \sum_{t=1}^{m} \omega_t \log p(y_t \mid y_{&lt;t}, x, \theta) \tag{1}$$
          $$\mathrm{TC}(x) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{DIFF}(x, \tilde{x}_i) \tag{2}$$
          DIFF is a semantic difference metric; here we use the negative BARTScore. The paper proposes a
dual-channel paradigm for text classification: the input text is passed to an upper channel to calculate
token cohesiveness, while a lower channel makes a prediction using an existing zero-shot detector. The
two scores are combined as follows:
          $$f(x) = \begin{cases} \mathrm{TC}(x) \times d(x), &amp; \text{if } d(x) \geq 0 \\ -\,\mathrm{TC}(x) \times d(x), &amp; \text{if } d(x) &lt; 0 \end{cases} \tag{3}$$
          where $d(x)$ denotes the score of the underlying zero-shot detector. We use 3 existing zero-shot detectors in our work: Likelihood [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], LRR [14], and
FastDetectGPT [15]. Token cohesiveness scores combining each of the zero-shot detectors with the negative
BARTScore are the features created using this approach.
        </p>
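        <p>A hedged sketch of this feature follows. The BART checkpoint, the number of perturbed copies, the deletion rate, and the word-level (rather than token-level) deletion are simplifying assumptions on our part; equal token weights follow the description above.</p>
        <preformat>
# Token cohesiveness: average negative BARTScore between a text and lightly perturbed copies.
import random
import torch
import torch.nn.functional as F
from transformers import BartTokenizer, BartForConditionalGeneration

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

@torch.no_grad()
def bart_score(src: str, tgt: str) -> float:
    """Mean log-probability of tgt tokens given src, with equal token weights."""
    enc = tok(src, return_tensors="pt", truncation=True)
    dec = tok(tgt, return_tensors="pt", truncation=True)
    out = bart(input_ids=enc.input_ids, labels=dec.input_ids)
    logprobs = F.log_softmax(out.logits, dim=-1)
    token_lp = logprobs.gather(-1, dec.input_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

def token_cohesiveness(text: str, n_copies: int = 8, del_rate: float = 0.035) -> float:
    words = text.split()
    diffs = []
    for _ in range(n_copies):
        perturbed = " ".join(w for w in words if random.random() > del_rate)
        diffs.append(-bart_score(text, perturbed))  # DIFF = negative BARTScore
    return sum(diffs) / len(diffs)
        </preformat>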
        <p>In addition to token cohesiveness, the total number of tokens in a given text is also considered as a
distinguishing feature. [16] explores how stylistic and complexity-based attributes can be leveraged to
differentiate AI-generated text from human-authored content. Based on the findings of this study, the
complexity characteristics examined include the word count, representing the total number of words,
and the type-token ratio (TTR), which measures the proportion of unique vocabulary words relative to
the total word count.</p>
        <p>The stylistic features analyzed encompass several linguistic characteristics: the frequency of stop
words, determined using the NLTK predefined stopword list; the frequency of words absent from the
SentiWordNet dictionary; and the distribution of specific parts of speech (POS), including nouns (POS
tags: NNP, NNPS, NN, NNS), verbs (POS tags: VB, VBD, VBG, VBN, VBP, VBZ), and past-tense verbs
(POS tags: VBD, VBN). These features collectively contribute to the identification of differing patterns
between AI-generated and human-authored texts.</p>
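        <p>The sketch below computes these complexity and stylistic counts with NLTK; the normalization by total word count and the helper name are our illustrative choices.</p>
        <preformat>
# Stylistic/complexity features: word count, type-token ratio, stop-word frequency,
# frequency of words absent from SentiWordNet, and noun/verb/past-tense POS frequencies.
import nltk
from nltk.corpus import stopwords, sentiwordnet as swn

for pkg in ("punkt", "stopwords", "averaged_perceptron_tagger", "sentiwordnet", "wordnet"):
    nltk.download(pkg, quiet=True)

NOUNS = {"NNP", "NNPS", "NN", "NNS"}
VERBS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
PAST = {"VBD", "VBN"}
STOPS = set(stopwords.words("english"))

def stylistic_features(text: str) -> dict:
    tokens = nltk.word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    n = max(len(tokens), 1)
    return {
        "word_count": len(tokens),
        "type_token_ratio": len({t.lower() for t in tokens}) / n,
        "stopword_freq": sum(t.lower() in STOPS for t in tokens) / n,
        "non_sentiwordnet_freq": sum(not list(swn.senti_synsets(t)) for t in tokens) / n,
        "noun_freq": sum(tag in NOUNS for _, tag in tags) / n,
        "verb_freq": sum(tag in VERBS for _, tag in tags) / n,
        "past_verb_freq": sum(tag in PAST for _, tag in tags) / n,
    }
        </preformat>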
        <p>In [17], the concept of intrinsic dimensionality is introduced as a means of distinguishing between
AI-generated and human-generated text. The paper suggests that human-authored text exhibits a
higher intrinsic dimensionality compared to text produced by AI models. Although the original study
employs Persistent Homology Dimension Estimation for this calculation, we opt instead for Maximum
Likelihood Estimation (MLE) using the Scikit-learn library to determine the intrinsic dimension of the
given text and incorporate it as a feature in our analysis.</p>
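        <p>As an illustration of this feature, the sketch below shows one way to obtain a Levina-Bickel-style maximum-likelihood estimate over per-token embeddings using scikit-learn's NearestNeighbors; the embedding source and the neighborhood size k are assumptions, not our exact configuration.</p>
        <preformat>
# MLE intrinsic-dimension estimate (Levina-Bickel style) over a point cloud of token embeddings.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dimension(points: np.ndarray, k: int = 10) -> float:
    """points: (n_points, dim) array, e.g. per-token BERT embeddings of a single text."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    dists, _ = nn.kneighbors(points)
    dists = np.maximum(dists[:, 1:], 1e-12)   # drop self-distances, guard against zeros
    # local estimate: inverse mean log-ratio of the k-th neighbor distance to the closer ones
    log_ratios = np.log(dists[:, -1][:, None] / dists[:, :-1])
    local_dims = (k - 1) / log_ratios.sum(axis=1)
    return float(np.mean(local_dims))
        </preformat>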
        <p>Unfortunately, this approach produced a submission too large to comply with the competition’s
15 GB container limit and could not be submitted; it did, however, reach accuracy levels over 90% on the
provided validation datasets.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Adversarial Training</title>
        <p>Our adversarial approach closely mirrors that found in [18] and can be summarized in 3 steps:
• Data preparation: We take the human prompt, randomly sample an AI-generated prompt, and
paraphrase the AI prompt with an off-the-shelf paraphraser to create 3 separate texts.
• Paraphraser update: We collect the reward (correctness of the detector) on only the paraphrased
texts and use this reward to update the paraphraser.
• Detector update: We use all 3 samples (human, AI, paraphrased) to update the detector with a
logistic loss function.</p>
        <p>We collect tuples of (human, AI, paraphrased) texts 10 at a time and store them in our memory buffer due
to memory constraints. In our case the paraphraser was Llama 3.2-1B and the detector
was a roberta-large model for sequence classification with 2 labels. Both models used an AdamW
optimizer with a learning rate of 5e-6.</p>
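        <p>A minimal setup sketch follows; the Hugging Face model identifiers are the public checkpoints we believe correspond to this description, and the buffer variable is illustrative.</p>
        <preformat>
# Paraphraser (Llama 3.2-1B) and detector (roberta-large, 2 labels), both with AdamW at lr 5e-6.
import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          AutoModelForSequenceClassification)

para_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
paraphraser = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

det_tok = AutoTokenizer.from_pretrained("roberta-large")
detector = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

para_opt = torch.optim.AdamW(paraphraser.parameters(), lr=5e-6)
det_opt = torch.optim.AdamW(detector.parameters(), lr=5e-6)

# Memory buffer of (human, ai, paraphrased) tuples, collected 10 at a time.
memory_buffer = []
        </preformat>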
        <sec id="sec-2-3-1">
          <title>2.3.1. Training the paraphraser</title>
          <p>
            The reward is defined on the left-hand side of equation 4 as the predicted output of the detector model on the
paraphrased text $y$, and the log-probability of the paraphrase is defined on the right-hand side:
$$r(x, y) = D(y) \in [0, 1]; \qquad \log P_{\theta}(y \mid x) = \sum_{t=1}^{T} \log P_{\theta}(y_t \mid x, y_{1:t-1}) \tag{4}$$
The authors of [18] designate equation 5 as Clipped PPO with Entropy Penalty (cppo-ep) to
optimize the paraphraser. The importance sampling ratio $\rho(x, y)$ is defined as the ratio
between the new policy and the old policy; in this instance the policies are the paraphrasers.
The authors also add an entropy regularization term (eq. 7) that is "introduced to encourage the
paraphraser to explore a more diverse generation policy" [18]; here $\lambda$ is a coefficient that
controls the weighting of the advantage term and the generation term, and for our experiments we set it
to 0.2. Prior to completing the paraphraser update step we save the current paraphraser
($\pi_{\theta'}$) before applying equation 5 to update to the new policy ($\pi_{\theta}$). After normalizing each
individual reward with the mean and standard deviation of the entire collection in the memory buffer, we
trained and updated the model on each individual collection in the memory buffer. For the
clipping we set our $\epsilon$ to 0.002.
          </p>
          <p>$$\mathcal{L}_{\text{cppo-ep}}(\theta) = \mathbb{E}_{(x,y)\sim\pi_{\theta'}}\!\left[-\min\!\left(\mathrm{clip}\!\left(\rho(x,y),\, 1-\epsilon,\, 1+\epsilon\right),\; \rho(x,y)\right) A(x,y)\right] - \lambda H(\theta) \tag{5}$$
$$\rho(x,y) = \frac{\pi_{\theta}(y \mid x)}{\pi_{\theta'}(y \mid x)} \tag{6}$$
$$H(\theta) = \mathbb{E}_{(x,y)\sim\pi_{\theta'}}\!\left[-\pi_{\theta}(y \mid x)\, \log \pi_{\theta}(y \mid x)\right] \tag{7}$$
where $A(x,y)$ denotes the normalized reward (advantage).</p>
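          <p>The sketch below expresses the per-sample loss of equations 5-7 in PyTorch. It is a simplified reading of the objective: the log-probabilities are assumed to be sequence-level sums under the current and saved paraphrasers, the advantage is the normalized reward, and the entropy term is a sequence-level proxy.</p>
          <preformat>
# Clipped PPO with entropy penalty (cppo-ep) for one (x, y) pair; epsilon and lambda follow the text.
import torch

def cppo_ep_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                 advantage: torch.Tensor, epsilon: float = 0.002,
                 lam: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old.detach())      # importance sampling ratio (eq. 6)
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    ppo_term = -torch.min(clipped, ratio) * advantage    # clipped surrogate (eq. 5)
    entropy = -torch.exp(logp_new) * logp_new            # entropy proxy (eq. 7)
    return ppo_term - lam * entropy
          </preformat>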
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Training the detector</title>
          <p>Our loss function for the detector is a triple-weighted logistic loss defined in equation 8. As in
training the paraphraser, and due to memory constraints, our batch size is 1. Each observation contains 1 human
and 2 non-human texts; to keep the model from overweighting the non-human texts, the authors added a
$\gamma$ term, which we set to 0.5 in our experiments:
$$\mathcal{L}(\theta_D) = -\,\mathbb{E}_{h\sim\mathcal{H}}\!\left[\log D(h)\right] - \gamma\,\mathbb{E}_{m\sim\mathcal{M}}\!\left[\log\!\left(1 - D(m)\right)\right] - \gamma\,\mathbb{E}_{m\sim\mathcal{M}}\!\left[\log\!\left(1 - D(G(m))\right)\right] \tag{8}$$
where $\mathcal{H}$ and $\mathcal{M}$ denote the human-written and machine-generated texts and $G(m)$ is the paraphrased version of a machine text.</p>
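          <p>A sketch of the per-triple loss follows; placing the weight on the two non-human terms is our reading of equation 8, with gamma = 0.5 as stated above.</p>
          <preformat>
# Triple-weighted logistic loss for the detector on one (human, ai, paraphrased) triple.
import torch

def detector_loss(p_human: torch.Tensor, p_ai: torch.Tensor,
                  p_para: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Each argument is the detector's predicted probability that the text is human-written."""
    eps = 1e-8
    return (-torch.log(p_human + eps)
            - gamma * torch.log(1 - p_ai + eps)
            - gamma * torch.log(1 - p_para + eps))
          </preformat>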
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>The following metrics were used in evaluation of approaches:
• ROC-AUC: The area under the ROC (Receiver Operating Characteristic) curve.
• Brier: The complement of the Brier score (mean squared loss).
• C@1: A modified accuracy score that assigns non-answers (score = 0.5) the average accuracy of
the remaining cases.
• F1: The harmonic mean of precision and recall.
• F0.5u: A modified F0.5 measure (precision-weighted F measure) that treats non-answers (score =
0.5) as false negatives.
• The arithmetic mean of all the metrics above.</p>
      <p>• A confusion matrix for calculating true/false positive/negative rates.</p>
      <p>The adversarial method performed best on the validation set, obtaining a score of at least 0.992 in
each of the categories above.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to: check grammar and
spelling, and paraphrase and reword. After using these tools, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsivgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abassy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Ta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Elozeiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN</article-title>
          and
          <article-title>ELOQUENT 2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elstner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          ,
          <source>in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR</source>
          <year>2023</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Strobelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Gltr:
          <article-title>Statistical detection and visualization of generated text</article-title>
          , arXiv preprint arXiv:1906.04043 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwarzschild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cherepanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goldblum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geiping</surname>
          </string-name>
          , T. Goldstein,
          <article-title>Spotting llms with binoculars: Zero-shot detection of machine-generated text</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2401.12070. arXiv:2401.12070.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI Blog 1</source>
          (
          <year>2019</year>
          ). https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Almazrouei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Alobeidli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alshamsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cojocaru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Debbah</surname>
          </string-name>
          , E. Goffinet,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hesslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Launay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Malartic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mazzotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Noune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pannier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Penedo,</surname>
          </string-name>
          <article-title>The falcon series of open language models</article-title>
          ,
          <source>arXiv preprint arXiv:2311.16867</source>
          (
          <year>2023</year>
          ).
          <article-title>Describes Falcon-7B, Falcon-40B, and Falcon-180B open-source LLMs under Apache2.0 license.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] EQUES, TinyDeepSeek-1.5B, https://huggingface.co/EQUES/TinyDeepSeek-1.5B, accessed July
          <year>2025</year>
          ; 1.54B parameters, Apache-2.0 license.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brew</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Zero-shot detection of llm-generated text using token cohesiveness</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2409.16914. arXiv:2409.16914.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , G. Neubig, P. Liu, Bartscore:
          <article-title>Evaluating generated text as text generation</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.11520. arXiv:2106.11520.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khazatsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          , Detectgpt:
          <article-title>Zero-shot machine-generated text detection using probability curvature</article-title>
          , in: International Conference on Machine Learning, PMLR, 2023, pp. 24950–24962.
        </mixed-citation>
      </ref>
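      <ref id="ref14">
        <mixed-citation>[14] J. Su, T. Y. Zhuo, D. Wang, P. Nakov, Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text, 2023. URL: https://arxiv.org/abs/2306.05540. arXiv:2306.05540.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] G. Bao, Y. Zhao, Z. Teng, L. Yang, Y. Zhang, Fast-detectgpt: Efficient zero-shot detection of machine-generated text via conditional probability curvature, 2024. URL: https://arxiv.org/abs/2310.05130. arXiv:2310.05130.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Aich, S. Bhattacharya, N. Parde, Demystifying neural fake news via linguistic feature-based interpretation, in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 6586–6599. URL: https://aclanthology.org/2022.coling-1.573/.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] E. Tulchinskii, K. Kuznetsov, L. Kushnareva, D. Cherniavskii, S. Barannikov, I. Piontkovskaya, S. Nikolenko, E. Burnaev, Intrinsic dimension estimation for robust detection of ai-generated texts, 2023. URL: https://arxiv.org/abs/2306.04723. arXiv:2306.04723.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] X. Hu, P.-Y. Chen, T.-Y. Ho, Radar: Robust ai-text detection via adversarial learning, Advances in Neural Information Processing Systems 36 (2023) 15077–15095.</mixed-citation>
      </ref>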
    </ref-list>
  </back>
</article>