<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team SINAI-INTA at PAN 2025: Uncovering Machine Generated Text with Linguistic Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Jimeno-Gonzalez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Martínez-Cámara</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noelia Fernandez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Díaz-García</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Alfonso Ureña-López</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INTA</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UC3M</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>UJA</institution>
          ,
          <addr-line>Jaen</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Addressing the escalating text generation capabilities of large language models, PAN and the ELOQUENT Lab have introduced the Voight-Kampff Generative AI Authorship Verification task, which aims to distinguish between human and machine-generated texts. In response, this paper proposes a lightweight approach that combines syntactic, structural, and lexical features with TF-IDF representations of the raw text. The method is designed to be computationally efficient, making it suitable for practical applications without requiring extensive resources. On the validation set, our approach outperforms the provided baselines, albeit by a modest margin.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2025</kwd>
        <kwd>Voight-Kampff Generative AI Authorship Verification</kwd>
        <kwd>Text classification</kwd>
        <kwd>AI-Generated Text Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Thanks to advances in large language models (LLMs), it is now possible to generate high-quality texts
for a wide variety of applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Language modeling has long been studied for both
language generation and comprehension (where language is understood as a complex system of expressions
governed by a set of grammatical rules), but it was not until the release of the ChatGPT model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that
this field became accessible to the general public. With this tool, one can extract information
(such as relationships or events), summarize texts, or generate original content, such as a poem or an
email.
      </p>
      <p>As these models continue to evolve, the line between machine-generated and human-written text has
become increasingly blurred: LLMs replicate not only grammatical structures but also style, tone,
rhetorical devices, and even domain-specific jargon.</p>
      <p>
        However, these advances raise significant challenges regarding the authenticity and regulation of
their use. Between January 1, 2022, and May 1, 2023, the relative number of synthetically generated
news articles increased by more than half (53.3%) on respected news websites [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. On disinformation
sites, this increase was 474%.
      </p>
      <p>
        This qualitative leap creates a paradox: while LLMs democratize access to creative tools, they also
erode traditional mechanisms of authorship attribution. Determining the authorship of a text, that is,
whether it was written by a human or a machine, has become a problem of unprecedented relevance.
These tools have the potential to be used for unethical purposes, such as plagiarism, the creation of fake
news, or spinning (mass production of messages), which can impact not only individuals but society as
a whole [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Moreover, regardless of whether LLMs are used maliciously, there is another issue: the hallucinations
produced by these models [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ], that is, fictitious statements presented as truths. These errors occur unpredictably,
and the problem becomes particularly severe
when an LLM is faced with tasks that require expert knowledge in a specific domain. The mere
possibility that a machine could have authored a given text underscores the importance of the task at
hand: accurately determining whether a text has been written by a human or a machine is becoming
increasingly relevant in everyday contexts.
      </p>
      <p>
        In its simplest form, the original problem is deciding whether a text was written by a human or a
machine. Methodologically, the problem is framed as a binary classification (human vs. AI). However,
this approach is deceptively simple. One of the greatest challenges is the statistical convergence between
human and artificial texts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Since LLMs are trained on vast amounts of human-written text, they
have learned not only syntactic structures but also stylistic patterns and cognitive biases, blurring a
boundary that might initially seem clear. Nevertheless, these models do not merely replicate
these patterns: they optimize them, potentially producing a stylistic perfection that detectors can exploit.
      </p>
      <p>
        To boost this area of research, the PAN 2025 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] workshop introduced the “Generative AI Authorship
Verification Task” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which is divided into two subtasks. Subtask 1 focuses on the robustness and sensitivity
of detection systems. In response to this challenge, we propose an architecture that combines
handcrafted linguistic features with textual representations. Specifically, we integrated syntactic, structural,
and lexical features alongside TF-IDF representations of the raw text. These features are then used in a
stacking ensemble classifier comprising Random Forest, XGBoost, and LinearSVC as base learners, with
Logistic Regression serving as the final estimator. This traditional machine learning pipeline allows for
interpretability and flexibility while achieving competitive performance.
      </p>
      <p>The central objective of this work is to develop a reliable and interpretable method for distinguishing
between human-written and AI-generated text. The resulting approach
supports both effective classification and transparency, addressing key challenges in the
field of generative AI content detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        One of the most accessible and widely used approaches for detection is the use of statistics based on
linguistic features [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]. This set of features forms the foundation of the approach we will explore
in the present work.
      </p>
      <p>The clear advantage of this approach lies in the fact that it relies solely on the text to be classified, which
facilitates its practical application in contexts where the generative model is not accessible. However, it
is important to note that its effectiveness often depends on the availability of a representative reference
corpus containing both human and machine-generated texts. This allows for proper calibration of
decision thresholds and validation of the robustness of the identified patterns.</p>
      <p>
        Among the features analyzed are lexical density (the proportion of content words among all words), the
average number of sentences per paragraph, the distribution across grammatical categories (POS tags),
and the atypical frequency of certain k-grams (contiguous sequences of k words). These characteristics
can capture subtle differences in style, syntactic coherence, or lexical variability between human and
generative model texts [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ].
      </p>
      <p>When trained on labeled corpora, classifiers built on these statistical features have demonstrated
competitive accuracy in distinguishing between human and machine-generated texts.</p>
      <p>Nevertheless, these statistical measures are not the only ones used in the task of classifying texts
generated by language models. Following the taxonomy proposed by Wu et al. (2025) [14], statistical
methods can be categorized into two major groups: white-box and black-box approaches.</p>
      <p>White-box methods [15, 16, 17] require direct access to the original model, meaning access to its
architecture and raw parameters. These variables are especially valuable for understanding how text
is generated and how the model selects certain words or structures—i.e., for analyzing the model’s
decision-making processes in detail.</p>
      <p>The statistics derived from this type of analysis are crucial for attributing authorship of a text to a
specific model, as they rely on the model’s internal outputs (such as logits) and architectural behavior.
Among these metrics are: Rank [18], which indicates the position of a token in the ordered list of logits
(with higher-ranking tokens being considered more probable by the model); Log-likelihood [19], which
refers to the sum of the log-probabilities of each token given its preceding context; and the Log-Likelihood
Log-Rank Ratio (LRR) [20], which combines the previous two metrics for a more robust classification.</p>
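      <p>As an illustration of how these white-box statistics relate to each other, the following minimal sketch computes them from per-token values. It assumes that the probability and the 1-based rank of each observed token have already been obtained from a language model; the function names and the toy numbers are hypothetical, and the ratio follows our reading of the definition in [20] (absolute log-likelihood divided by the summed log-rank).</p>
      <preformat>
```python
import math

def log_likelihood(token_probs):
    """Sum of the log-probabilities of each observed token given its context."""
    return sum(math.log(p) for p in token_probs)

def summed_log_rank(token_ranks):
    """Sum of the logs of each token's 1-based rank in the model's sorted predictions."""
    return sum(math.log(r) for r in token_ranks)

def likelihood_rank_ratio(token_probs, token_ranks):
    """Absolute log-likelihood divided by summed log-rank (inf if every rank is 1)."""
    denom = summed_log_rank(token_ranks)
    if denom == 0:
        return math.inf
    return -log_likelihood(token_probs) / denom

# Toy values: three tokens with probabilities 1/2, 1/4, 1/8 and ranks 1, 2, 4.
probs = [0.5, 0.25, 0.125]
ranks = [1, 2, 4]
ratio = likelihood_rank_ratio(probs, ranks)
```
      </preformat>
      <p>With these toy values the log-likelihood is six times the log of 0.5 and the summed log-rank is three times the log of 2, so the ratio evaluates to 2.</p>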
      <p>During model development, we also analyzed perplexity, a metric that measures how well a language
model predicts a sequence of text. In other words, it quantifies the model’s level of
“surprise” when processing a given input. We used this metric to test the hypothesis proposed
by Li et al. (2024) [21], which states that automatically generated texts exhibit increased perplexity after
undergoing a rewriting process, due to a greater deviation from the linguistic distributions expected by
the model. In our experiments, the results were not encouraging.</p>
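      <p>For reference, perplexity can be sketched as the exponential of the negative mean log-probability of the observed tokens. The snippet below is only an illustration under that standard definition: the token probabilities are made-up values that would normally come from a scoring language model, and the function name is hypothetical.</p>
      <preformat>
```python
import math

def perplexity(token_probs):
    """Exponential of the negative mean log-probability of the observed tokens."""
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)

# Toy probabilities for a three-token sequence; lower probabilities mean more "surprise".
ppl = perplexity([0.5, 0.25, 0.125])
```
      </preformat>
      <p>Comparing this value before and after a rewriting step is one way the hypothesis above could be examined.</p>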
      <p>Although white-box methods are highly effective in detecting texts generated by the model they
are designed for, their performance significantly decreases when analyzing texts generated by other
models.</p>
      <p>Complementary to white-box strategies, black-box methods [22, 23, 24] offer a more flexible yet
computationally demanding alternative for text classification tasks. Black-box statistical methods
are employed in scenarios where direct access to the internal parameters of the generative model is
unavailable. This approach, characterized by its greater methodological diversity, relies exclusively on
the analysis of the generated text itself, without requiring any supplementary information about the
underlying model.</p>
      <p>However, as noted above, one of the primary limitations of black-box methods is their computational
cost. The complexity of the required analyses can result in high latency,
limiting their suitability for real-time applications or contexts requiring a rapid response.</p>
      <p>
        Emerging techniques for detecting text generated by language models include digital
watermarking [25, 26] and deep neural network-based approaches [
        <xref ref-type="bibr" rid="ref11">27, 11, 28</xref>
        ], notably leveraging large language
models (LLMs).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. PAN dataset</title>
        <p>Released by the PAN shared task organizers, the PAN dataset contains both human-authored and
AI-generated text, with a twist: the LLMs were instructed to change their style and mimic a specific
human author. It includes a total of 23,707 samples, consisting of 9,101 (38%) human-authored texts and
14,606 (62%) AI-generated texts produced using twenty-two different LLMs.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Pre-Processing</title>
        <p>We analyzed the data to look for patterns that characterize human-written and machine-generated
texts.</p>
        <sec>
          <title>3.2.1. Lexical Complexity and Vocabulary</title>
          <list list-type="bullet">
            <list-item>
              <p>Lexical Diversity: A central concept in quantitative linguistics, lexical diversity assesses the range and
variability of vocabulary used in a text sample [29]. In our study, this measure helps identify patterns
of lexical richness in texts produced by humans versus generative models. As shown in Figure 2,
human-authored texts tend to display a more centered distribution with less dispersion at the
extremes. In contrast, AI-generated texts show a higher concentration at elevated diversity levels,
which may be interpreted as more uniform and stylistically refined output.</p>
            </list-item>
            <list-item>
              <p>Lexical Frequency: To evaluate the lexical relevance of terms within each document, we calculated
the average TF-IDF (Term Frequency–Inverse Document Frequency) score. This metric weights
term frequency according to its relative presence in the corpus, highlighting the most distinctive
linguistic elements of each text. Its inclusion captures the balance between common words and
infrequent terms that may provide unique semantic value. No major differences were found
between the two distributions, aside from the recurring observation that human-written texts tend
to be less polarized. In addition, human texts exhibit higher average TF-IDF values than
machine-generated ones.</p>
            </list-item>
          </list>
        </sec>
        <sec id="sec-3-2-1">
          <title>3.2.2. Text Structure</title>
          <p>In this and the following section, we employ the spaCy natural language processing
library [30]. Specifically, we use spaCy’s built-in part-of-speech (POS) tagger, which is integrated
into the language models provided by the library (in our case, en_core_web_sm for English).</p>
          <list list-type="bullet">
            <list-item>
              <p>Average Sentence Length: Calculated as the mean number of words per sentence, this metric
provides insight into the structural complexity of the text.</p>
            </list-item>
            <list-item>
              <p>Average Word Length: Measures the average number of characters per word. Longer words are
generally associated with more technical or sophisticated vocabulary.</p>
            </list-item>
            <list-item>
              <p>Total Number of Sentences: This feature allows control over the overall length of the text, which
may affect the stability of other computed metrics.</p>
            </list-item>
          </list>
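          <p>The structural measures above, together with one simple lexical diversity measure, can be sketched as follows. This is an illustration only: it replaces spaCy’s sentence segmentation and tokenization with naive regular expressions, and it uses the type-token ratio as a stand-in for lexical diversity, since the exact diversity measure is not specified here. The function and key names are hypothetical.</p>
          <preformat>
```python
import re

def structure_features(text):
    """Return simple lexical and structural statistics for one document."""
    # Naive sentence split on ., ! or ? followed by whitespace (stand-in for spaCy).
    sentences = [s for s in re.split(r"[.!?]\s+", text.strip()) if s]
    # Naive word tokenization: runs of letters and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_sent = len(sentences)
    n_words = len(words)
    return {
        "num_sentences": n_sent,
        "avg_sentence_length": n_words / n_sent if n_sent else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Type-token ratio: distinct words divided by total words (assumed measure).
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }

features = structure_features("The cat sat. The dog ran fast!")
```
          </preformat>
          <p>For the two-sentence example, the sketch counts seven words and six distinct word forms, giving an average sentence length of 3.5 words.</p>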
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.3. Syntax and Part-of-Speech (POS)</title>
          <p>A relative frequency analysis of various grammatical categories was conducted using part-of-speech
tagging. The categories considered include determiners, adjectives, nouns, verbs, conjunctions
(coordinating and subordinating), adverbs, adpositions (prepositions and postpositions), auxiliaries, pronouns,
unrecognized tokens, and punctuation marks.</p>
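          <p>A minimal sketch of this relative frequency computation is given below. It assumes that the tagger emits Universal POS labels (the tag names in the list are our mapping of the categories above, with X standing for unrecognized tokens), and it takes a ready-made tag sequence as input so that it runs without a language model.</p>
          <preformat>
```python
from collections import Counter

# Universal POS tags assumed to correspond to the categories listed above.
POS_TAGS = ["DET", "ADJ", "NOUN", "VERB", "CCONJ", "SCONJ",
            "ADV", "ADP", "AUX", "PRON", "X", "PUNCT"]

def pos_relative_frequencies(tags):
    """Map each tracked POS tag to its share of all tokens in the document."""
    counts = Counter(tags)
    total = len(tags)
    return {tag: (counts[tag] / total if total else 0.0) for tag in POS_TAGS}

# Tags as a tagger might assign them to "The quick fox runs ."
example_tags = ["DET", "ADJ", "NOUN", "VERB", "PUNCT"]
freqs = pos_relative_frequencies(example_tags)
```
          </preformat>
          <p>With spaCy, the tag sequence could be obtained from the pos_ attribute of each token after loading en_core_web_sm with spacy.load.</p>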
          <p>The results (see Figure 2) show that texts generated by models exhibit higher usage of determiners,
nouns, adjectives, and adpositions. Conversely, human-written texts are characterized by more frequent
use of punctuation, adverbs, conjunctions, and pronouns.</p>
          <p>These differences suggest that human texts tend to show greater segmentation of ideas and a more
coordinated style, likely influenced by communicative intent and personal context (as reflected in
pronoun usage). In contrast, automatically generated texts display a more formal, informative, and
grammatically structured construction, reflected in a higher proportion of adpositions, determiners,
and nouns.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Design and Classification Approach</title>
        <p>Building on the previously discussed importance of statistical and linguistic features, the proposed
model aims to combine the explanatory power of these variables with the strength of automatic text
representation techniques such as TF-IDF. To achieve this, we designed a processing pipeline that
integrates both the full unigram vectorization of the textual content and the linguistic
variables described earlier, preserving their division into lexical, structural, and syntactic
components. The full set of variables is described in Table 1.</p>
        <p>Once preprocessed, all these features are concatenated into a single feature space and used as input
for a stacked ensemble classification model. This strategy allows the integration of different supervised
learning approaches to enhance the system’s robustness and generalization ability.</p>
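        <p>One way to build such a concatenated feature space is sketched below with scikit-learn and SciPy. The two example documents and the handcrafted feature values are invented for illustration, and averaging the TF-IDF weights over the non-zero terms of each document is our assumption about how a per-document mean TF-IDF score could be obtained.</p>
        <preformat>
```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The committee approved the proposal after a long debate.",
    "Honestly, I was not sure, but then we just left and laughed.",
]
# Hypothetical handcrafted features per document, e.g.
# [avg_sentence_length, avg_word_length, noun_ratio].
handcrafted = np.array([[9.0, 5.2, 0.33], [12.0, 3.6, 0.08]])

vectorizer = TfidfVectorizer()  # defaults: unigrams, L2 norm, smooth_idf=True
tfidf = vectorizer.fit_transform(docs)

# Mean TF-IDF weight over the non-zero terms of each document (assumed definition).
nonzero = np.maximum(tfidf.getnnz(axis=1), 1)
mean_tfidf = np.asarray(tfidf.sum(axis=1)).ravel() / nonzero

# Stack the sparse TF-IDF block next to the dense handcrafted block.
dense_part = np.column_stack([handcrafted, mean_tfidf])
X = hstack([tfidf, csr_matrix(dense_part)]).tocsr()
```
        </preformat>
        <p>The resulting matrix has one row per document and one column per vocabulary term plus one per handcrafted feature, and it stays sparse so that it can be passed to the classifiers directly.</p>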
        <p>The ensemble consists of the following base classifiers:</p>
        <list list-type="bullet">
          <list-item>
            <p>Random Forest: A decision tree-based model that introduces randomness in both data sampling
and feature selection, thereby reducing overfitting and capturing nonlinear feature interactions.</p>
          </list-item>
          <list-item>
            <p>XGBoost: A boosting technique that iteratively optimizes a set of trees by minimizing the loss
function, improving probabilistic classification performance.</p>
          </list-item>
          <list-item>
            <p>Support Vector Classifier (LinearSVC): A robust linear classifier, particularly effective in
high-dimensional spaces such as those generated by TF-IDF vectors.</p>
          </list-item>
        </list>
        <p>Textual features were vectorized
using TfidfVectorizer from the scikit-learn library with default parameter settings. These
settings correspond to a unigram-based representation (ngram_range = (1,1)), where each term is
weighted according to its term frequency-inverse document frequency (TF-IDF) value, normalized
using the L2 norm. No explicit constraints were placed on vocabulary size (max_features was
left unspecified), and all terms occurring in at least one document were included (min_df = 1,
max_df = 1.0). Binary weighting was disabled (binary = False), and standard smoothing was
applied (smooth_idf = True).</p>
        <p>The intermediate predictions generated by these base models are combined using a logistic regression
meta-model, which learns to weight the partial outputs to produce the final prediction. This architecture
leverages the complementarity of models with different inductive capabilities, balancing performance
and interpretability.</p>
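        <p>The stacking scheme described above can be sketched with scikit-learn’s StackingClassifier. This is not the submitted system: the data are synthetic, the hyperparameters are arbitrary, and GradientBoostingClassifier is used as a stand-in for XGBoost so that the example depends only on scikit-learn.</p>
        <preformat>
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the concatenated TF-IDF + linguistic feature matrix.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    # Stand-in for XGBoost (assumption made to avoid an extra dependency).
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("svc", LinearSVC()),
]
# Logistic regression learns how to weight the base learners' outputs,
# which are produced out-of-fold through 5-fold cross-validation.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)

proba_machine = stack.predict_proba(X_test)[:, 1]
accuracy = stack.score(X_test, y_test)
```
        </preformat>
        <p>If the xgboost package is installed, its XGBClassifier could take the place of the gradient boosting stand-in without other changes to the sketch.</p>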
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we present an evaluation of our AI-generated text detection experiments. The comparison
is conducted using the designated evaluation split of the dataset. We report results using well-established
performance metrics, as outlined in the official PAN@CLEF 2025 evaluation guidelines<sup>1</sup>.</p>
      <p>Table 2 compares our approach with the baselines on the PAN validation set
using five metrics (ROC-AUC, Brier score, C@1, F1, and F05U) and their arithmetic mean. For
each test instance, we predicted the corresponding label (human or machine-generated) and produced
calibrated probability scores, following the evaluation recommendations provided by the benchmark
organizers.</p>
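      <p>To make two of the less common metrics concrete, the sketch below computes a Brier-based score and C@1 from labels and probability scores. It rests on two assumptions that reflect our reading of the task conventions rather than anything stated in this paper: the Brier score is reported as one minus the mean squared error (so that higher is better), and a score of exactly 0.5 counts as a non-answer. The function names and toy values are hypothetical.</p>
      <preformat>
```python
def brier_complement(y_true, scores):
    """One minus the mean squared error between scores and 0/1 labels."""
    n = len(y_true)
    return 1.0 - sum((s - y) ** 2 for s, y in zip(scores, y_true)) / n

def c_at_1(y_true, scores):
    """C@1: accuracy that credits unanswered cases (score exactly 0.5)
    in proportion to the accuracy obtained on the answered ones."""
    n = len(y_true)
    n_unanswered = sum(1 for s in scores if s == 0.5)
    n_correct = sum(1 for s, y in zip(scores, y_true)
                    if s != 0.5 and (s > 0.5) == (y == 1))
    return (n_correct + n_unanswered * n_correct / n) / n

# Toy example: 1 = machine-generated, 0 = human; the third case is left unanswered.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.5, 0.6]
brier = brier_complement(y_true, scores)
c1 = c_at_1(y_true, scores)
```
      </preformat>
      <p>In the toy example two of the three answered cases are correct and one case is unanswered, so C@1 equals (2 + 1 · 2/4) / 4 = 0.625, while the Brier-based score is 1 − 0.66/4 = 0.835.</p>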
      <p>Notably, our approach attains near-perfect performance, with the highest or tied-highest score on
every metric: a ROC-AUC of 0.996, Brier score of 0.978, C@1 of 0.976, F1 of 0.981, F05U of 0.986, and an
overall mean of 0.983.</p>
      <p>When compared to the strongest baseline, the Linear SVM with TF-IDF features, our approach
matches its ROC-AUC (0.996) while improving
the Brier score (+0.027), F05U (+0.005), and the mean score (+0.005). This indicates that our method not only
preserves strong discriminative capability but also improves probability estimation and performance
on metrics that account for unanswered cases (such as F05U and C@1).</p>
      <p>In summary, the results highlight the efficacy of our model in outperforming both traditional
feature-based classifiers and more unconventional methods across a comprehensive set of evaluation metrics,
thereby establishing it as a robust and reliable solution for the task evaluated in the PAN validation set.</p>
      <p>Table 3 presents the performance of our approach on the PAN test set, as reported after the final
submission to the TIRA evaluation platform [31]. The model achieves strong and consistent results
across all evaluation metrics: ROC-AUC of 0.970, Brier score of 0.903, C@1 of 0.882, F1 score of 0.957,
F05U of 0.938, and a mean score of 0.910. The same table also shows the final ranking, where
our approach placed 17th out of 24 participating teams.</p>
      <p><sup>1</sup> https://pan.webis.de/clef25/pan25-web/generated-content-analysis.html</p>
      <p>Compared to the validation results reported in Table 2, these outcomes demonstrate the model’s ability
to generalize effectively to unseen data, with only modest declines in performance, which are expected
due to the inherent distributional shift between validation and test splits. Importantly, the model retains
a high ROC-AUC and F1 score, indicating sustained discriminative power and classification accuracy.
The Brier score and C@1 values remain competitive, further attesting to the model’s well-calibrated
probability outputs and its effectiveness in high-confidence decision-making scenarios.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we presented our submission to the PAN shared task on generative AI content detection.
The central objective of our work was to develop a reliable and interpretable approach for distinguishing
between human-written and AI-generated text. Our experimental results confirm that this objective
has been successfully met: the proposed method demonstrated competitive performance relative to
state-of-the-art systems and passed the official evaluation on the TIRA platform, qualifying for the final
competition results.</p>
      <p>The combination of linguistic feature engineering and ensemble learning enabled both strong
classification capabilities and interpretability, aligning with the goals stated at the outset. These findings
validate the effectiveness of our approach in addressing the challenges posed by generative authorship
verification.</p>
      <p>For future work, we aim to further enhance the model’s generalizability by evaluating its performance
across a wider array of datasets to better assess its robustness under diverse real-world conditions.
Additionally, we plan to examine the system’s resilience to adversarial attacks by introducing controlled
perturbations, thereby deepening our understanding of its limitations and improving its reliability in
adversarial contexts.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was partly supported by the grants FedDAP (PID2020-116118GA-I00), MODERATES
(TED2021-130145B-I00), SocialTOX (PDC2022-133146-C21) and CONSENSO (PID2021-122263OB-C21)
funded by MCIN/AEI/10.13039/501100011033, “ERDF A way of making Europe” and “European Union
NextGenerationEU/PRTR”. This work was also funded by the Ministerio para la Transformación Digital
y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU –
NextGenerationEU within the framework of the project Desarrollo Modelos ALIA.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling checks
as well as text translation. After using this tool, the authors reviewed and edited the content as needed
and take full responsibility for the publication’s content.</p>
      <ref-list>
        <ref>
          <mixed-citation>[14] J. Wu, S. Yang, R. Zhan, Y. Yuan, D. F. Wong, L. S. Chao, A survey on LLM-generated text detection: Necessity, methods, and future directions, 2024. URL: https://arxiv.org/abs/2310.14724. arXiv:2310.14724.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[15] R. Wang, H. Chen, R. Zhou, H. Ma, Y. Duan, Y. Kang, S. Yang, B. Fan, T. Tan, LLM-Detector: Improving AI-generated Chinese text detection with open-source LLM instruction tuning, 2024. URL: https://arxiv.org/abs/2402.01158. arXiv:2402.01158.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[16] K. Wu, L. Pang, H. Shen, X. Cheng, T.-S. Chua, LLMDet: A third party large language models generated text detection tool, arXiv preprint arXiv:2305.15004 (2023).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[17] V. Verma, E. Fleisig, N. Tomlin, D. Klein, Ghostbuster: Detecting text ghostwritten by large language models, arXiv preprint arXiv:2305.15047 (2023).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[18] S. Gehrmann, H. Strobelt, A. M. Rush, GLTR: Statistical detection and visualization of generated text, 2019. URL: https://arxiv.org/abs/1906.04043. arXiv:1906.04043.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[19] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps, et al., Release strategies and the social impacts of language models, arXiv preprint arXiv:1908.09203 (2019).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[20] J. Su, T. Y. Zhuo, D. Wang, P. Nakov, DetectLLM: Leveraging log rank information for zero-shot detection of machine-generated text, arXiv preprint arXiv:2306.05540 (2023).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[21] R. Li, W. Hao, W. Zhao, J. Yang, C. Mao, Learning to rewrite: Generalized LLM-generated text detection, 2025. URL: https://arxiv.org/abs/2408.04237. arXiv:2408.04237.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[22] C. Mao, C. Vondrick, H. Wang, J. Yang, Raidar: Generative AI detection via rewriting, arXiv preprint arXiv:2401.12970 (2024).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[23] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu, How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection, arXiv preprint arXiv:2301.07597 (2023).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[24] Y. Tian, H. Chen, X. Wang, Z. Bai, Q. Zhang, R. Li, C. Xu, Y. Wang, Multiscale positive-unlabeled detection of AI-generated texts, arXiv preprint arXiv:2305.18149 (2023).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[25] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, T. Goldstein, A watermark for large language models, in: International Conference on Machine Learning, PMLR, 2023, pp. 17061–17084.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[26] J. Ren, H. Xu, Y. Liu, Y. Cui, S. Wang, D. Yin, J. Tang, A robust semantics-based watermark for large language model against paraphrasing, arXiv preprint arXiv:2311.08721 (2023).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[27] A. M. Sarvazyan, J. Á. González, P. Rosso, M. Franco-Salvador, Supervised machine-generated text detectors: Family and scale matters, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2023, pp. 121–132.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[28] A. Bhattacharjee, H. Liu, Fighting fire with fire: Can ChatGPT detect AI-generated text?, ACM SIGKDD Explorations Newsletter 25 (2024) 14–21.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[29] J. Read, Assessing Vocabulary, Cambridge University Press, Cambridge, 2000.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[30] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017. To appear.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[31] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          , et al.,
          <article-title>A survey of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.18223</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <article-title>Is artificial intelligence a world changer? a case study of openai's chat gpt</article-title>
          ,
          <source>Recent Progress in Science and Technology</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Hanley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Durumeric</surname>
          </string-name>
          ,
          <article-title>Machine-made media: Monitoring the mobilization of machine-generated articles on misinformation and mainstream news websites</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>18</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>542</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A survey on large language model (llm) security and privacy: The good, the bad, and the ugly</article-title>
          ,
          <source>High-Confidence Computing</source>
          (
          <year>2024</year>
          )
          <fpage>100211</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , et al.,
          <article-title>A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>43</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sadasivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balasubramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feizi</surname>
          </string-name>
          ,
          <article-title>Can ai-generated text be reliably detected?</article-title>
          ,
          <source>arXiv preprint arXiv:2303.11156</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <article-title>Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>C. de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>G. S. de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025)</source>
          , Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsvigun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abassy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Ta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Elozeiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2025</article-title>
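          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          (Eds.),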
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , CEUR Workshop Proceedings, CEUR-WS.org
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Corston-Oliver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brockett</surname>
          </string-name>
          ,
          <article-title>A machine learning approach to the automatic evaluation of machine translation</article-title>
          ,
          in:
          <source>Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Alhijawi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jarrar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>AbuAlRub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bader</surname>
          </string-name>
          ,
          <article-title>Deep learning detection method for large language models-generated scientific content</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>37</volume>
          (
          <year>2025</year>
          )
          <fpage>91</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-N.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>The science of detecting llm-generated text</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>67</volume>
          (
          <year>2024</year>
          )
          <fpage>50</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gallé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rozen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kruszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Elsahar</surname>
          </string-name>
          ,
          <article-title>Unsupervised and distributional detection of machine-generated text</article-title>
          ,
          <source>arXiv preprint arXiv:2111.02878</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Hamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Detection of chatgpt fake science with the xfakesci learning algorithm</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2308.11767. arXiv:2308.11767.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>