<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jeremi K. Ochab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mateusz Matias</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tymoteusz Boba</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomasz Walkowiak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information and Communication Technology, Wroclaw University of Science and Technology</institution>
          ,
          <addr-line>Wroclaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Physics</institution>
          ,
          <addr-line>Astronomy and Applied Computer Science</addr-line>
          ,
          <institution>Jagiellonian University</institution>
          ,
          <addr-line>Kraków</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Theoretical Physics, Jagiellonian University</institution>
          ,
          <addr-line>Kraków</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>M. Kac Center for Complex Systems Research, Jagiellonian University</institution>
          ,
          <addr-line>Kraków</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This submission to the binary AI-detection task is based on a modular stylometric pipeline in which public spaCy models are used for text preprocessing (including tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and for extracting several thousand features (frequencies of n-grams of the above linguistic annotations), while light gradient-boosting machines are used as the classifier. We collected a large corpus of more than 500 000 machine-generated texts for the classifier's training and explored several parameter options to increase the classifier's capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive but explainable approach previously found effective.</p>
      </abstract>
      <kwd-group>
        <kwd>generative AI detection</kwd>
        <kwd>stylometry</kwd>
        <kwd>explainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. MGT detection methods</title>
        <p>
          There is a considerable variety of MGT detection methods reviewed in [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ], but also in the overview of
last year’s Voight-Kampf Generative AI Authorship Verification Task at PAN and ELOQUENT 2024 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
They included systems based on (i) terms, (ii) perplexity or logit statistics, (iii) watermarking, and their
mixtures. The watermarking approach relies on embedding an imperceptible signature in the generated
texts at some stage of the generator training, text generation, or post-processing that modifies
character- or word-level distributions. In the present submission, we disregard this approach due to the task’s
constraints. The logits statistics approach typically involves zero-shot white-box methods, i.e., ones
that require access to the LLM generator (or its surrogate) in order to compute either the likelihood of
a text being generated by it or features later used in a classifier. The black-box alternatives, instead,
would machine-regenerate a given text sample and subsequently compare it to the original to obtain a
similarity score. Finally, term-based systems are typically neural fine-tuned classifiers (from the BERT
family with modifications) using word embeddings or linguistic (stylometric) features.
        </p>
        <p>Our submission follows works such as [8, 9, 10, 11], which either utilised various stylometric features,
augmented data, or expanded the training dataset. We find our approach especially similar to the simple
SVM classifier on TF-IDF features [12], which outranked all neural baselines and most neural-based
submissions in last year’s task. Classifiers based on stylometric features were also found to be
effective elsewhere [13].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. MGT detection robustness</title>
        <p>
          The performance of MGT detection can generally degrade due to two factors: out-of-distribution
issues and attacks [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The former encompasses generalisation issues such as: cross-domain (involving
changing text type, and consequently its vocabulary, style, topics, etc.), cross-language (involving
not only switching the language of the text but also linguistic interference due to non-nativity of the
authors) and cross-LLM (involving detection of text generators unseen during the detector’s training).
The latter includes: paraphrasing the output of one LLM by another (thereby changing the textual feature
distribution of the former) [14], adversarial text perturbations on different levels (characters [15],
syntax [16], or lexis [17]), prompt engineering (taking advantage of in-context learning to change
an LLM’s characteristics by varying prompts [18, 19], including mimicking specific authors or character
profiling [20]), and other attacks.
        </p>
        <p>
          Reportedly [21], supervised detectors can generalise reasonably well across LLM scales but less so
across model families. On the other hand, incorporating unseen languages and handling simple attacks
such as Unicode obfuscation or shortened text length were found to be challenging
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Our own approach has been found vulnerable to cross-domain detection [22] as tested on [23], but
on a closed domain it was robust to one-step paraphrasing. Furthermore, the unexpected performance
of the aforementioned SVM TF-IDF classifier [12] was mainly due to its robustness to obfuscation.
        </p>
        <p>
          We did not explicitly design our detector to target any of these issues; however, we follow the
general recommendation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] that supervised detectors can effectively defend against some of them by
continually expanding training datasets (with adversarial examples, examples of LLM families, examples
of text types, etc.) and fine-tuning even on small samples.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <p>In general, our submission employed (i) gradient-boosted tree models together with (ii) feature
engineering and, crucially, (iii) a large training dataset. We did not target any obfuscation techniques. As in
our previous work [22, 24], we used a modular Python pipeline for interpretable stylometric analysis
being developed for CLARIN-PL (https://gitlab.clarin-pl.eu/stylometry/cl_explainable_stylo) [25]. It is designed to connect text preprocessing and linguistic feature
extraction with various existing NLP tools, classifiers, explainability modules, and visualisation.</p>
      <sec id="sec-3-1">
        <title>3.1. Data source</title>
        <p>Following our unremarkable attempt [22] (F1 = 0.54, compared to an ensemble of stylometric features
and transformers [26] and the highest-ranked result of 0.81) at cross-domain MGT detection on the
AuTexTification [23] benchmark – where training was performed on tweets, how-to articles and legal documents,
while testing was on reviews and news – we decided that our model needed training data as comprehensive
and varied as possible in order for the validation results to hold on the test set.</p>
        <p>
          For that purpose, we collected a total of 563 571 text samples from several openly accessible
datasets [
          <xref ref-type="bibr" rid="ref3">3, 23, 27, 18, 28, 29, 30, 31</xref>
          ] designed as benchmarks in MGT detection; see Table 1. The
number reported above already accounts for samples dropped due to issues with special characters or
incompatible data structures that we were not able to resolve within the time constraints of the PAN
task. In particular cases, not all data were incorporated (e.g., the training but not the validation set in the case
of AuTexTification and PAN’s Voight-Kampf Generative AI Detection; consequently, not all available
genres were included). Some of these datasets were themselves collected from other openly available
datasets and augmented with generated texts. The total number of LLM labels available in the combined
dataset was 348.
        </p>
        <p>The source, genre and model labels were not used in the training.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Stylometric Features</title>
        <p>We considered two options: either a closed set of predefined but more interpretable features, or an open
set of features generated programmatically but still partly based on linguistic analysis.</p>
        <p>Regarding the first, when analysing our own Wikipedia-based dataset [22], we used StyloMetrix [32].
This open-source stylometric text analysis library calculates the appearance of 195 predefined features
that include grammatical forms (tenses, modal verbs, etc.), parts of speech, lexical items (types of
pronouns, hurtful words, etc.), aspects connected to social media (e.g. sentiment analysis), syntactic
forms, and general text statistics (e.g. type-token ratio). StyloMetrix uses the spaCy model for English
to extract these features. The classifiers based on this small feature set consistently scored lower than
the alternative, so the final submission comprised only the second option.</p>
        <p>The second option follows the basic ideas of the R package stylo [33] – mainly
computing token n-grams – augmented with various annotations. At present, for the preprocessing
steps and said annotations (tokenisation, named entity recognition, dependency parsing, part-of-speech
tagging, and morphology annotation) we use the spaCy [34] model en_core_web_lg. Specifically, we
computed the normalised frequencies of:
• lemmas (from uni- to trigrams), excluding named entities,
• part-of-speech tags (from uni- to quadrigrams), including punctuation,
• dependency-based bigrams (where token neighbourhood is defined by distance in the
dependency tree), excluding named entities,
• morphological annotations (unigrams), including entity types (i.e., using named entity recognition
to find named entities and replacing them with their types).
Each of the four feature classes could contain a maximum of 1500 items. This particular set of features
admittedly stems partly from unresolved technical issues, but also from repeated trial and error on
other authorship attribution datasets. For instance, elsewhere [22] we found that punctuation
features, such as the ‘SPACE’ token, can detect human mistakes or artefacts of LLM processing or of
further data post-processing (a redundant whitespace character, e.g., at the beginning of a paragraph or
a second one between words). In this choice of feature classes we also try to minimise, although not
strictly enforce, the generation of duplicate versions of the same feature in separate classes.</p>
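        <p>As an illustration, the frequency computation for one of the four feature classes (POS-tag n-grams) can be sketched as follows. This is a simplified sketch rather than the pipeline's actual code: it assumes the POS tags have already been produced (e.g. by spaCy's en_core_web_lg) and pools all n-gram orders before normalisation.</p>

```python
from collections import Counter

def pos_ngram_frequencies(pos_tags, n_min=1, n_max=4, top_k=1500):
    """Normalised frequencies of POS-tag n-grams (uni- to quadrigrams),
    capped at top_k items, as one of the four feature classes described above.
    `pos_tags` is a list of tags, e.g. [t.pos_ for t in nlp(text)] with spaCy."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(pos_tags) - n + 1):
            counts[tuple(pos_tags[i:i + n])] += 1
    total = sum(counts.values())
    # keep the top_k most frequent n-grams, normalised by the total count
    return {ng: c / total for ng, c in counts.most_common(top_k)}
```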
        <p>As presented in Table 2, we also tested so-called culling (i.e., ignoring features with a document frequency
strictly higher or lower than a given threshold). In the present submission, the majority of our models
did not use culling. In one case, we set the minimum document frequency to 0.1 (that is, about 50k out
of 500k documents), which reduced the number of features from the initial 4594 to 3264.</p>
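        <p>Culling by document frequency amounts to keeping only features whose document frequency falls within given bounds; a minimal sketch (assuming set-valued per-document features, not the pipeline's actual implementation) could be:</p>

```python
def cull_features(doc_features, min_df=0.0, max_df=1.0):
    """Keep features whose document frequency (fraction of documents
    containing the feature) lies within [min_df, max_df].
    `doc_features` is a list of per-document feature sets."""
    n_docs = len(doc_features)
    df = {}
    for feats in doc_features:
        for f in set(feats):
            df[f] = df.get(f, 0) + 1
    return {f for f, c in df.items() if min_df <= c / n_docs <= max_df}
```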
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Classifier</title>
        <p>We take advantage of existing solutions: Light Gradient-Boosting Machine (LGBM) [35] as a
state-of-the-art boosted-trees classifier and Scikit-learn [36] for feature counting and cross-validation.</p>
        <p>[Table 1: source datasets – AuTexTification [23], CHEAT [27], HC3 [18], HC3 Plus [28], MAGE [29], MULTITuDE [30], M4 [31] – with their genres (e.g. Wikipedia, abstracts, peer reviews, news briefs), sample counts, and LLM labels (e.g. gpt-3.5-turbo, gpt-4o, llama-2/3 variants, gemini variants, mistral/mixtral, BLOOM, text-davinci-003); the table layout is not recoverable from the extracted text.]</p>
        <p>Following our previous experience on other, smaller datasets – mainly in English and Polish –
during pipeline development, the LGBM classifier parameters were set to: DART boosting,
learning_rate = 0.5, and enabled bagging (randomly selecting bagging_fraction = 0.8 of the data, without
resampling, every bagging_freq = 3 iterations). At this time, we used a binary classifier, but it is
possible – and can in fact be beneficial [22] – to train a multiclass model using the LLM labels (see
Table 1) and then map its predictions back to the binary ‘human vs. machine’ labels.</p>
        <p>The submitted classifier versions differ in three LGBM hyperparameters:
• maximal number of leaves per tree (num_leaves),
• number of boosting iterations (num_iterations),
• maximal depth of the tree model (max_depth).
In our smaller pre-submission experiments (e.g., human authorship attribution on 2-100 novels, resulting
in sample counts of the order of thousands or tens of thousands at most), satisfactory results
were obtained with num_leaves = 5, num_iterations = 100, max_depth = 5. We decided that with
500k text samples, 4k features and 348 LLM labels, the LGBM classifier required a higher capacity; hence
we submitted three classifier versions: small, medium and big, listed in Table 2. Further hyperparameter
optimisation is possible but was not performed for the present submission.</p>
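        <p>The settings above can be gathered into a parameter dictionary, sketched below using the parameter names from the LightGBM documentation. Only the pre-submission capacity values quoted in the text are filled in; the submitted small/medium/big values are those listed in Table 2.</p>

```python
# Shared LGBM settings described in the text (LightGBM parameter names).
base_params = {
    "objective": "binary",
    "boosting_type": "dart",   # DART boosting
    "learning_rate": 0.5,
    "bagging_fraction": 0.8,   # sample 80% of rows without resampling
    "bagging_freq": 3,         # re-draw the bag every 3 iterations
}

# Capacity settings from the pre-submission experiments; the submitted
# small/medium/big variants scale these up (see Table 2).
pre_submission_capacity = {
    "num_leaves": 5,
    "num_iterations": 100,
    "max_depth": 5,
}

params = {**base_params, **pre_submission_capacity}
```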
        <p>Since LGBM training is fast, we used a stratified 10-fold cross-validation (CV) scheme to obtain
more reliable validation and test error estimates. We then decided to validate both a classifier from a
single fold (-single) and the probability scores averaged over classifiers trained on all CV folds (-cv).</p>
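        <p>The -cv variant amounts to a simple mean of per-document probability scores across the fold models; a minimal sketch (not the pipeline's actual code):</p>

```python
def average_fold_scores(fold_scores):
    """Average per-document probability scores over classifiers trained
    on different CV folds (the '-cv' variant described above).
    `fold_scores` is a list of per-model score lists, one per fold."""
    n_models = len(fold_scores)
    n_docs = len(fold_scores[0])
    return [sum(model[i] for model in fold_scores) / n_models
            for i in range(n_docs)]
```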
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Evaluation setup</title>
        <p>The environment for running and evaluating submissions to Subtask 1 “AI Detection Sensitivity” of the
PAN: Voight-Kampf Generative AI Detection 2025 task was TIRA [37]. This platform accepts dockerised
submissions in order to ensure their reproducibility. Upon submission, our contribution was validated
on two datasets, which we refer to as "Validation 1" – the validation split of the dataset available for
training (texts and labels available) – and "Validation 2" – the dataset used for evaluation at TIRA (contents
not available). Both datasets could be used for classifier evaluation and selection; see Table 3. The TIRA
platform produced the following six evaluation metrics (all on a 0-1 scale, with 1 representing the perfect
score):
• ROC-AUC: the area under the ROC (Receiver Operating Characteristic) curve
• Brier: the complement of the Brier score (equivalent to mean squared loss)
• C@1: a modified accuracy score that breaks ties by assigning non-answers (class probability =
0.5) the average accuracy of the remaining cases
• F1: the harmonic mean of precision and recall
• F0.5u: a modified F0.5 measure (where precision weighs more than recall) that treats
non-answers as false negatives
• The arithmetic mean of all of the above.</p>
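        <p>The C@1 behaviour can be made concrete with a small sketch; it follows the formula c@1 = (n_c + n_u · n_c / n) / n known from the PAN evaluations, with the tie threshold of 0.5 taken from the description above:</p>

```python
def c_at_1(y_true, probs, threshold=0.5):
    """C@1: accuracy where predictions exactly at `threshold` count as
    non-answers and receive the average accuracy of the answered cases:
    c@1 = (n_c + n_u * n_c / n) / n."""
    n = len(y_true)
    answered = [(t, p) for t, p in zip(y_true, probs) if p != threshold]
    n_u = n - len(answered)                 # number of non-answers
    n_c = sum(1 for t, p in answered        # correctly answered cases
              if (p > threshold) == bool(t))
    return (n_c + n_u * n_c / n) / n
```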
        <p>The final evaluation also included the False Positive Rate (FPR) and the False Negative Rate
(FNR). The submissions were ranked by a macro-average of the arithmetic mean over all individual
data sources (all individual datasets contained in the test and the ELOQUENT collections).</p>
        <p>[Table 3 residue: F1 scores 0.911, 0.746, 0.823, 0.827, 0.898 and an F0.5u column header; row labels are not recoverable from the extracted text.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation results</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Two general observations are: (1) a larger capacity of boosted trees increased the detection performance,
and (2) obfuscation considerably reduced it. Although our models have not reached the baseline
TF-IDF scores, the boosted trees have the capacity to learn from a larger number of
features, so incorporating TF-IDF features [12] or standardising feature frequencies – found to be greatly
effective in stylometry [38, 33] – and other classic feature-engineering techniques could be beneficial.
Straightforward augmentation of the training set with obfuscated samples could further improve the
results. Another unexplored avenue is simply hyperparameter optimisation (both in terms of the feature
set and the LGBM parameters). The main computational overhead of our method is feature extraction on the
large training dataset; classifier training (and training continuation), inference, and explanation [39] are
inexpensive. In summary, we perceive it as a trade-off between the smaller cost and greater explainability
of boosted trees and the better generalisation of neural-based systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research for this publication has been supported by a grant from the Priority Research Area
DigiWorld under the Strategic Programme Excellence Initiative at Jagiellonian University. JKO’s research on
the stylometric pipeline was financed by European Funds for Smart Economy, FENG program, CLARIN
– Common Language Resources and Technology Infrastructure, project no. FENG.02.04-IP.040004/24-00.</p>
      <p>MM and TB participated in the submission as a programming assignment from the “AI Workshop II”
course at Jagiellonian University during the summer term of 2025.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Writefull’s model for grammar and
spelling checking. After using this tool, the authors reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
      <p>
[8] M. Guo, Z. Han, H. Chen, J. Peng, A Machine-Generated Text Detection Model Based on Text
Multi-Feature Fusion, in: G. Faggioli, N. Ferro, P. Galuscáková, A. G. S. d. Herrera (Eds.), Working
Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12
September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 2593–2602.</p>
      <p>URL: https://ceur-ws.org/Vol-3740/paper-238.pdf.
[9] P. Miralles, A. Martín, D. Camacho, Team aida at PAN: Ensembling Normalized Log Probabilities,
in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. Herrera (Eds.), Working Notes Papers of the
CLEF 2024 Evaluation Labs, CEUR-WS.org, 2024, pp. 2807–2813. URL: http://ceur-ws.org/Vol-3740/
paper-268.pdf.
[10] A. Yadagiri, D. Kalita, A. Ranjan, A. K. Bostan, P. Toppo, P. Pakray, Team cnlp-nits-pp at PAN:
Leveraging BERT for Accurate Authorship Verification: A Novel Approach to Textual Attribution,
in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. Herrera (Eds.), Working Notes Papers of the
CLEF 2024 Evaluation Labs, CEUR-WS.org, 2024, pp. 2976–2987. URL: http://ceur-ws.org/Vol-3740/
paper-290.pdf.
[11] L. Guo, W. Yang, L. Ma, J. Ruan, BLGAV: Generative AI Author Verification Model Based on
BERT and BiLSTM, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. Herrera (Eds.), Working
Notes Papers of the CLEF 2024 Evaluation Labs, CEUR-WS.org, 2024, pp. 2585–2592. URL: http:
//ceur-ws.org/Vol-3740/paper-237.pdf.
[12] L. Lorenz, F. Z. Aygüler, F. Schlatt, N. Mirzakhmedova, BaselineAvengers at PAN 2024:
OftenForgotten Baselines for LLM-Generated Text Detection, in: G. Faggioli, N. Ferro, P. Galuscáková,
A. G. S. d. Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum
(CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings,
CEUR-WS.org, 2024, pp. 2761–2768. URL: https://ceur-ws.org/Vol-3740/paper-262.pdf.
[13] C. Opara, StyloAI: Distinguishing AI-Generated Content with Stylometric Analysis, in: A. M.</p>
      <p>Olney, I.-A. Chounta, Z. Liu, O. C. Santos, I. I. Bittencourt (Eds.), Artificial Intelligence in Education.
Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks,
Practitioners, Doctoral Consortium and Blue Sky - 25th International Conference, AIED 2024,
Recife, Brazil, July 8-12, 2024, Proceedings, Part II, volume 2151 of Communications in Computer and
Information Science, Springer, 2024, pp. 105–114. URL: https://doi.org/10.1007/978-3-031-64312-5_
13. doi:10.1007/978-3-031-64312-5_13.
[14] V. S. Sadasivan, A. Kumar, S. Balasubramanian, W. Wang, S. Feizi, Can AI-Generated Text be
Reliably Detected? Stress Testing AI Text Detectors Under Various Attacks, Transactions on
Machine Learning Research (2025). URL: https://openreview.net/forum?id=OOgsAZdFOt.
[15] H. Stiff, F. Johansson, Detecting computer-generated disinformation, International Journal of
Data Science and Analytics 13 (2022) 363–383. URL: https://doi.org/10.1007/s41060-021-00299-5.
doi:10.1007/s41060-021-00299-5.
[16] M. M. Bhat, S. Parthasarathy, How Effectively Can Machines Defend Against Machine-Generated
Fake News? An Empirical Study, in: A. Rogers, J. Sedoc, A. Rumshisky (Eds.), Proceedings of
the First Workshop on Insights from Negative Results in NLP, Association for Computational
Linguistics, Online, 2020, pp. 48–53. URL: https://aclanthology.org/2020.insights-1.7/. doi:10.
18653/v1/2020.insights-1.7.
[17] E. Crothers, N. Japkowicz, H. Viktor, P. Branco, Adversarial Robustness of Neural-Statistical
Features in Detection of Generative Transformers, in: 2022 International Joint Conference on
Neural Networks (IJCNN), 2022, pp. 1–8. URL: https://ieeexplore.ieee.org/document/9892269.
doi:10.1109/IJCNN55064.2022.9892269. ISSN: 2161-4407.
[18] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, Y. Wu, How Close is ChatGPT to Human
Experts? Comparison Corpus, Evaluation, and Detection, CoRR abs/2301.07597 (2023). URL: https:
//doi.org/10.48550/arXiv.2301.07597. doi:10.48550/ARXIV.2301.07597, arXiv: 2301.07597.
[19] Z. Liu, Z. Yao, F. Li, B. Luo, On the Detectability of ChatGPT Content: Benchmarking, Methodology,
and Evaluation through the Lens of Academic Writing, 2024. URL: http://arxiv.org/abs/2306.05524.
doi:10.48550/arXiv.2306.05524, arXiv:2306.05524 [cs].
[20] K. Przystalski, J. K. Argasiński, N. Lipp, D. Pacholczyk, Building Personality-Driven
Language Models: How Neurotic is ChatGPT, Synthesis Lectures on Engineering, Science, and
Technology, Springer Nature Switzerland, Cham, 2025. URL: https://link.springer.com/10.1007/
978-3-031-80087-0. doi:10.1007/978-3-031-80087-0.
[21] A. M. Sarvazyan, J. González, P. Rosso, M. Franco-Salvador, Supervised Machine-Generated Text
Detectors: Family and Scale Matters, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis,
A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR
Meets Multilinguality, Multimodality, and Interaction, Springer Nature Switzerland, Cham, 2023,
pp. 121–132. doi:10.1007/978-3-031-42448-9_11.
[22] K. Przystalski, J. K. Argasiński, I. Grabska-Gradzińska, J. Ochab, Stylometry recognizes human
and llm-generated texts in short samples, 2025. Manuscript submitted for publication to Expert
Systems with Applications.
[23] A. M. Sarvazyan, J. González, M. Franco-Salvador, F. Rangel, B. Chulvi, P. Rosso, Overview of
AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple
Domains, in: Procesamiento del Lenguaje Natural, Jaén, Spain, 2023.
[24] J. K. Argasiński, I. Grabska-Gradzińska, K. Przystalski, J. K. Ochab, T. Walkowiak,
Stylometric analysis of large language model-generated commentaries in the context of medical
neuroscience, International Conference . . . (2024) 281–295. URL: https://link.springer.com/chapter/10.
1007/978-3-031-63775-9_20. doi:10.1007/978-3-031-63775-9_20.
[25] J. K. Ochab, T. Walkowiak, Implementing interpretable models in stylometric analysis, in: Digital</p>
      <p>Humanities 2024: Conference Abstracts, George Mason University (GMU), Washington, D.C., 2024.
[26] G. Mikros, A. Koursaris, D. Bilianos, ..., AI-writing detection using an ensemble of transformers
and stylometric features, IberLEF . . . (2023).
[27] P. Yu, J. Chen, X. Feng, Z. Xia, CHEAT: A Large-scale Dataset for Detecting CHatGPT-writtEn
AbsTracts, IEEE Transactions on Big Data (2025) 1–9. URL: https://ieeexplore.ieee.org/abstract/
document/10858415. doi:10.1109/TBDATA.2025.3536929.
[28] Z. Su, X. Wu, W. Zhou, G. Ma, S. Hu, HC3 Plus: A Semantic-Invariant Human ChatGPT
Comparison Corpus, 2024. URL: http://arxiv.org/abs/2309.02731. doi:10.48550/arXiv.2309.02731,
arXiv:2309.02731 [cs].
[29] Y. Li, Q. Li, L. Cui, W. Bi, Z. Wang, L. Wang, L. Yang, S. Shi, Y. Zhang, MAGE: Machine-generated
Text Detection in the Wild, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the
62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 36–53. URL: https://
aclanthology.org/2024.acl-long.3/. doi:10.18653/v1/2024.acl-long.3.
[30] D. Macko, R. Moro, A. Uchendu, J. Lucas, M. Yamashita, M. Pikuliak, I. Srba, T. Le, D. Lee,
J. Simko, M. Bielikova, MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection
Benchmark, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, Association for Computational Linguistics, Singapore,
2023, pp. 9960–9987. URL: https://aclanthology.org/2023.emnlp-main.616/. doi:10.18653/v1/
2023.emnlp-main.616.
[31] Y. Wang, J. Mansurov, P. Ivanov, J. Su, A. Shelmanov, A. Tsvigun, C. Whitehouse, O.
Mohammed Afzal, T. Mahmoud, T. Sasaki, T. Arnold, A. F. Aji, N. Habash, I. Gurevych, P. Nakov,
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text
Detection, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the
European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers),
Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 1369–1407. URL: https:
//aclanthology.org/2024.eacl-long.83/.
[32] I. Okulska, D. Stetsenko, A. Kołos, A. Karlińska, K. Głąbińska, A. Nowakowski, Stylometrix: An
open-source multilingual tool for representing stylometric vectors, arXiv preprint arXiv:2309.12810
(2023).
[33] M. Eder, M. Kestemont, J. Rybicki, Stylometry with R: A Package for Computational Text Analysis,</p>
      <p>The R Journal 8 (2016) 1–15. doi:10.32614/RJ-2016-007.
[34] I. Montani, M. Honnibal, M. Honnibal, A. Boyd, S. V. Landeghem, H. Peters, explosion/spaCy:
v3.7.2: Fixes for APIs and requirements, 2023. URL: https://doi.org/10.5281/zenodo.10009823.
doi:10.5281/zenodo.10009823.
[35] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient
gradient boosting decision tree, Advances in neural information processing systems 30 (2017)
3146–3154.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay,
Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–
2830.
[37] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot,
F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in
Information Retrieval, volume 13982, Springer Nature Switzerland, Cham, 2023, pp. 236–241. URL:
https://link.springer.com/10.1007/978-3-031-28241-6_20. doi:10.1007/978-3-031-28241-6_
20, series Title: Lecture Notes in Computer Science.
[38] J. Burrows, ‘Delta’: A Measure of Stylistic Difference and a Guide to Likely Authorship, Literary
and Linguistic Computing 17 (2002) 267–287. doi:10.1093/llc/17.3.267.
[39] S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb,
N. Bansal, S.-I. Lee, From local explanations to global understanding with explainable ai for trees,
Nature Machine Intelligence 2 (2020) 2522–5839.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Lund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Mannuru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shimray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Chatgpt and a new academic reality: Artificial intelligence-written research papers and the ethics of the large language models in scholarly publishing</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>74</volume>
          (
          <year>2023</year>
          )
          <fpage>570</fpage>
          -
          <lpage>581</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Angelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Baglivo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Arzilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Privitera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Tozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <article-title>ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health</article-title>
          ,
          <source>Frontiers in Public Health</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>1166120</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsvigun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abassy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Ta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Elozeiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Artemova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2025</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <article-title>Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>C. de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science</source>
          , Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Crothers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Japkowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Viktor</surname>
          </string-name>
          ,
          <article-title>Machine-Generated Text: A Comprehensive Survey of Threat Models and Detection Methods</article-title>
          ,
          <source>IEEE Access</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>70977</fpage>
          -
          <lpage>71002</lpage>
          . URL: https://doi.org/10.1109/ACCESS.2023.3294090. doi:10.1109/ACCESS.2023.3294090.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <article-title>A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions</article-title>
          ,
          <source>Computational Linguistics</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>64</lpage>
          . URL: https://doi.org/10.1162/coli_a_00549. doi:10.1162/coli_a_00549.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dürlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the "Voight-Kampff" Generative AI Authorship Verification Task at PAN and ELOQUENT 2024</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>d. Herrera</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September 2024</source>
          , volume
          <volume>3740</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>2486</fpage>
          -
          <lpage>2506</lpage>
          . URL: https://ceur-ws.org/Vol-3740/paper-225.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>