<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adept: AI-Generated Text Detection Based on Phrasal Category N-Grams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Völpel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oren Halvani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Secure Information Technology SIT</institution>
          ,
          <addr-line>Rheinstraße 75, 64295 Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With the advent of large language models (LLMs), the generation of artificial text has become remarkably accessible and is increasingly integrated into everyday applications. As the use of LLMs to produce content becomes more widespread, the ability to distinguish between AI-generated and human-written texts has grown in importance. This year's PAN competition focuses on this specific challenge: given a text, participants must determine whether it was written by a human or generated by an AI system (more specifically, an LLM). We propose a classification approach called Adept, which explicitly leverages constituent trees to model the grammatical structure of texts. For each sentence, we generate a constituent tree and represent the entire text by aggregating the distribution of syntactic N-grams, defined as paths of a fixed length within these trees. Using these structural representations, we train a multilayer perceptron (MLP) to classify authorship. Adept achieves a mean score of 0.843 on the test dataset, as evaluated by the organizers of the competition. This places us at rank 16 out of 24, with a score difference of 0.056 to the first place and 0.036 to the third place.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2025</kwd>
        <kwd>Subtask 1</kwd>
        <kwd>Voight-Kampff AI Detection Sensitivity</kwd>
        <kwd>Constituent Tree n-grams</kwd>
        <kwd>AI-generated Text Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>existing challenges that make the detection of AI-generated content difficult. Among these, they mention
paraphrasing attacks, in which AI-generated texts are rephrased to evade detection. Other examples
mentioned are the generative attack and the spoofing attack. According to the authors, open questions
that remain are how detectors could be standardized and whether watermarking should be regulated to prevent
misuse. Other aspects include improving the fairness and robustness of text detectors.
Last year's PAN competition hosted the following task: given two texts, one of which was AI-generated
and one human-authored, the participants had to develop a model capable of identifying
the human-written one [5]. The organizers summarized that most solutions were based
on a BERT-based classification with some additional modifications such as PU-loss or R-Drop. There were
also approaches that built classifiers such as LSTMs or XGBoost on engineered features like
perplexity, token distributions or general stylometrics. The best performing solution achieved an AUC
of 0.961 and a c@1 score of 0.912 [6]. The authors implemented a special kind of ensemble, involving
two LLMs fine-tuned with a classification head and the Binoculars scores. Binoculars, which is also
among the baselines of this year's competition, is a zero-shot detection method that employs
two pre-trained LLMs [7]. It contrasts the perplexity that one LLM assigns to the given text with the
cross-perplexity between both LLMs. These scores allow for excellent separation, as is shown in their
publication.</p>
      <p>At its core, Adept relies on constituent trees extracted from the text. A constituent tree (also called a
phrase structure tree or parse tree) is a hierarchical, tree-like representation of the syntactic structure of a
sentence according to a phrase structure grammar (such as a context-free grammar). It shows how a
sentence is composed of nested constituents (phrases) such as noun phrases (NP), verb phrases (VP)
and prepositional phrases (PP), each of which can be further broken down into smaller constituents or
individual words (terminals). For each sentence in a text, we generate a constituent tree. We then convert
the complete set of trees into a numerical vector representation by computing the relative frequencies
of syntactic paths of a fixed length across all trees in the text. These vectorized representations are
used as input to a neural network classifier. In our case, we use a simple MLP trained to distinguish
between AI-generated and human-authored texts. By incorporating the grammatical structure of the
text into the classification process, this approach achieves a mean score of 0.843 on the final test dataset.</p>
      <p>This paper is organized as follows. Section 2 introduces the datasets made available by the organizers of
the competition. In Section 3, we describe our solution to the task, followed by Section 4, which reports
the results obtained on the final test dataset. In Section 5, we discuss the evaluation framework
of the competition as well as our proposed method and highlight possible improvements. Finally, Section 6
presents the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. PAN Dataset</title>
      <p>The organizers of the PAN competition provide a comprehensive training and validation dataset
comprising a mix of AI-generated and human-authored texts. Figures 1, 2, and 3 present an exploratory
analysis of the training data. In total, there are 23,707 texts in the training set. Among these, there
are 14,606 AI-generated texts, which amounts to a share of 61.6%, and 9,101 human-authored texts,
which corresponds to a share of 38.4%. In this regard, it is important to emphasize that this dataset is
unbalanced with respect to the ratio of AI-generated to human-authored texts. The same holds for
the validation dataset, which contains 2,312 (64.4%) AI-generated and 1,277 (35.6%) human-written
texts. We elaborate on the effects of this distribution in Section 5. The set of AI models
used to generate the texts is diverse, as can be seen in Figure 1. This is beneficial, as it enables greater
generalizability of the developed classification model. Furthermore, the texts vary in
length (Figure 2). While AI-generated and human-written texts are mostly between 500 and 1,000 words
long, the human-authored texts are more often longer. There are also some extremes,
since both very short and very long texts are present. The plots further reveal that every
text belongs to one of three possible genres, with fiction writing being represented the most (Figure 3).
</p>
      <p>[Figures 1 to 3: exploratory plots of the training data, showing the generating AI models (Figure 1), the text lengths (Figure 2), and the genres (Figure 3); axis label: Counts.]</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Solution</title>
      <sec id="sec-3-1">
        <title>3.1. Algorithm</title>
        <p>Our proposed approach is inherently based on capturing the grammatical structures of the sentences
of a text. Our main hypothesis is that the complexity and structural composition
of sentences contain discriminative information to differentiate AI-generated from human-authored
texts. To exploit this, the central element is the generation of constituent trees. Constituent trees have a
long tradition in linguistics. A constituent tree is a hierarchical representation of the grammatical units
of a sentence. A constituent is either a word or a compound object that represents a sub-phrase of a
sentence. Each constituent, except for single words, consists of other constituents in turn, which is why
a tree structure arises. A comprehensive introduction to constituent trees can be found in [8].
An illustration of a constituent tree (also known as a tree diagram [8]) can be seen in Figure 4. In that
case, the sentence consists of a noun phrase and a verb phrase on the highest level. In our approach, we
deal with a slightly different version of a constituent tree, i. e., leaf nodes, which originally represent
single words, are replaced by their respective part-of-speech tags. A constituent tree gives a profound
insight into the grammatical complexity of a sentence. The height of such trees and the number of unique
phrase types are examples of quantities that contain relevant information.</p>
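        <p>To make the two quantities mentioned above concrete, the following minimal sketch computes the height and the set of phrase types of a small tree. The nested-tuple tree representation and the function names are assumptions made for illustration only; they are not part of our submitted implementation.</p>
        <preformat><![CDATA[
```python
# A node is a pair (label, children). Leaves carry a part-of-speech tag
# and an empty list of children. This representation is an assumption of the sketch.

def height(tree):
    """Number of edges on the longest path from the root to a leaf."""
    label, children = tree
    if not children:
        return 0
    return 1 + max(height(child) for child in children)


def phrase_types(tree):
    """Set of labels of all internal (phrasal) nodes."""
    label, children = tree
    if not children:
        return set()
    found = {label}
    for child in children:
        found |= phrase_types(child)
    return found


# Toy tree: S -> NP VP, NP -> N, VP -> V NP, NP -> D N
example = ("S", [
    ("NP", [("N", [])]),
    ("VP", [("V", []), ("NP", [("D", []), ("N", [])])]),
])

print(height(example))                # 3
print(sorted(phrase_types(example)))  # ['NP', 'S', 'VP']
```
]]></preformat>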
        <p>Our method is composed of three successive steps, which are summarized in Figures 5, 6, and 7 and
described in what follows.</p>
        <p>Step-1: In the first step (Figure 5), each text is segmented into sentences, and for each sentence
the respective constituent tree is constructed. As a result, we obtain a set of
constituent trees for each text. In our approach, the leaf nodes of every constituent tree are the part-of-speech tags of the
words they replace.</p>
        <p>Step-2: In the second step (Figure 6), these sets of constituent trees are transformed into a numeric
representation. To do so, for each tree, we count the occurring <italic>N</italic>-grams. By this, we mean counting
the frequencies of unique paths of length <italic>N</italic>, starting at any node in the tree and following the
directions of the edges. We interpret every edge in the tree as directed and pointing downwards in
the tree. This is done for all <italic>N</italic> = 1, ..., 7, where <italic>N</italic> = 1 simply yields the counts
of the occurring node types. Afterwards, the frequencies are aggregated for each <italic>N</italic> across all trees
belonging to the same text. Subsequently, these frequencies are normalized for each <italic>N</italic>, such that the
corresponding vector represents relative frequencies. As a result of this step, we obtain for each text
seven vectors (one for each <italic>N</italic>), each of which represents the relative frequencies of unique paths of a specific
length across all trees belonging to the same text. (We ensure that these vectors have the same length
for each text, such that unique paths that do not occur in a text are still represented by 0.) These seven
vectors are concatenated and finally represent the original text. The final vector is an interpretable
representation of the grammatical structure of a text.</p>
        <p>[Figure 5: Step-1, from a document (Sentence 1, ..., Sentence x) to one constituent tree per sentence.]</p>
        <p>Step-3: The final step (Figure 7) consists of training a classifier on the dataset
obtained after step-2. To do so, we repeat step-1 and step-2 for both the training and the validation dataset.
As a classifier, we implemented a simple MLP. For the optimization, we used the Adam method with
a fixed learning rate of 0.001. The MLP was composed of 3 hidden layers, each with 1000
neurons. Using the validation dataset, we manually tuned the number and size of the hidden
layers. We ran the training for 30 epochs and subsequently plotted the accuracy curve on
the validation set to select the best possible model.</p>
        <p>[Figure 7: Step-3, a dataset of N-gram distributions with labels is used for training.]</p>
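        <p>The path counting of step-2 can be sketched as follows. This is a minimal illustration under stated assumptions, not our submitted code: trees are nested tuples (label, children), a path of length N is taken to comprise N nodes (so that N = 1 yields the node types), and the names count_paths, text_vector and vocab are hypothetical. The fixed vocabulary of paths per N is assumed to be collected beforehand, for example from the training corpus.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def count_paths(tree, n):
    """Count downward paths with exactly n nodes, starting at any node of the tree."""
    counts = Counter()

    def paths_from(node, remaining):
        # Yield the label tuples of all downward paths with `remaining` nodes
        # that start at `node`.
        label, children = node
        if remaining == 1:
            yield (label,)
            return
        for child in children:
            for tail in paths_from(child, remaining - 1):
                yield (label,) + tail

    def visit(node):
        for path in paths_from(node, n):
            counts[path] += 1
        for child in node[1]:
            visit(child)

    visit(tree)
    return counts


def text_vector(trees, vocab, max_n=7):
    """Concatenate, for n = 1..max_n, the relative path frequencies of one text.

    vocab[n] is a fixed, ordered list of paths with n nodes (an assumption of
    this sketch) so that every text is mapped to a vector of the same length.
    Paths that do not occur in the text contribute 0.
    """
    vector = []
    for n in range(1, max_n + 1):
        counts = Counter()
        for tree in trees:
            counts.update(count_paths(tree, n))
        total = sum(counts.values())
        vector.extend(counts[path] / total if total else 0.0 for path in vocab[n])
    return vector


# Toy tree: S -> NP VP, NP -> N, VP -> V NP, NP -> D N
tree = ("S", [
    ("NP", [("N", [])]),
    ("VP", [("V", []), ("NP", [("D", []), ("N", [])])]),
])

bigrams = count_paths(tree, 2)
print(bigrams[("NP", "N")])  # 2

# Vocabulary built from the toy tree itself, for n = 1, 2, 3 only.
vocab = {n: sorted(count_paths(tree, n)) for n in range(1, 4)}
vector = text_vector([tree], vocab, max_n=3)
print(len(vector))  # 17
```
]]></preformat>
        <p>In this sketch, the relative frequencies are normalized by the number of all paths of the given length in the text, including paths outside the vocabulary; whether unseen paths should count towards the total is a design choice that the sketch does not settle.</p>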
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Implementation Details</title>
        <p>The implementation was carried out in Python. We used the Constituent Treelib (CTL) package<sup>2</sup> [9]
for the generation of the constituent trees and torch to develop the neural network. The CTL package
only provides a bracketed string representation of a constituent tree. Therefore, we had to develop a
parser that processes such a string and transforms it into a list of edges and a dictionary that maps each
node index to the respective node type of the constituent tree. Furthermore, it should be noted that the
computation of a constituent tree for a given sentence takes several seconds. This does not scale
well to large corpora and long texts. To speed up the computation, we parallelized the processing of
different texts across multiple CPU cores on our cluster. In total, step-1 took ≈ 9 hours for the training
corpus and ≈ 1.5 hours for the validation data. The final evaluation on the test dataset,
which was inaccessible to us, was executed by the organizers of the competition.
The solution was uploaded to Tira, an environment that provides access to computing resources and
is used to manage the submitted solutions [10].</p>
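        <p>The conversion of a bracketed string into an edge list and a node-type dictionary can be sketched as follows. We assume here a Penn-Treebank-style string in which every opening bracket is followed by a node label and words appear as bare tokens; the exact output format of the CTL package may differ, and parse_bracketed is a hypothetical name rather than the parser we actually wrote.</p>
        <preformat><![CDATA[
```python
import re


def parse_bracketed(tree_string):
    """Parse e.g. '(S (NP (DT The) (NN cat)) (VP (VBD sat)))'.

    Returns (edges, node_types): a list of (parent_index, child_index) pairs
    and a dict mapping each node index to its label. Words are dropped, so the
    part-of-speech tags become the leaves of the tree.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", tree_string)
    edges, node_types, stack = [], {}, []
    expect_label = False
    for token in tokens:
        if token == "(":
            expect_label = True          # the next token names a new node
        elif token == ")":
            stack.pop()                  # the current node is complete
        elif expect_label:
            index = len(node_types)
            node_types[index] = token
            if stack:
                edges.append((stack[-1], index))
            stack.append(index)
            expect_label = False
        # Any other token is a word and is skipped.
    return edges, node_types


edges, node_types = parse_bracketed("(S (NP (DT The) (NN cat)) (VP (VBD sat)))")
print(edges)       # [(0, 1), (1, 2), (1, 3), (0, 4), (4, 5)]
print(node_types)  # {0: 'S', 1: 'NP', 2: 'DT', 3: 'NN', 4: 'VP', 5: 'VBD'}
```
]]></preformat>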
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Table 1 shows the final results of all evaluated solutions, as published by the organizers of the competition [2].
As can be seen, we obtain a good c@1 score of 0.816, which in our case equals accuracy,
since our method does not produce any non-predictions, and an overall mean score of 0.843<sup>3</sup>. This
beats two out of three baselines and is also close to the best performing baseline, i. e., a linear SVM with
TF-IDF. Overall, the performances are rather close to each other. Our approach has a difference of only
0.02 to the method ranked tenth and a score difference of 0.056 to the winning team. This shows that the
submitted models all perform within a similar range.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In the following, we scrutinize two aspects: first, the framework used for evaluation
and, second, our method as such and how it could be advanced.</p>
      <sec id="sec-5-1">
        <title>5.1. Discussion on the evaluation</title>
        <p>The overall score in the evaluation is computed as the average of five different metrics. However, it
is generally not meaningful to compute an average across heterogeneous performance metrics such
as AUC, F<sub>1</sub>, c@1 (which, in our case, is equivalent to accuracy) and the (inverted) Brier score. Each metric
captures a distinct aspect of the model's performance. The F<sub>1</sub> score quantifies the trade-off between
precision and recall. The AUC measures the model's ability to rank positive instances ahead of negative
ones. The Brier score, which is essentially the standard Mean Squared Error (MSE), evaluates the
accuracy of predicted probabilities, that is, the model's calibration.</p>
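        <p>As a small illustration of the Brier score as a mean squared error, the following sketch computes it for four hypothetical predictions together with the inverted value 1 - Brier, which is the form in which the score enters the overall mean. The probabilities and labels are invented for the example.</p>
        <preformat><![CDATA[
```python
def brier_score(probabilities, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Hypothetical predicted probabilities for the positive class and true labels.
probabilities = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]

brier = brier_score(probabilities, labels)
inverted_brier = 1.0 - brier  # larger is better

print(round(brier, 4), round(inverted_brier, 4))
```
]]></preformat>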
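        <p>Besides this, there are other challenges as well. Since our approach always returns a prediction, the
c@1 measure degenerates to the standard accuracy measure. This is easy to show. According to
[11], this measure is defined as follows:
<disp-formula><tex-math><![CDATA[\mathrm{c@1} = \frac{1}{n}\left(n_c + \frac{n_u \cdot n_c}{n}\right)]]></tex-math></disp-formula>
Here, <italic>n</italic> denotes the total number of instances in the dataset, <italic>n</italic><sub>c</sub> is the number of correct predictions
(i. e., true positives and true negatives, or TP + TN) and <italic>n</italic><sub>u</sub> is the number of non-predictions, that is,
instances for which no prediction was provided. Since our approach always produces a prediction,
<italic>n</italic><sub>u</sub> = 0, and therefore
<disp-formula><tex-math><![CDATA[\mathrm{c@1} = \frac{n_c}{n} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} = \mathrm{Accuracy}]]></tex-math></disp-formula></p>
        <table-wrap>
          <caption>
            <p>Ranking of the submitted solutions and baselines (rank x marks a baseline), with one score per entry.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Rank</th><th>Team</th><th>Score</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>mdok</td><td>0.853</td></tr>
              <tr><td>2</td><td>steely</td><td>0.842</td></tr>
              <tr><td>3</td><td>nexus-interrogators</td><td>0.865</td></tr>
              <tr><td>4</td><td>yangjlg</td><td>0.845</td></tr>
              <tr><td>5</td><td>cnlp-nits-pp</td><td>0.825</td></tr>
              <tr><td>6</td><td>unibuc-nlp</td><td>0.828</td></tr>
              <tr><td>7</td><td>moadmoad</td><td>0.822</td></tr>
              <tr><td>8</td><td>iimasnlp</td><td>0.838</td></tr>
              <tr><td>9</td><td>bohan-li</td><td>0.848</td></tr>
              <tr><td>10</td><td>advacheck</td><td>0.802</td></tr>
              <tr><td>11</td><td>hello-world</td><td>0.838</td></tr>
              <tr><td>x</td><td>Baseline TF-IDF</td><td>0.838</td></tr>
              <tr><td>12</td><td>xlbniu</td><td>0.794</td></tr>
              <tr><td>13</td><td>shushantatud</td><td>0.823</td></tr>
              <tr><td>14</td><td>ds-gt-pan</td><td>0.803</td></tr>
              <tr><td>15</td><td>styloch</td><td>0.793</td></tr>
              <tr><td>16</td><td>felix-volpel</td><td>0.815</td></tr>
              <tr><td>17</td><td>sinai-inta</td><td>0.811</td></tr>
              <tr><td>18</td><td>pindrop</td><td>0.782</td></tr>
              <tr><td>19</td><td>diveye</td><td>0.786</td></tr>
              <tr><td>20</td><td>s-titze</td><td>0.797</td></tr>
              <tr><td>x</td><td>Baseline Binoculars</td><td>0.760</td></tr>
              <tr><td>21</td><td>iunlp</td><td>0.734</td></tr>
              <tr><td>22</td><td>hiwiy</td><td>0.765</td></tr>
              <tr><td>23</td><td>team-a</td><td>0.603</td></tr>
              <tr><td>x</td><td>Baseline PPMd</td><td>0.636</td></tr>
              <tr><td>24</td><td>asdkklkk</td><td>0.718</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p><sup>2</sup> The package is available at: https://github.com/Halvani/Constituent-Treelib</p>
        <p><sup>3</sup> Note that the Brier score is actually the inverted value, i. e., 1 - Brier, such that large values correspond to better performance.</p>
        <p>Although we had no insight into the test dataset, both the training and the validation dataset were
unbalanced regarding the ratio of AI-generated to human-authored texts, as presented in Section
2. This might indicate that the test dataset was assembled with a similar imbalance. In the context of
machine learning, it is widely acknowledged that using accuracy as an evaluation metric on unbalanced
test sets is not advisable. This is because accuracy can be overly influenced by the majority class,
potentially masking poor performance on the minority class. Consequently, the use of c@1 (which is
equivalent to accuracy in our case) is of limited value and does not provide a reliable assessment of model
performance.</p>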
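        <p>The equivalence of c@1 and accuracy in the absence of non-predictions can also be checked numerically. The following minimal sketch implements the c@1 definition from [11]; the function names and the confusion-matrix counts are chosen for illustration only.</p>
        <preformat><![CDATA[
```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1: unanswered instances are credited with the accuracy on the answered ones."""
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts of a detector that always answers.
tp, tn, fp, fn = 50, 30, 12, 8
n = tp + tn + fp + fn

# Without non-predictions, c@1 coincides with accuracy.
print(c_at_1(tp + tn, 0, n), accuracy(tp, tn, fp, fn))  # 0.8 0.8

# With 10 non-predictions among the same 100 instances, c@1 rewards abstaining.
with_abstentions = c_at_1(80, 10, 100)
```
]]></preformat>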
        <p>Beyond the limitations of c@1, there are additional concerns within the evaluation framework that
merit attention in the context of this year's competition. Notably, F<sub>1</sub> introduces another drawback, as
this measure is not symmetric with respect to the class labels. In other words, F<sub>1</sub> depends on the
designation of the positive class, which in the context of the competition is defined as "AI-generated
texts." In this case, the positive class also constitutes the majority class, which leads to counterintuitive
effects when interpreting the F<sub>1</sub> score. Figure 8 contrasts the F<sub>1</sub> score with the accuracy; the
underlying class counts arise from the AI-generated and human-written texts in the validation data, treated as
positives and negatives, respectively. When the positive class predominates, the two surfaces become
almost indistinguishable: the scarcity of negatives makes the false-positive term negligible, so both
metrics depend almost entirely on the true-positive rate. In this high-prevalence regime, the F<sub>1</sub> score
collapses to a mildly non-linear yet very close approximation of accuracy, and the usual distinction
between overall correctness and precision-recall balance effectively disappears.</p>
        <p>[Figure 8: Isometric plots for F<sub>1</sub> and for accuracy; horizontal axis: false positive rate.]</p>
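        <p>The dependence of F<sub>1</sub> on the choice of the positive class can be illustrated with the class sizes of the validation data from Section 2. The detector rates below (90% of the AI-generated and 30% of the human-written texts flagged as AI-generated) are hypothetical, and the function names are chosen for this sketch only.</p>
        <preformat><![CDATA[
```python
def confusion_counts(tpr, fpr, n_pos, n_neg):
    """Expected (TP, TN, FP, FN) for given rates and class sizes."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp, n_neg - fp, fp, n_pos - tp


def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Validation set sizes from Section 2.
N_AI, N_HUMAN = 2312, 1277

# Hypothetical detector with "AI-generated" as the positive class.
tp, tn, fp, fn = confusion_counts(tpr=0.9, fpr=0.3, n_pos=N_AI, n_neg=N_HUMAN)

f1_ai_positive = f1_score(tp, fp, fn)
# Same predictions, but "human-written" is declared the positive class:
# TP and TN swap roles, and so do FP and FN.
f1_human_positive = f1_score(tn, fn, fp)

print(round(accuracy(tp, tn, fp, fn), 3))
print(round(f1_ai_positive, 3), round(f1_human_positive, 3))
```
]]></preformat>
        <p>For these hypothetical rates, the accuracy is identical under both labelings (about 0.83), whereas F<sub>1</sub> is about 0.87 with the majority class as positive and about 0.74 with the minority class as positive.</p>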
        <p>As a simple countermeasure, we propose to use balanced accuracy instead, which is defined as [12]:
<disp-formula><tex-math><![CDATA[\mathrm{Balanced\ Accuracy} = \frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} + \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}\right)]]></tex-math></disp-formula>
We can use the deviation between the ranking according to balanced accuracy and the original ranking as an indicator of how strong the imbalance
in the test dataset might be and, as a consequence, how reliable the published results are. A more
in-depth analysis reveals that the average absolute rank difference is 2.4 positions. It can be seen that,
for example, the top-10 approaches mostly remain in the top 10. The biggest decrease concerns the
(originally) second approach, which ranks at place 8 according to the balanced accuracy. Our approach
stays about the same, with a slight decrease from rank 16 to 18.</p>
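        <p>To illustrate why balanced accuracy is more informative on unbalanced data, the following sketch evaluates a trivial detector that labels every text as AI-generated, using the class sizes of the validation data from Section 2. The function names are chosen for this example.</p>
        <preformat><![CDATA[
```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the true positive rate and the true negative rate."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Trivial detector: every text is labeled "AI-generated" (the positive class).
# Validation data: 2,312 AI-generated and 1,277 human-written texts.
tp, fn = 2312, 0
fp, tn = 1277, 0

print(round(accuracy(tp, tn, fp, fn), 3))   # 0.644
print(balanced_accuracy(tp, tn, fp, fn))    # 0.5
```
]]></preformat>
        <p>The trivial detector reaches an accuracy equal to the share of the majority class, while its balanced accuracy stays at chance level.</p>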
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Discussion on the method</title>
        <p>The proposed method, Adept, is easy to implement and is based on an intuitive logic that leads to
fairly good results. This supports our assumption that constituent trees contain relevant information
about the grammatical structure that is characteristic of an author and can therefore be used for the given
classification task. There are two lines along which we can envision straightforward
improvements.</p>
        <p>The generation of the constituent trees does not scale well to long texts. To counteract this,
one has to either search for a more efficient implementation or apply parallelization. Another
straightforward workaround is to sample sentences of a text at random and only compute the trees for this
sample. Obviously, this goes hand in hand with a loss of information.</p>
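        <p>The sampling workaround could look as follows. This is a sketch of the idea only; we did not use it in our submission, and the function name, the fixed seed and the choice to keep the original sentence order are assumptions.</p>
        <preformat><![CDATA[
```python
import random


def sample_sentences(sentences, k, seed=0):
    """Return at most k sentences, drawn at random without replacement.

    The original order of the chosen sentences is kept, and a fixed seed makes
    the selection reproducible.
    """
    if len(sentences) <= k:
        return list(sentences)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(sentences)), k))
    return [sentences[i] for i in chosen]


sentences = [f"Sentence {i}." for i in range(1, 21)]
subset = sample_sentences(sentences, k=5)
print(len(subset))  # 5
```
]]></preformat>
        <p>Only the sampled sentences would then be passed to the constituent parser, which bounds the parsing cost per text by k sentences.</p>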
        <p>Another aspect concerns the constituent tree representation. The considered path lengths of 1 up to 7,
used to generate the tree <italic>N</italic>-grams, were chosen a priori. However, we did not optimize the model by considering
subsets of these lengths or even longer paths. Furthermore, one might think of involving graph neural networks
to classify the constituent trees or to generate some kind of embeddings. However, it remains an
open question how to aggregate the information of multiple constituent trees into one common
representation of an entire text. In [13], Vinyals et al. developed an attention-enhanced
sequence-to-sequence model for syntactic constituency parsing based on an LSTM encoder-decoder framework.
To process constituency trees, they used a linearized representation based on a depth-first traversal
order. Hence, they have presented an efficient way to process as well as to generate constituency trees,
something that might also be applicable to the use case presented here.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented our solution to this year's PAN competition on AI-generated text detection.
Given a document that was either AI-generated or human-written, a model has to decide which
is the case. The core element of our approach is the concept of the constituent tree. A constituent tree is a
hierarchical representation of the grammatical sub-phrases of a sentence. The leaf nodes represent the
single words, although in our implementation they have been replaced with the respective words' part-of-speech
tags. By generating such a tree for each sentence of a text, we obtain a rich pool of information
that captures the grammatical and syntactical complexity of the sentences of a text. We represent such a
set of trees by the distribution of occurring paths of a specified length in any tree belonging to the same
text. The resulting dataset of path distributions, representing each text of the corpus, was used to train
a neural network to classify whether a text was AI-generated or human-written. In the final evaluation,
we obtain fairly good results, achieving an overall mean score of 0.843. Possible improvements of this
approach concern the computational effort of constructing the constituent trees as well as other
ways to numerically represent such trees, potentially by incorporating other neural network structures
such as graph neural networks.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>This research work was supported by the National Research Center for Applied Cybersecurity ATHENE.
ATHENE is funded jointly by the German Federal Ministry of Education and Research and the Hessian
Ministry of Higher Education, Research and the Arts.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this paper, the authors used no GenAI tools.</p>
      <p>[1] J. Bevendorff, D. Dementieva, M. Fröbe, B. Gipp, A. Greiner-Petter, J. Karlgren, M. Mayerl, P. Nakov, A. Panchenko, M. Potthast, A. Shelmanov, E. Stamatatos, B. Stein, Y. Wang, M. Wiegmann, E. Zangerle, Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection, in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2025.</p>
      <p>[2] J. Bevendorff, Y. Wang, J. Karlgren, M. Wiegmann, M. Fröbe, A. Tsivgun, J. Su, Z. Xie, M. Abassy, J. Mansurov, R. Xing, M. N. Ta, K. A. Elozeiri, T. Gu, R. V. Tomar, J. Geng, E. Artemova, A. Shelmanov, N. Habash, E. Stamatatos, I. Gurevych, P. Nakov, M. Potthast, B. Stein, Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2025, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.</p>
      <p>[3] S. S. Ghosal, S. Chakraborty, J. Geiping, F. Huang, D. Manocha, A. S. Bedi, Towards Possibilities &amp; Impossibilities of AI-generated Text Detection: A Survey, 2023. URL: https://arxiv.org/abs/2310.15264. arXiv:2310.15264.</p>
      <p>[4] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, T. Goldstein, A Watermark for Large Language Models, 2024. URL: https://arxiv.org/abs/2301.10226. arXiv:2301.10226.</p>
      <p>[5] A. A. Ayele, N. Babakov, J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, D. Moskovskiy, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, N. Rizwan, P. Rosso, F. Schneider, A. Smirnova, E. Stamatatos, E. Stakovskii, B. Stein, M. Taulé, D. Ustalov, X. Wang, M. Wiegmann, S. M. Yimam, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, G. M. D. Nunzio, L. Soulier, P. Galuscakova, A. G. S. Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 15th International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.</p>
      <p>[6] E. Tavan, M. Najafi, Marsan at PAN: BinocularsLLM, Fusing Binoculars’ Insight with the Proficiency of Large Language Models for Machine-Generated Text Detection, in: Conference and Labs of the Evaluation Forum, 2024. URL: https://api.semanticscholar.org/CorpusID:271801058.</p>
      <p>[7] A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, T. Goldstein, Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text, 2024. URL: https://arxiv.org/abs/2401.12070. arXiv:2401.12070.</p>
      <p>[8] C. Anderson, B. Bjorkman, D. Denis, J. Doner, M. Grant, N. Sanders, A. Taniguchi, Essentials of Linguistics, 2nd Edition, Open textbook library, eCampusOntario, 2022. URL: https://ecampusontario.pressbooks.pub/essentialsoflinguistics2.</p>
      <p>[9] O. Halvani, Constituent Treelib - A Lightweight Python Library for Constructing, Processing, and Visualizing Constituent Trees, 2024. URL: https://github.com/Halvani/constituent-treelib. doi:10.5281/zenodo.10951644.</p>
      <p>[10] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.</p>
      <p>[11] E. Stamatatos, W. Daelemans, B. Verhoeven, M. Potthast, B. Stein, P. Juola, M. Sanchez-Perez, A. Barrón-Cedeño, Overview of the Author Identification Task at PAN 2014, in: L. Cappellato, N. Ferro, M. Halvey, W. Kraaij (Eds.), Working Notes Papers of the CLEF 2014 Evaluation Labs, volume 1180 of Lecture Notes in Computer Science, 2014. URL: https://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-StamatosEt2014.pdf.</p>
      <p>[12] K. H. Brodersen, C. S. Ong, K. E. Stephan, J. M. Buhmann, The Balanced Accuracy and Its Posterior Distribution, in: 2010 20th International Conference on Pattern Recognition, 2010, pp. 3121–3124. doi:10.1109/ICPR.2010.764.</p>
      <p>[13] O. Vinyals, L. u. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton, Grammar as a Foreign Language, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015. URL: https://proceedings.neurips.cc/paper_files/paper/2015/file/277281aada22045c03945dcb2ca6f2ec-Paper.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality</source>
          , Multimodality, and Interaction.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>