=Paper=
{{Paper
|id=Vol-3740/paper-289
|storemode=property
|title=BertT: A Hybrid Neural Network Model for Generative AI Authorship Verification
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-289.pdf
|volume=Vol-3740
|authors=Zepeng Wu,Wenyin Yang,Li Ma,Zikai Zhao
|dblpUrl=https://dblp.org/rec/conf/clef/WuYMZ24
}}
==BertT: A Hybrid Neural Network Model for Generative AI Authorship Verification==
Notebook for PAN at CLEF 2024
Zepeng Wu, Wenyin Yang*, Li Ma and Zikai Zhao
Foshan University, Foshan, China
Abstract
With the rapid development and widespread adoption of Large Language Models (LLMs), distinguishing
between human-authored and machine-generated texts has become increasingly complex. Although
various classification methods have been devised to help identify the origins of texts, they often fail to
address the fundamental feasibility and inherent challenges of the task. Building on extensive
experience in the field of authorship verification, this study introduces BertT, a novel hybrid model that
combines BERT and Transformer technologies, specifically designed for the Generative AI Authorship
Verification Task organized in collaboration with PAN and ELOQUENT Labs. This task requires
accurately identifying human-authored texts from pairs, one written by a human and the other
generated by a machine. Leveraging the deep semantic understanding capabilities of BERT and the
efficient sequence processing power of Transformers, our model, BertT, significantly outperforms
existing baseline models such as Fast-DetectGPT.
Keywords
PAN 2024, Generative AI Authorship Verification, BERT, Transformer
1. Introduction
Text classification is a cornerstone of Natural Language Processing (NLP), with authorship
verification serving as a pivotal application in this domain. This process is crucial for validating
the authenticity of documents, detecting plagiarism, and identifying the origins of articles,
thereby preserving the integrity of written content across various fields. The Generative AI
Authorship Verification Task at PAN@CLEF 2024 [1], which builds upon previous challenges,
aims specifically to differentiate between human-authored and machine-generated texts. This
task is increasingly pertinent as Large Language Models (LLMs) like GPTs now produce high-
quality text that closely mimics human writing, thereby presenting substantial challenges in
differentiation.
The utility of authorship verification has been demonstrated in various contexts, underscoring
its adaptability and critical importance. For example, Halvani et al. explore the use of compression
models for authorship verification, highlighting their effectiveness in digital text forensics
without relying on complex machine learning algorithms or extensive feature engineering [2].
This approach is well-aligned with our need for efficient and scalable solutions to manage the
vast amounts of text generated by LLMs. Similarly, Bevendorff et al. have adapted the unmasking
method to short texts, significantly reducing the amount of material required for effective
authorship verification, thereby making it applicable to more practical scenarios [3]. Additionally,
the challenge of distinguishing between machine-generated and human-authored content is
accentuated in the work by Bao et al., who developed Fast-DetectGPT. This model improves the
efficiency of detecting machine-generated text through the innovative use of conditional
probability curvature, thereby reducing computational costs while maintaining high accuracy [4].
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
2112203031@stu.fosu.edu.cn (Z. Wu); cswyyang@fosu.edu.cn (W. Yang); molly_917@163.com (L. Ma);
gzjbzzk@163.com (Z. Zhao)
ORCID: 0009-0004-5756-9713 (Z. Wu); 0000-0003-4842-9060 (W. Yang); 0000-0002-5013-052X (L. Ma);
0009-0006-7120-3958 (Z. Zhao)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
This advancement is particularly relevant to our study as it addresses similar challenges
concerning processing efficiency and accuracy. Moreover, the broad application of neural
networks in text classification tasks is exemplified by Yang et al. and Yuan et al. in their respective
studies on profiling irony and stereotype spreaders on Twitter. These studies employ RNN and
CNN models to classify complex social media content, offering insights into the adaptability of
neural network architectures for varied NLP tasks [5], [6].
In response to the complexities of authorship verification in the era of LLMs, we developed
Bert_T, a novel model that marries the deep semantic understanding capabilities of BERT with
the efficient sequential data processing power of Transformer architecture. This model employs
a sophisticated contrastive learning approach with an advanced loss function, aimed at enhancing
the discrimination between human and machine text. It operates on a dataset formatted in pairs
"(text1, text2, label)", training to detect subtle nuances that signify distinct authorship styles.
To evaluate its effectiveness, we submitted our model to the TIRA.io platform [7], which
provides a stringent and controlled testing environment for fair and transparent benchmarking
against established baselines. This preliminary submission was crucial for assessing the model's
real-world applicability and refining its performance based on unbiased feedback. Bert_T
demonstrated superior performance across several key metrics, achieving a ROC-AUC of 0.967, a
Brier score of 0.903, a C@1 of 0.869, an F1 score of 0.869, and an F0.5u of 0.872, culminating in
an overall mean score of 0.896. These results significantly surpassed those of other baseline
models such as Fast-DetectGPT (Mistral), PPMd, Unmasking, and Fast-DetectGPT, underscoring
Bert_T's enhanced ability to discern between human and machine-generated texts. This
success highlights the efficacy of our approach in tackling the complexities of generative AI
authorship verification.
2. Dataset
The dataset for the Generative AI Authorship Verification Task at PAN@CLEF 2024 plays a crucial
role in training and validating the efficacy of our Bert_T model. This year, the dataset comprises
a diverse array of text genres, reflecting a mix of both real and synthetically generated content.
The primary sources of data include news articles, Wikipedia introduction texts, and pieces of
fanfiction, which provide a rich variety in style, structure, and complexity. Additionally, PAN
participants receive a bootstrap dataset that includes real and fabricated news articles covering
various 2021 U.S. news headlines, designed to simulate scenarios that models might encounter in
practical applications.
The data, sourced from contributions by ELOQUENT participants, is meticulously curated to
ensure a balanced representation of human and machine-authored texts. The bootstrap dataset
is formatted as newline-delimited JSON files, where each file contains a list of articles. These
articles are authored either by one or more human authors or entirely by an AI, specifically
Google's Gemini Pro model. The dataset structure is pivotal for the task, as it contains pairs of
texts where each pair is written on the same topic but by different authors: one human and one
machine. The file format for these pairs is demonstrated below:
{"id": "gemini-pro/news-2021-01-01-2021-12-31-kabulairportattack/art-081", "text": "..."}
{"id": "gemini-pro/news-2021-01-01-2021-12-31-capitolriot/art-050", "text": "..."}
Each text pair in the dataset is meticulously labeled with `0` or `1`, indicating whether the texts
are from the same author, thereby facilitating supervised learning. The test dataset is provided in
a slightly altered format to challenge the model's ability to generalize. Instead of individual files,
it is delivered as a single JSONL file where each line contains a pair of texts. The content of this
file is arranged such that the identities of the authors are anonymized, and the order of texts
scrambled:
{"id": "iixcWBmKWQqLAwVXxXGBGg", "text1": "...", "text2": "..."}
{"id": "y12zUebGVHSN9yiL8oRZ8Q", "text1": "...", "text2": "..."}
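The pair format above can be parsed with a few lines of standard-library Python. The helper name `parse_pairs` is our own illustration; the field names (`id`, `text1`, `text2`) follow the sample lines shown.

```python
import json
from typing import Iterable, List, Tuple

def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Parse newline-delimited JSON pair records of the form
    {"id": ..., "text1": ..., "text2": ...} into (id, text1, text2) tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((record["id"], record["text1"], record["text2"]))
    return pairs
```

In practice the JSONL test file would be opened and its lines passed to this helper; each returned tuple is then scored by the model.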
Participants are tasked with predicting which of the two texts in each pair is human-authored.
This setup tests the model’s ability to discern subtle linguistic and stylistic nuances that typically
distinguish human writing from its AI-generated counterpart. Access to the dataset is regulated
via Zenodo, where participants must register and request access using their Tira-registered email,
ensuring that the use of this data remains confined to research purposes and that no
redistribution occurs. This controlled distribution ensures compliance with copyright regulations
and maintains the integrity of the data for academic and developmental uses.
3. Methodology
3.1. Dataset Preprocessing
Effective data preprocessing is essential for the robust performance of machine learning models,
particularly in tasks involving natural language processing such as authorship verification. For
the Generative AI Authorship Verification Task at PAN@CLEF 2024, our preprocessing routine
involved several critical steps to enhance the quality and consistency of model inputs.
Initially, the text normalization process involved removing all punctuation and converting text
to lowercase to reduce variability and focus the model's learning on substantive content. This was
coupled with the removal of non-alphabetic characters and numerals to ensure that the model
trained strictly on textual elements. Following normalization, stopwords (common words that
typically do not contribute to the identification of authorship) were removed to minimize data
noise and enhance focus on more distinctive text features. After cleaning the texts, the corpus was
tokenized into individual words or tokens, which is essential for structuring raw text into a format
suitable for machine learning models. The texts were then vectorized using a pre-trained BERT
tokenizer, which also standardized token sequence lengths through padding and truncation to
optimize computational efficiency. To address the challenge of limited training data, we
implemented data augmentation techniques to artificially expand the dataset, creating new text
pairs from existing ones by subtly modifying texts while preserving their key attributes. This
approach helped improve the model’s generalization capabilities from training scenarios to real-
world applications.
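The cleaning steps above can be sketched as follows. The stopword list here is a small illustrative stand-in, since the paper does not name the list it used, and the subsequent BERT tokenization step is noted in a comment rather than executed.

```python
import re

# Illustrative stopword list; the actual list used in the paper is not specified.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and digits, and drop stopwords,
    mirroring the normalization steps described above."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep alphabetic characters only
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

# The cleaned text would then be vectorized with the pre-trained BERT tokenizer,
# e.g. BertTokenizer.from_pretrained("bert-base-uncased")(cleaned,
#                                                         padding="max_length",
#                                                         truncation=True)
```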
Throughout the preprocessing stages, we meticulously ensured that the alterations did not
compromise the semantic integrity or the stylistic attributes of the texts, which are crucial for
authorship identification. This comprehensive preprocessing not only prepared the dataset for
effective training of our Bert_T model but also enhanced the model’s accuracy in distinguishing
between human and machine-generated texts, a critical aspect of the verification task.
3.2. Network Architecture
In this study, we introduced Bert_T, a hybrid neural network model that integrates BERT-base
for robust feature extraction with a Transformer encoder to handle attention-based dynamics,
specifically tailored for distinguishing between human-written and machine-generated texts. We
employ the bert-base-uncased model from Hugging Face's Transformers library as our
foundational pre-trained BERT layer, leveraging its well-established capabilities in natural
language understanding. This layer focuses on the CLS token embedding to capture
comprehensive textual context, which is then processed through a Dropout layer to prevent
overfitting and enhance generalizability. The Transformer Encoder, equipped with a multi-head
attention mechanism, dynamically integrates information across text segments, crucial for
identifying subtle linguistic and stylistic nuances. During testing, Bert_T processes each text in a
pair independently in a JSONL format, evaluating the likelihood of each text being human-written
and comparing these scores to classify texts; the text with the higher score is deemed human-
authored. Optimization of model parameters such as learning rate and batch size, along with the
use of Binary Cross-Entropy Loss, fine-tunes the model's accuracy, ensuring it performs
effectively on metrics such as ROC-AUC and Brier scores. This configuration enables Bert_T to
meet the specific challenges of the Generative AI Authorship Verification Task at PAN@CLEF
2024, demonstrating both innovative theoretical approaches and practical discriminative
capabilities, as illustrated in Figure 1: Bert_T Architecture.
Figure 1: Bert_T Architecture
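The architecture described above can be sketched in PyTorch roughly as follows. The hidden size (768) and four attention heads come from Section 4.1; the number of encoder layers and the dropout rate are assumptions, and the class is a sketch of the described design, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertT(nn.Module):
    """BERT features -> dropout -> Transformer encoder -> linear score."""

    def __init__(self, bert: BertModel, hidden: int = 768, heads: int = 4,
                 layers: int = 1, dropout: float = 0.1):  # layers/dropout assumed
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        hidden = self.dropout(hidden)
        enc = self.encoder(hidden, src_key_padding_mask=attention_mask == 0)
        score = self.classifier(enc[:, 0, :])    # read off the [CLS] position
        return torch.sigmoid(score).squeeze(-1)  # probability of "human-written"
```

A production model would wrap `BertModel.from_pretrained("bert-base-uncased")`. At test time each text in a pair is scored independently and the one with the higher probability is labeled human-authored, as described above.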
4. Experiments and Results
4.1. Experimental Setting
In our experimental setup for evaluating the Bert_T model's ability to distinguish between
human-authored and machine-generated texts, we preprocessed the dataset and divided it into
training and testing sets with a 7:3 ratio. The model integrates a pretrained BERT base
model with a Transformer layer tailored for sequence classification, featuring 768 hidden units,
four attention heads, and a linear classifier. Training parameters were meticulously set, with a
batch size of 8 and a learning rate of 1e-6 over 300 epochs using the AdamW optimizer on CUDA-
capable GPUs to balance computational efficiency and learning depth.
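A minimal training loop under the stated settings (AdamW, learning rate 1e-6, 300 epochs, binary cross-entropy) might look like the sketch below. The loop structure is an assumption, not the authors' code; the loader is assumed to yield `(inputs, labels)` batches of size 8, and the model is assumed to output per-example probabilities (adapting the call to `(input_ids, attention_mask)` pairs is straightforward).

```python
import torch
from torch import nn
from torch.optim import AdamW

def train(model: nn.Module, loader, epochs: int = 300, lr: float = 1e-6,
          device: str = "cpu") -> nn.Module:
    """Fit a model that outputs per-example probabilities, using the
    reported settings: AdamW optimizer and binary cross-entropy loss."""
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            probs = model(inputs.to(device)).squeeze(-1)
            loss = loss_fn(probs, labels.to(device).float())
            loss.backward()
            optimizer.step()
    return model
```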
4.2. Metrics
Our evaluation framework was meticulously designed to rigorously assess the performance of
the Bert_T model across several metrics that reflect its effectiveness in distinguishing between
human-authored and machine-generated texts. The model was evaluated using a standard set of
metrics that are commonly employed in authorship verification tasks, including ROC-AUC, Brier
score, C@1, F1, and F0.5u, along with the arithmetic mean of these metrics to provide a
comprehensive overview of performance.
Performance Metrics:
ROC-AUC measures the area under the receiver operating characteristic curve, providing
insight into the model's ability to discriminate between classes across all thresholds [8]. The ROC
curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold
settings. The formula is given by:
ROC-AUC = ∫₀¹ TPR(t) d(FPR(t))    (1)
Brier Score evaluates the mean squared error of the probabilities assigned, indicating the
accuracy of probability predictions [9]. The lower the Brier score, the better, as it reflects a closer
proximity to the true outcome. It is calculated as:
Brier Score = (1/N) · Σᵢ₌₁ᴺ (predicted probabilityᵢ − actual outcomeᵢ)²    (2)
C@1 represents a modified accuracy that treats non-answers (predictions with a confidence
score of 0.5) by averaging the accuracy of the remaining cases, thus penalizing uncertainty [10].
This metric is particularly useful in situations where making no prediction is preferable to making
an incorrect prediction. The formula is:
C@1 = Number of correct answers / (Total number of cases − Number of non-answers) + Number of non-answers / Total number of cases    (3)
F1 Score is the harmonic mean of precision and recall, offering a balance between the
precision of the classifier and its recall capability [11]. It is particularly useful in situations where
an equal balance between precision and recall is desired. The formula is:
F1 = 2 · (Precision × Recall) / (Precision + Recall)    (4)
where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
F0.5u is a variant of the F-measure that weights precision more than recall, suitable for
scenarios where false positives are more costly than false negatives [12]. It is calculated using the
formula:
F0.5u = (1 + 0.5²) · (Precision × Recall) / (0.5² · Precision + Recall)    (5)
These metrics collectively provided a robust framework for evaluating our model, enabling us
to effectively measure its ability to perform authorship verification across different dimensions
of accuracy and reliability.
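These metrics can be computed with short standard-library functions; the sketch below implements the definitions above, treating scores as probabilities of the positive class and a score of exactly 0.5 as a non-answer for C@1 (following the Peñas-Rodrigo definition in [10]). Since the reported tables treat every metric as higher-is-better, the Brier score is shown here as its complement, 1 − mean squared error; this reading of the evaluator's convention is our assumption.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_complement(scores, labels):
    """1 minus the mean squared error of the probabilities (higher is better)."""
    n = len(scores)
    return 1.0 - sum((s - y) ** 2 for s, y in zip(scores, labels)) / n

def c_at_1(scores, labels):
    """Penas-Rodrigo C@1: a score of exactly 0.5 counts as a non-answer."""
    n = len(scores)
    nc = sum(1 for s, y in zip(scores, labels)
             if s != 0.5 and (s > 0.5) == (y == 1))  # correct answered cases
    nu = sum(1 for s in scores if s == 0.5)          # non-answers
    return (nc + nu * nc / n) / n

def f1(scores, labels, threshold=0.5):
    """Harmonic mean of precision and recall at the given threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```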
4.3. Results
Our Bert_T model demonstrated robust performance in the PAN 2024 Voight-Kampff Generative
AI Authorship Verification task, showcasing substantial effectiveness across several critical
metrics. As evidenced in Table 1, Bert_T achieved a ROC-AUC of 0.967, which, while slightly lower
than the top-performing Baseline Binoculars at 0.972, reflects a high level of discriminative
capability. The Brier score for Bert_T was 0.903, indicating reliable probability predictions of
class membership, although it did not surpass the Baseline Binoculars, which scored 0.957.
Regarding precision-related metrics, Bert_T recorded scores of 0.869 for both C@1 and F1, and
0.872 for F0.5u, remaining competitive although below the near-perfect scores around the 95th
percentile.
Table 2 presents an overview of Bert_T’s mean accuracy across nine test set variants, showing
considerable stability and less variability in performance compared to other models which
displayed more significant fluctuations. Bert_T maintained a minimum accuracy of 0.354 and a
maximum of 0.980, with a notable median performance of 0.892, and the 25th and 75th
percentiles at 0.864 and 0.896, respectively. These figures underscore Bert_T's robust
performance across different testing scenarios, highlighting its efficacy in handling the complex
demands of the verification task.
In terms of competition standings, our submission ranked 20th out of 30 participants on the
official PAN 2024 leaderboard. Notably, Bert_T outperformed all but one baseline with a ranking
score over all test datasets of 0.608, as detailed on the PAN 2024 leaderboard. This ranking
underscores our model's competitive edge and its significant discriminative power in a
challenging environment filled with diverse and sophisticated entries.
These results affirm that Bert_T not only embodies theoretical innovation but also exhibits
significant practical capabilities in the authorship verification domain. The model’s ability to
effectively discern between human and machine-generated texts makes it a valuable tool for
complex text analysis tasks. Future work will focus on further optimizing model parameters,
enhancing feature engineering techniques, and expanding the diversity of the training dataset to
boost the model's generalizability and performance across varied textual contexts. This
continuous improvement aims to refine Bert_T’s capabilities for higher detection accuracy and
broader application scope in real-world scenarios.
Table 1: The final performance of our submission on PAN 2024 (Voight-Kampff Generative AI
Authorship Verification)
Approach ROC-AUC Brier C@1 F1 F0.5u Mean
Bert_T 0.967 0.903 0.869 0.869 0.872 0.896
Baseline Binoculars 0.972 0.957 0.966 0.964 0.965 0.965
Baseline Fast-DetectGPT (Mistral) 0.876 0.8 0.886 0.883 0.883 0.866
Baseline PPMd 0.795 0.798 0.754 0.753 0.749 0.77
Baseline Unmasking 0.697 0.774 0.691 0.658 0.666 0.697
Baseline Fast-DetectGPT 0.668 0.776 0.695 0.69 0.691 0.704
95-th quantile 0.994 0.987 0.989 0.989 0.989 0.990
75-th quantile 0.969 0.925 0.950 0.933 0.939 0.941
Median 0.909 0.890 0.887 0.871 0.867 0.889
25-th quantile 0.701 0.768 0.683 0.657 0.670 0.689
Min 0.131 0.265 0.005 0.006 0.007 0.224
Table 2: Overview of the mean accuracy over 9 variants of the test set
Approach Minimum 25-th Quantile Median 75-th Quantile Max
Bert_T 0.354 0.864 0.892 0.896 0.980
Baseline Binoculars 0.342 0.818 0.844 0.965 0.996
Baseline Fast-DetectGPT (Mistral) 0.095 0.793 0.842 0.931 0.958
Baseline PPMd 0.270 0.546 0.750 0.770 0.863
Baseline Unmasking 0.250 0.662 0.696 0.697 0.762
Baseline Fast-DetectGPT 0.159 0.579 0.704 0.719 0.982
95-th quantile 0.863 0.971 0.978 0.990 1.000
75-th quantile 0.758 0.865 0.933 0.959 0.991
Median 0.605 0.645 0.875 0.889 0.936
25-th quantile 0.353 0.496 0.658 0.675 0.711
Min 0.015 0.038 0.231 0.244 0.252
5. Conclusion
This paper details the development and evaluation of the Bert_T model, our innovative
contribution to the PAN 2024 Voight-Kampff Generative AI Authorship Verification task.
Combining BERT-based feature extraction with a Transformer encoder for attention processing,
Bert_T effectively differentiates between human-written and machine-generated texts. It
demonstrated strong performance across various metrics, achieving a ROC-AUC of 0.967 and a
Brier score of 0.903, which confirms its reliability in predictions. Despite stiff competition from
established baselines, Bert_T maintained consistent performance across different test set
variants, with accuracies ranging from a minimum of 0.354 to a maximum of 0.980. This
showcases its capability to handle diverse and complex textual scenarios effectively. Moving
forward, we plan to further refine Bert_T by optimizing its parameters, enhancing its feature
engineering techniques, and expanding its training dataset to cover a broader spectrum of text
types and genres. These efforts will not only improve the model’s performance in authorship
verification tasks but also extend its applicability to a wider range of natural language processing
challenges, aiming for higher detection accuracy and broader operational scope.
6. Acknowledgements
This work was supported by grants from the Guangdong-Foshan Joint Fund Project (No.
2022A1515140096) and Open Fund for Key Laboratory of Food Intelligent Manufacturing in
Guangdong Province (No. GPKLIFM-KF-202305).
References
[1] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D.
Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova,
E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024:
Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking
Analysis, and Generative AI Authorship Verification, in: Experimental IR Meets
Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International
Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science,
Springer, Berlin Heidelberg New York, 2024.
[2] O. Halvani, C. Winter, and L. Graner. "On the usefulness of compression models for authorship
verification." Proceedings of the 12th international conference on availability, reliability and
security. 2017.
[3] J. Bevendorff, B. Stein, M. Hagen, et al. "Generalizing unmasking for short texts." Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
[4] G. S. Bao, Y. B. Zhao, Z. Y. Teng, et al. "Fast-detectgpt: Efficient zero-shot detection of machine-
generated text via conditional probability curvature." arXiv preprint arXiv:2310.05130
(2023).
[5] Z. X. Yang, L. Ma, W. Y. Yang, et al. An Intelligent Detection Method for Irony and Stereotype
Based on Hybrid Neural Networks. In Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and
Martin Potthast, editors, CLEF 2022 Labs and Workshops, Notebook Papers, September 2022.
CEUR-WS.org.
[6] D. Yuan, W. Y. Yang, L. Ma, et al. Analysis of Irony and Stereotype Spreaders Based On
Convolutional Neural Networks. In Guglielmo Faggioli, Nicola Ferro, Allan Hanbury, and
Martin Potthast, editors, CLEF 2022 Labs and Workshops, Notebook Papers, September 2022.
CEUR-WS.org.
[7] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M.
Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L.
Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.),
Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023),
Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–
241. URL: https://link.springer.com/chapter/10.1007/978-3-031-28241-6_20.
doi:10.1007/978-3-031-28241-6_20.
[8] A. M. Carrington, D. G. Manuel, P. W. Fieguth, et al. "Deep ROC analysis and AUC as balanced
average accuracy, for improved classifier selection, audit and explanation." IEEE
Transactions on Pattern Analysis and Machine Intelligence 45.1 (2022): 329-341.
[9] W. Yang, J. Jiang, E. M. Schnellinger, et al. "Modified Brier score for evaluating prediction
accuracy for binary outcomes." Statistical methods in medical research 31.12 (2022): 2287-
2296.
[10] A. Peñas, A. Rodrigo, A simple measure to assess non-response (2011).
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, Scikit-learn: Machine learning in python, the Journal
of machine Learning research 12 (2011) 2825–2830.
[12] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Generalizing unmasking for short texts, in:
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers), 2019, pp. 654–659.