BLGAV: Generative AI Authorship Verification Model Based on BERT and BiLSTM
Notebook for PAN at CLEF 2024

Linjiu Guo, Wenyin Yang*, Li Ma, Jinli Ruan
Foshan University, Foshan, China

Abstract
In recent years, large language models like GPT-3, BERT, and GPT-4 have made significant advancements in the field of natural language processing, enhancing tasks such as document summarization, language translation, and question answering. Despite these benefits, the authenticity and credibility of texts generated by these models have raised societal concerns, including misinformation and plagiarism. To address these issues, the PAN organization has initiated a series of tasks to differentiate between machine-generated and human-written texts. This paper proposes a Generative AI Authorship Verification model based on BERT and BiLSTM, which enhances text discrimination capabilities by combining Transformer encoders with multi-text feature techniques. The model leverages a pretrained BERT for deep feature extraction and incorporates additional text features calculated with the spaCy library, further processed by BiLSTM and Transformer encoders for classification. Experimental results show that the model achieved a Mean score of 0.971 on the PAN validation dataset, surpassing all baseline models. This approach not only improves detection accuracy but also enhances adaptability to various text types, making it significant for maintaining the authenticity and reliability of information in the era of automatic content generation.

Keywords
PAN 2024, Generative AI Authorship Verification, Pretrained BERT, spaCy, Multi-Text Features

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France
1837247142@qq.com (L. Guo); cswyyang@163.com (W. Yang, *corresponding author); molly_917@fosu.edu.cn (L. Ma); 13104922330@163.com (J. Ruan)
ORCID: 0009-0008-7868-8972 (L. Guo); 0000-0003-4842-9060 (W. Yang); 0000-0002-5013-052X (L. Ma); 0009-0007-0411-8312 (J. Ruan)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In recent years, large language models such as GPT-3, BERT, ChatGPT, Llama 2, PaLM 2, and GPT-4 have demonstrated exceptional performance in the field of natural language processing. They are widely utilized in tasks such as document summarization, language translation, and question answering [1,2,3,4]. These models not only facilitate automated content creation and dialogue systems but also enhance efficiency across various industries, including customer service, education, law, and healthcare, through intelligent solutions [5,6]. However, with the widespread adoption of these technologies, issues related to the authenticity and credibility of texts generated by these models have increasingly attracted public attention. Key concerns include the spread of misinformation, the generation of nonsensical or misleading content, and plagiarism of intellectual property and original content, all of which are considered significant societal issues. In this context, the PAN organization [7] has launched a series of tasks to differentiate between machine-generated and human-written texts. This initiative not only aids in identifying and verifying the authenticity of texts but also effectively curbs the spread of misleading information and copyright infringement, thereby protecting the rights of information recipients.
Generative AI Authorship Verification is typically viewed as a binary classification problem: distinguishing whether a text was written by a human or generated by a machine. Some approaches, based on statistical features, classify texts by comparing the statistical characteristics of human-written and machine-generated texts, such as word frequency, syntactic features, and semantic similarity [8]. For instance, Wang et al. proposed a detection method based on word frequency and n-gram features [9]. While initially effective, its performance decreased significantly when faced with more complex generation models. Gehrmann et al. introduced manually designed statistical features [10], which have also shown some effectiveness in assisting humans in detecting machine-generated texts. Another approach involves fine-tuning pretrained language models. These methods fine-tune large-scale pretrained models, such as BERT and GPT-3, on extensive text data, enabling them to better capture subtle differences between texts. Methods based on pretrained language models typically exhibit higher detection accuracy and generalizability, and can adapt to different types of generative models and texts. However, they also face challenges, such as the need for substantial computational resources and data for training, and the high complexity and computational cost of the models.

This paper proposes a BERT and Bidirectional Long Short-Term Memory network (BiLSTM) based Generative AI Authorship Verification model (BLGAV), which enhances the ability to discriminate between machine-generated and human-written texts by combining Transformer encoders with multi-feature fusion techniques. The model first uses a pretrained BERT to extract deep textual features and integrates additional text features computed with the spaCy model, such as lexical diversity and average sentence length, to enhance its discriminative ability. It then processes these features further using BiLSTM [11] and Transformer encoders, and finally performs classification through a fully connected layer. The main contributions of this paper are as follows:

• Multi-text feature fusion method: the model not only relies on deep language features extracted by BERT but also enhances its discriminative ability by calculating multiple text features such as lexical diversity and average sentence length. This multi-feature fusion method improves the model's accuracy in recognizing generated text.
• Experimental results show that the model achieved a Mean score of 0.971 on the official validation dataset provided by the PAN laboratory, surpassing all five baseline models provided by the organizers.

2. Related Work

With the rapid development and application of large language models, detecting machine-generated text and verifying authorship have become significant research topics. Existing work mainly focuses on the following areas.

2.1. Unsupervised Methods Based on Statistical Features

To overcome the overfitting problem in supervised learning models, researchers have begun exploring unsupervised methods based on statistical features. These methods use statistical anomalies in the text to distinguish between machine-generated and human-written texts. For example, Lavergne et al. studied statistical anomalies in entropy [12], while Badaskar et al. used n-gram frequencies as detection features [13].
Gehrmann et al. introduced manually designed statistical features to assist humans in detecting machine-generated texts [10]. Solaiman et al. proposed a simple zero-shot method that detects machine-generated text by evaluating the log probability of each word and applying a threshold [14]. Mitchell et al. observed that machine-generated texts tend to occupy regions of negative curvature in the model's log-probability function and introduced DetectGPT. Although this method performs exceptionally well, it requires substantial computational resources [15].

2.2. Methods Using Pretrained Models

In recent years, pretrained models like BERT and RoBERTa have made significant progress in natural language processing tasks and have been applied to the task of detecting machine-generated texts. For example, Solaiman et al. introduced a GPT-2 detector by fine-tuning the RoBERTa model on outputs from GPT-2 [16]. Similarly, Guo et al. developed a ChatGPT detector by fine-tuning the RoBERTa model on the HC3 dataset to distinguish between human-written texts and texts generated by ChatGPT [17]. These methods demonstrate the effectiveness of fine-tuning pretrained models for specific tasks, but they also expose potential overfitting issues when the training data distribution differs from the actual application data distribution [18,19].

3. Methodology

In this paper, we first convert the data provided by PAN into a format suitable for model training through cleaning and formatting, to improve data quality. The model then uses a pretrained BERT to extract deep semantic features of the text and integrates additional text features calculated with the spaCy model, such as lexical diversity and average sentence length, which enhances the model's discriminative capability. Subsequently, these features are deeply processed by combining BiLSTM and Transformer encoders to capture complex text structures. Finally, the model classifies through a fully connected layer, effectively distinguishing between human-written and machine-generated texts.

3.1. Dataset Preprocessing

The dataset provided by PAN consists of two types of files: one written by human authors and the other generated by machines. Generative AI Authorship Verification is typically viewed as a binary classification problem, that is, distinguishing whether a text was produced by a human or a machine. We classify the texts and assign a label to each text, where human-written texts are marked as 0 and machine-generated texts as 1. The original data format is transformed from {"id": "...", "text": "..."} to {"text": "...", "label": "0 or 1"}. This process can be described as follows:

\{(id, text)\} \rightarrow \{(text, label)\}  (1)

The conversion rule is:

label = \begin{cases} 0 & \text{if } author = \text{human} \\ 1 & \text{if } author = \text{machine} \end{cases}  (2)

This indicates that we transform the original data format, which includes the text ID, text content, and author type (human or machine), into a format that only includes the text content and the corresponding label (0 or 1). Moreover, to enhance model training effectiveness, the following data cleaning steps were carried out, as sketched in the code after this list:

• Removing irrelevant information: clearing numbers, punctuation, and other distracting characters.
• Unifying text format: converting all text to lowercase and removing stop words.
• Improving feature quality: the above cleaning steps help the text more accurately reflect its language structure and features, facilitating effective feature extraction by the model.
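The following is a minimal sketch of this preprocessing, assuming the PAN files are JSONL with one {"id", "text"} record per line; the file names and the helper functions clean_text and convert are ours for illustration, not part of the official release.

```python
import json
import re
import string

import spacy

# en_core_web_sm supplies the stop-word list used for cleaning.
nlp = spacy.load("en_core_web_sm")

def clean_text(text: str) -> str:
    """Apply the Section 3.1 cleaning: strip numbers and punctuation,
    lowercase, and remove stop words."""
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.lower().split()
              if t not in nlp.Defaults.stop_words]
    return " ".join(tokens)

def convert(path: str, label: int) -> list[dict]:
    """Map {"id": ..., "text": ...} records to {"text": ..., "label": ...}."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            records.append({"text": clean_text(item["text"]), "label": label})
    return records

# Hypothetical file names: label 0 for human-written, 1 for machine-generated.
dataset = convert("human.jsonl", 0) + convert("machine.jsonl", 1)
```

The cleaned text is what is later tokenized by BERT and analyzed by spaCy (cf. Section 4.1).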
3.2. Network Architecture

Traditional unsupervised methods based on statistical features detect patterns by calculating word frequency, character frequency, word length, and sentence length. Although simple and easy to implement, these methods fail to capture deep semantic information and rely heavily on manually designed features. Their effectiveness is limited, making it difficult to handle the diversity and complexity of texts. Pretrained static embedding models (such as word2vec and GloVe) represent the semantic information of sentences by averaging or summing word vectors. This approach lacks contextual interaction, ignores the order of and dependencies between words, and cannot effectively handle polysemy and synonymy. Additionally, it fails to capture deep structural and contextual information within sentences, leading to shortcomings in text detection tasks.

To overcome the limitations of traditional methods, we designed a Generative AI Authorship Verification model based on BERT and BiLSTM. As shown in Figure 1, the model first utilizes the pretrained BERT model to extract deep semantic features from the text. These features capture the complex contextual relationships between words. The BERT model processes the input text sequence as follows:

H^L = [h^L_{[CLS]}, h^L_1, \ldots, h^L_T, h^L_{[SEP]}]  (3)

where H^L represents the output feature sequence of the BERT model at layer L, including the special tokens [CLS] and [SEP]. In this study, we used the contextual embeddings from the last layer of the BERT model. Specifically, we obtained the embeddings for each token (last_hidden_state) from the last layer of BERT. These contextual embeddings were concatenated with features calculated by spaCy. The concatenation was done at the token level, not at the layer or [CLS]-token level. That is, the embeddings for each token from the last layer of BERT were concatenated with the expanded spaCy feature representations and then used as input to the BiLSTM.

After extracting deep semantic features, this study used the en_core_web_sm model from the spaCy library to calculate additional text features. These features include lexical diversity, average sentence length, average word length, number of grammatical errors, sentiment tendency, repetition rate, and stop-word ratio. en_core_web_sm is a small English language model provided by spaCy, suitable for various natural language processing tasks such as tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. The additional text feature vector is:

V_{features} = \text{spaCy}[f_1, f_2, f_3, f_4, f_5, f_6, f_7]  (4)

where f_1 represents lexical diversity, f_2 average sentence length, f_3 average word length, f_4 the number of grammatical errors, f_5 sentiment tendency, f_6 repetition rate, and f_7 stop-word ratio. These features provide the model with the ability to analyze the text from different perspectives, enhancing its capability to distinguish between human-written and machine-generated texts.
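A sketch of this feature extraction is shown below. Note that en_core_web_sm itself provides tokenization, sentence segmentation, and stop-word flags, but neither a grammar checker nor a sentiment model; the paper does not name the tools used for f_4 and f_5, so they appear here as placeholders, and the repetition-rate definition is one plausible reading.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_features(text: str) -> list[float]:
    """Compute V_features = [f1, ..., f7] from Equation (4)."""
    doc = nlp(text)
    tokens = [t.text.lower() for t in doc if not t.is_space]
    n = max(len(tokens), 1)
    n_sents = max(len(list(doc.sents)), 1)

    f1 = len(set(tokens)) / n               # f1: lexical diversity (type/token ratio)
    f2 = len(tokens) / n_sents              # f2: average sentence length (tokens)
    f3 = sum(len(t) for t in tokens) / n    # f3: average word length (characters)
    f4 = 0.0  # f4: grammatical errors -- placeholder, needs an external checker
    f5 = 0.0  # f5: sentiment tendency -- placeholder, needs an external model
    f6 = 1.0 - f1                           # f6: repetition rate (1 - type/token ratio)
    f7 = sum(t.is_stop for t in doc if not t.is_space) / n  # f7: stop-word ratio
    return [f1, f2, f3, f4, f5, f6, f7]
```

In the model, these seven values are then expanded to 21 dimensions and fused with the BERT embeddings, as described next.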
The token-level contextual embeddings from the last layer of BERT have a dimension of (batch_size, sequence_length, 768). The additional text features extracted by spaCy form a 7-dimensional vector, which is expanded to 21 dimensions through a fully connected layer. The expanded spaCy features are then repeated at each time step to match the sequence length of the BERT output, resulting in a dimension of (batch_size, sequence_length, 21). Finally, the contextual embeddings from BERT and the expanded spaCy features are concatenated at the token level, resulting in a concatenated feature dimension of (batch_size, sequence_length, 789). The fused features are then fed into the BiLSTM for processing. The formulas are as follows:

F_{fused} = H^L \oplus V_{features}  (5)

\overrightarrow{h}_t = \overrightarrow{\text{LSTM}}(\overrightarrow{h}_{t-1}, h^L_t)  (6)

\overleftarrow{h}_t = \overleftarrow{\text{LSTM}}(\overleftarrow{h}_{t+1}, h^L_t)  (7)

where \overrightarrow{h}_t represents the hidden state obtained from the forward LSTM, \overleftarrow{h}_t the hidden state obtained from the backward LSTM, and h^L_t the representation of the fused features at time step t. The features processed by the BiLSTM are then fed into the Transformer encoder, followed by a fully connected layer for binary classification, producing the classification results:

H_{out} = \text{TransformerEncoder}(\overrightarrow{h}_t \oplus \overleftarrow{h}_t)  (8)

p(y_c \mid H_{out}) = \text{softmax}(W H_{out})  (9)

where H_{out} represents the optimized feature representation output by the Transformer encoder, p(y_c | H_{out}) the probability that the output belongs to class c, and W the weight matrix of the fully connected layer.

Through this hybrid feature approach, the BLGAV model not only improves recognition accuracy but also enhances its adaptability to different types and styles of text. This makes it a powerful tool for automatically detecting and classifying machine-generated content, which is of great significance in the current environment of information explosion and increasing automatic content generation, helping to maintain the authenticity and reliability of information.

Figure 1: Model Structure Diagram
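A minimal PyTorch sketch of this architecture follows. The layer sizes (768-dimensional BERT tokens, the 7-to-21 feature expansion, the 256-dimensional two-layer BiLSTM) come from the text; the number of attention heads, the encoder depth, and the mean pooling before the classifier are our assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BLGAV(nn.Module):
    """Sketch of the Figure 1 architecture. The 768 / 7->21 / 256 sizes
    follow the text; heads, encoder depth, and pooling are assumptions."""

    def __init__(self, n_heads: int = 4, n_layers: int = 1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.expand = nn.Linear(7, 21)                 # spaCy features: 7 -> 21
        self.bilstm = nn.LSTM(768 + 21, 256, num_layers=2,
                              batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model=512, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(512, 2)            # classes: human, machine

    def forward(self, input_ids, attention_mask, spacy_feats):
        # Token-level contextual embeddings H^L: (batch, seq, 768).
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # Expand the 7-d feature vector to 21-d and repeat it at each time step.
        f = self.expand(spacy_feats).unsqueeze(1).expand(-1, h.size(1), -1)
        fused = torch.cat([h, f], dim=-1)              # F_fused: (batch, seq, 789)
        out, _ = self.bilstm(fused)                    # (batch, seq, 2*256)
        out = self.encoder(out)                        # H_out: (batch, seq, 512)
        # The paper does not state the pooling step; mean pooling is one option.
        return self.classifier(out.mean(dim=1))        # logits for Equation (9)
```

A BertTokenizer loaded from the same bert-base-uncased checkpoint would supply input_ids and attention_mask.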
4. Experiments and Results

4.1. Experimental Setting

For dataset partitioning, the dataset was first preprocessed and then split into training and validation sets in a 7:3 ratio. The study used the pretrained BERT version bert-base-uncased to extract deep semantic features. The batch size was set to 8, the maximum input length of the BERT encoder to 512, the learning rate to 2e-5, and the random seed to 42. The BiLSTM had a hidden-layer dimension of 256 and 2 layers. During the training phase, an RTX 4070 GPU was used, with a total of 50 epochs, and the Adam optimizer was employed to update the model weights.

In the testing phase, the format of the test data is {"id": "iixcWBmKWQqLAwVXxXGBGg", "text1": "...", "text2": "..."}. The model scores each pair of input texts ("text1" and "text2") separately to determine which text is more likely human-written. The process is as follows. First, both texts are cleaned and formatted separately and then encoded using the BERT tokenizer. Next, additional text features such as lexical diversity, average sentence length, and average word length are extracted using the spaCy model. The processed texts and extracted features are input into the model, which outputs the probability that each text is machine-generated. The human-written probability for each text is calculated as 1 minus the machine-generated probability. By comparing these probabilities, the text with the higher human-written confidence is selected as the human-written text, and this confidence score is output as the result.
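The pairwise scoring just described can be sketched as follows, reusing the clean_text and spacy_features helpers from the earlier snippets; predict_pair is our illustrative name, and the exact output format expected by the PAN evaluator is not reproduced here.

```python
import torch

def predict_pair(model, tokenizer, text1: str, text2: str) -> tuple[int, float]:
    """Return (index of the text judged human-written, its confidence),
    following the comparison rule described in Section 4.1."""
    model.eval()
    human_probs = []
    for text in (text1, text2):
        cleaned = clean_text(text)                    # Section 3.1 cleaning
        feats = torch.tensor([spacy_features(cleaned)], dtype=torch.float)
        enc = tokenizer(cleaned, truncation=True, max_length=512,
                        padding="max_length", return_tensors="pt")
        with torch.no_grad():
            logits = model(enc["input_ids"], enc["attention_mask"], feats)
        p_machine = torch.softmax(logits, dim=-1)[0, 1].item()
        human_probs.append(1.0 - p_machine)           # P(human) = 1 - P(machine)
    # The text with the higher human-written confidence is selected,
    # and that confidence is the output score.
    winner = 1 if human_probs[0] >= human_probs[1] else 2
    return winner, max(human_probs)
```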
4.2. Results

To evaluate the performance of our proposed model, we used the evaluation platform provided by PAN, which includes the following metrics:

• ROC-AUC: measures the model's ability to distinguish between positive and negative samples; higher values are better.
• c@1: assesses the classifier's ability to handle uncertainty while maintaining high accuracy; higher values are better.
• F0.5u: a variant of the F-score that places more emphasis on precision, suitable for reducing false positives; higher values are better.
• F1: the harmonic mean of precision and recall, used to evaluate the overall performance of a classification model; higher values are better.
• Brier: based on the Brier score, which measures the error between predicted probabilities and actual outcomes; the platform reports its complement, so higher values are better (cf. Table 1).
• Mean: the average of the above metrics, used to comprehensively assess the overall performance of the model.

We uploaded a software submission named "merciless-lease" to TIRA [20], which evaluates the detection method proposed in this paper. Table 1 shows the comparison of our method with five baseline methods on the evaluation metrics. As can be seen from the table, the model proposed in this paper performs excellently on multiple metrics, especially ROC-AUC, Brier, F1, F0.5u, and Mean. Notably, the ROC-AUC reaches 0.994, significantly surpassing the other models, indicating an outstanding ability to distinguish between positive and negative samples. The Mean metric also demonstrates the efficiency and reliability of our model, reflecting its excellent overall performance across the different evaluation dimensions. However, on the c@1 metric our model is slightly lower than Binoculars, which may be because Binoculars is more refined in handling high-probability samples.

These evaluation results reflect the advantages of our model in feature extraction and processing. Our method significantly improves the accuracy and overall performance of detecting machine-generated text by combining the deep semantic features of BERT, the additional feature analysis of spaCy, and the feature processing of BiLSTM and Transformer encoders. In contrast, the five baseline models are relatively simple in feature extraction and processing, lacking the capture of complex semantics and contextual relationships, resulting in poorer performance when detecting complex texts.

Table 1
Comparison of Different Methods on Various Evaluation Metrics

Approach                            ROC-AUC  Brier  c@1    F1     F0.5u  Mean
BLGAV (ours)                        0.994    0.975  0.963  0.963  0.962  0.971
Baseline Binoculars                 0.972    0.957  0.966  0.964  0.965  0.965
Baseline Fast-DetectGPT (Mistral)   0.876    0.800  0.886  0.883  0.883  0.866
Baseline PPMd                       0.795    0.798  0.754  0.753  0.749  0.770
Baseline Unmasking                  0.697    0.774  0.691  0.658  0.666  0.697
Baseline Fast-DetectGPT             0.668    0.776  0.695  0.690  0.691  0.704

5. Conclusion

This paper proposes a BERT and BiLSTM based Generative AI Authorship Verification model, which significantly enhances the ability to distinguish between machine-generated and human-written texts by combining Transformer encoders and multi-feature fusion techniques. Specifically, the model first uses a pretrained BERT to extract deep features of the text and integrates additional text features calculated with the spaCy model, such as lexical diversity and average sentence length, to enhance its discriminative ability. Subsequently, these features are further processed using BiLSTM and Transformer encoders, and finally classification is performed through a fully connected layer. Experimental results show that the model achieved a Mean score of 0.971 on the official validation dataset provided by the PAN laboratory, surpassing all baseline models.

In future work, we will further optimize the model by introducing more effective features, compressing long texts, and exploring other methods to improve system performance and detection accuracy. We believe that with continuous improvement, this model will play a greater role in the field of machine-generated text detection.

Acknowledgements

This work was supported by grants from the Guangdong-Foshan Joint Fund Project (No. 2022A1515140096) and the Open Fund of the Key Laboratory of Food Intelligent Manufacturing in Guangdong Province (No. GPKLIFM-KF-202305).

References

[1] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Advances in Neural Information Processing Systems, 2020, 33: 1877-1901.
[2] Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9-32.
[3] Touvron H, Martin L, Stone K, et al. Llama 2: Open foundation and fine-tuned chat models[J]. arXiv preprint arXiv:2307.09288, 2023.
[4] Lao Q, Ma L, Yang W, et al. Style change detection based on BERT and Conv1d[C]//CLEF (Working Notes). 2022: 2554-2559.
[5] Kolasani S. Optimizing natural language processing, large language models (LLMs) for efficient customer service, and hyper-personalization to enable sustainable growth and revenue[J]. Transactions on Latest Trends in Artificial Intelligence, 2023, 4(4).
[6] Baidoo-Anu D, Ansah L O. Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning[J]. Journal of AI, 2023, 7(1): 52-62.
[7] Bevendorff J, Casals X B, Chulvi B, et al. Overview of PAN 2024: Multi-author writing style analysis, multilingual text detoxification, oppositional thinking analysis, and generative AI authorship verification[C]//Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science. Springer, 2024.
[8] Mutlu B, Sezer E A. Enhanced sentence representation for extractive text summarization: Investigating the syntactic and semantic features and their contribution to sentence scoring[J]. Expert Systems with Applications, 2023, 227: 120302.
[9] Wang T, Chen L C, Genc Y. A dictionary-based method for detecting machine-generated domains[J]. Information Security Journal: A Global Perspective, 2021, 30(4): 205-218.
[10] Gehrmann S, Strobelt H, Rush A M. GLTR: Statistical detection and visualization of generated text[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2019: 1112-1123.
[11] Yang Z, Ma L, Yang W, et al. A Intelligent Detection Method for Irony and Stereotype Based on Hybird Neural Networks[C]//CLEF (Working Notes). 2022: 2708-2713.
[12] Lavergne T, Urvoy T, Yvon F. Detecting fake content with relative entropy scoring[J]. PAN, 2008, 8(27-31): 4.
[13] Badaskar S, Agarwal S, Arora S. Identifying real or fake articles: Towards better language modeling[C]//Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II. 2008.
[14] Solaiman I, Brundage M, Clark J, et al. Release strategies and the social impacts of language models[J]. arXiv preprint arXiv:1908.09203, 2019.
[15] Mitchell E, Lee Y, Khazatsky A, et al. DetectGPT: Zero-shot machine-generated text detection using probability curvature[C]//International Conference on Machine Learning. PMLR, 2023: 24950-24962.
[16] Solaiman I, Clark J, Brundage M. GPT-2: 1.5B release[J]. OpenAI. Available online at https://openai.com/blog/gpt-2-1-5b-release/, 2019.
[17] Guo B, Zhang X, Wang Z, et al. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection[J]. arXiv preprint arXiv:2301.07597, 2023.
[18] Bakhtin A, Gross S, Ott M, et al. Real or fake? Learning to discriminate machine from human generated text[J]. arXiv preprint arXiv:1906.03351, 2019.
[19] Uchendu A, Le T, Shu K, et al. Authorship attribution for neural text generation[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020: 8384-8395.
[20] Fröbe M, Wiegmann M, Kolyada N, et al. Continuous integration for reproducible shared tasks with TIRA.io[C]//Advances in Information Retrieval: 45th European Conference on Information Retrieval (ECIR 2023). Springer, 2023.