<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Model Fusion Approach for Generative AI Authorship Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rui Qin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoliang Qi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yusheng Yi</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This paper aims to outline our method for distinguishing between text generated by humans and generative AI models with a model fusion approach. Our approach consists of three steps: first, we extend the competition dataset of PAN at CLEF 2024 with an external dataset from a famous data science and machine learning competition platform Kaggle and apply the Levenshtein distance algorithm to correct misspelled words. Datasets of text pairs are then formed based on a shared theme and split into training, validation, and testing datasets. Second, we train a fine-tuned BERT as the base model and a BERT with the R-Drop method to mitigate the overfitting issue. Last, the two models are combined using an ensemble learning technique with a voting strategy. Our experimental results show that the fusion model achieves an ROC-AUC metric of 0.932, representing a 5.6% improvement over the baseline model Fast-DetectGPT (Mistral).</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2024</kwd>
        <kwd>Generative AI Authorship Verification</kwd>
        <kwd>BERT</kwd>
        <kwd>Ensemble Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our approach fine-tunes BERT for the task and additionally applies the R-Drop method, which regularizes
training by enforcing consistency between the outputs produced under different dropout masks. Finally, to
enhance the overall performance and robustness of our system, we generate the final score with
ensemble learning [7].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. BERT Model</title>
        <p>To enhance model performance, we train and fuse two models: a fine-tuned BERT-base model and a
BERT model fine-tuned with the R-Drop method.</p>
        <p>In this part, we introduce the BERT-base model for AI-generated text detection. The fine-tuning
process of the BERT model is illustrated in Figure 1. During model training for the classification task,
the two texts are first preprocessed and tokenized to generate the corresponding input embeddings;
the embedding sequences are then fed into the pre-trained BERT model to obtain contextual
representations, and features are extracted from the [CLS] token&#8217;s output for classification, ultimately
producing the prediction results. This targeted training is designed to refine the BERT-base model&#8217;s
ability to perform precise binary classification on our dataset. Fine-tuning on the dataset yields the
probability that text t1 was generated by an AI.</p>
        <p>[Figure 1: Fine-tuning BERT on a text pair (Text1, Text2) for binary classification.]</p>
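        <p>For concreteness, the following is a minimal sketch of pair classification with the Hugging Face
transformers library; the bert-base-uncased checkpoint and the helper name score_pair are illustrative
assumptions, not our exact training code.</p>
        <preformat>
# Minimal sketch of BERT pair classification (assumes the Hugging Face
# `transformers` library; the checkpoint name is an assumption).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def score_pair(text1: str, text2: str) -> float:
    """Probability that text1 (rather than text2) is the AI-generated one."""
    # The pair is packed as [CLS] text1 [SEP] text2 [SEP]; classification
    # uses the [CLS] token's contextual representation.
    inputs = tokenizer(text1, text2, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
</preformat>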
      </sec>
      <sec id="sec-2-2">
        <title>2.2. BERT Model With R-Drop</title>
        <p>We introduce the R-Drop method into the training process, as illustrated in Figure 2. We employ an
encoder architecture that consists of two separate towers to handle the feature encoding for t1 and t2. By
integrating these features, we aim to capture the nuances of the text pairs more effectively. The training
process begins with dropout sampling: the same input is passed through the model under two different
dropout masks, producing two prediction distributions, denoted P1 and P2. We then utilize the KL
divergence loss [8], &#8466;_KL, to guide the optimization process, allowing us to update the model parameters
in a way that minimizes the divergence between the two predicted distributions. Ultimately, this leads to
the generation of the final output, which is a refined representation of the text pairs&#8217; relationship as
understood by the model. The loss function encompasses both the cross-entropy loss and the KL
divergence loss. The cross-entropy loss continues to drive the model towards accurate classification [9],
while the KL divergence loss introduces a regularization effect. By training the model to minimize the
KL divergence between the outputs it produces for the same text pair under different dropout masks,
we increase the consistency and robustness of the model&#8217;s predictions.</p>
        <p>[Figure 2: Two-tower architecture with R-Drop. The labeled text1 and text2 are encoded by BERT
Encoder1 (Feature1) and BERT Encoder2 (Feature2), the features are fused across N layers, and an FC
layer (classifier) produces the label.]</p>
        <p>KL Divergence Loss Formula:</p>
        <p>&#8466;_KL = (1/n) &#931;_{i=1}^{n} KL(P1 || P2)    (1)</p>
        <p>where P1 and P2 are different prediction probability distributions for the same input.</p>
        <p>Total Loss Formula:</p>
        <p>&#8466;_total = &#8466;_CE + &#945; &#183; &#8466;_KL    (2)</p>
        <p>where &#8466;_CE is the cross-entropy loss, a common loss function used for classification tasks, and &#945; is a
hyperparameter that balances its weight against the KL divergence loss. We dynamically adjust the
weight of &#8466;_KL by using a decay strategy, analogous to learning rate decay, which gradually decreases &#945;
as training progresses.</p>
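        <p>As an illustration of Eqs. (1)-(2), the following is a minimal PyTorch sketch of the R-Drop loss;
the function name and the symmetric form of the KL term follow the R-Drop formulation [9] and are
assumptions about the implementation rather than our exact code.</p>
        <preformat>
# Minimal sketch of the R-Drop objective from Eqs. (1)-(2), assuming PyTorch.
import torch.nn.functional as F

def r_drop_loss(model, inputs, labels, alpha):
    # Two forward passes over the same batch; dropout randomness yields two
    # prediction distributions P1 and P2.
    logits1 = model(**inputs).logits
    logits2 = model(**inputs).logits

    # Cross-entropy term L_CE, averaged over the two passes.
    ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))

    # Symmetric KL term L_KL between the two predicted distributions.
    log_p1 = F.log_softmax(logits1, dim=-1)
    log_p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(log_p1, log_p2, log_target=True, reduction="batchmean")
                + F.kl_div(log_p2, log_p1, log_target=True, reduction="batchmean"))

    # Total loss L_total = L_CE + alpha * L_KL (Eq. 2).
    return ce + alpha * kl
</preformat>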
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Ensemble Learning</title>
        <p>In essence, while the fine-tuned BERT model is more conventional and focuses solely on classification
accuracy, the BERT model trained with the R-Drop method seeks to enhance performance by reducing
overfitting and promoting stable, reliable predictions across varying inputs. The incorporation of the
R-Drop method significantly enhances the model&#8217;s classification accuracy while also bolstering its
generalization capabilities [10]. Considering the above, we choose to fuse these two models through
ensemble learning [11].</p>
        <p>For each sample in the test set, we use the two trained models (one trained without the R-Drop
method, the other with it) to make predictions, resulting in two probability scores, P_no-R-Drop and
P_R-Drop. Each model evaluates the test data and outputs a probability score, and we compute the
average probability score for each sample:</p>
        <p>P_average = (P_no-R-Drop + P_R-Drop) / 2    (3)</p>
        <p>The final classification is then determined from the average probability score. For binary
classification, a threshold (typically 0.5) is used:</p>
        <p>Final Prediction = 1 if P_average &#8805; 0.5, 0 if P_average &lt; 0.5    (4)</p>
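        <p>A direct transcription of Eqs. (3)-(4) into code (the function name is an illustrative choice):</p>
        <preformat>
# Voting fusion of the two models' probability scores, per Eqs. (3)-(4).
def fuse(p_no_rdrop: float, p_rdrop: float, threshold: float = 0.5) -> int:
    p_average = (p_no_rdrop + p_rdrop) / 2     # Eq. (3)
    return 1 if p_average >= threshold else 0  # Eq. (4)
</preformat>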
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>To execute this task, we leverage datasets from two sources: the competition dataset and an additional
external dataset. The competition dataset, provided by the PAN at CLEF 2024 organization, comprises
a collection of real and fake news articles. These articles encapsulate a variety of headlines from U.S.
news in 2021. The structure of the dataset is detailed in Table 1. The dataset encompasses 1087 distinct
topics. For each topic, it includes 1 human-written text file and 13 AI-generated text files. Each of these
13 files is generated by a different AI model, as enumerated in Table 2.</p>
        <p>In addition to the competition dataset, we introduce an external dataset, Kaggle&#8217;s DAIGT V2
Train Dataset1, to enhance the generalization of our model. The Kaggle dataset consists of 44,868 items:
58% of the data comes from the Persuade corpus, 5% comes from mistral7binstruct_v2, and the
remaining 42% was generated by other AI models or written by human authors. The data sources and
several topics from the Kaggle dataset are detailed in Table 4, and the fields of the dataset are shown
in Table 3.</p>
        <p>1The dataset is available at https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Construct Dataset</title>
          <p>Texts in both datasets are labeled with a topic. To fine-tune a model to distinguish AI-generated text,
we first construct a dataset of text pairs that share the same topic. In the competition dataset illustrated
in Table 1, the &#8217;id&#8217; field contains three parts: the name of the AI model, the name of the news item, and
the serial number of the article. First, we treat the name of the news item as the topic and collect all
texts with the same topic. Then, we form a sample text pair by pairing each human-written text with
each AI-generated text. In total, we construct 14,131 text pairs from the competition dataset. Since
the number of human-written texts is much smaller than the number of AI-generated texts (a ratio of
1 to 13), this imbalance is likely to lead to overfitting.</p>
          <p>To provide more data to the model, we collected the relevant DAIGT V2 Train Dataset from
the Kaggle competition platform, comprising 44,868 texts. We compose 15,235 text pairs from the Kaggle
dataset with the same method as above.</p>
          <p>Therefore, the two datasets from Kaggle and the competition yield 29,366 text pairs in total. We divide
these pairs into training, validation, and test sets at a ratio of 60%, 30%, and 10%, respectively.</p>
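          <p>The pairing and splitting procedure can be sketched as follows; the record fields "topic",
"is_human", and "text" are illustrative assumptions about our preprocessed records, not the datasets&#8217;
actual schemas.</p>
          <preformat>
# Sketch: pair every human-written text with every AI-generated text that
# shares its topic, then split 60/30/10 into train/validation/test.
import random
from itertools import product

def build_pairs(records):
    # Group texts by topic, separating human-written from AI-generated.
    by_topic = {}
    for r in records:
        bucket = by_topic.setdefault(r["topic"], {"human": [], "ai": []})
        bucket["human" if r["is_human"] else "ai"].append(r["text"])

    # One sample pair per (human text, AI text) combination within a topic.
    pairs = []
    for group in by_topic.values():
        pairs.extend(product(group["human"], group["ai"]))

    random.shuffle(pairs)
    n = len(pairs)
    return (pairs[: int(0.6 * n)],              # training set (60%)
            pairs[int(0.6 * n): int(0.9 * n)],  # validation set (30%)
            pairs[int(0.9 * n):])               # test set (10%)
</preformat>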
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Clean Text Data</title>
          <p>In the collected dataset, it is inevitable that some words are spelled incorrectly. To solve this
problem, we use a Python library2 that builds a structure for quickly searching for words within a given
Levenshtein distance. The Levenshtein distance, also known as edit distance, measures the difference
between two strings. We then use the Levenshtein distance algorithm to correct misspelled words in the
dataset. This method balances the data between human and machine text as much as possible while
ensuring correct word spelling, restoring the text semantics and improving the quality of the dataset.
The specific implementation steps are as follows (a simplified sketch is shown below):
1. Search at distance 1, and when there is exactly one corrected word, apply the correction.
2. Count all suggested edits (insertions, deletions, replacements of specific letters) for all words within
distance 1.
3. Take the most frequent edit, and if its frequency is more than 6% of all words in the document,
search again for the Levenshtein distance but with a custom cost specification: the most frequent
edit costs 10% of the default edit cost. In this way, we fixed most of the "letter obfuscations"
mentioned in the discussions.</p>
          <p>2The source code is available at https://github.com/pkoz/leven-search.</p>
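          <p>As a simplified illustration of step 1 above (this is not the leven-search library&#8217;s actual API),
the following corrects a word only when exactly one vocabulary entry lies at Levenshtein distance 1:</p>
          <preformat>
# Simplified sketch of distance-1 spelling correction (step 1 above); the
# real pipeline uses the indexed lookup of the leven-search library.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # replacement
        prev = cur
    return prev[-1]

def correct(word: str, vocabulary: set) -> str:
    if word in vocabulary:
        return word
    candidates = [v for v in vocabulary if levenshtein(word, v) == 1]
    # Correct only when the candidate is unambiguous (step 1).
    return candidates[0] if len(candidates) == 1 else word
</preformat>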
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Setup</title>
        <p>To achieve optimal experimental results, we adhere to common experimental configurations as follows:
we employ the BERT-base model as our encoder, leveraging its 12-layer transformer architecture,
768-dimensional hidden size, and 12 attention heads. For hyperparameter optimization, we set the batch
size to 32 and the learning rate to 3 &#215; 10^-5. The model is trained for 10 epochs with a maximum input
sequence length capped at 512 tokens. This configuration ensures a balance between computational
efficiency and model performance, laying a solid experimental foundation for our study.</p>
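        <p>For reference, these hyperparameters map onto a Hugging Face TrainingArguments configuration
roughly as follows; the output directory name is an illustrative assumption.</p>
        <preformat>
# Sketch of the training configuration from Section 3.2, assuming the
# Hugging Face Trainer API.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-authorship-verification",  # illustrative name
    per_device_train_batch_size=32,             # batch size 32
    learning_rate=3e-5,                         # learning rate 3 x 10^-5
    num_train_epochs=10,                        # 10 epochs
)
</preformat>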
        <p>Additionally, to incorporate the R-Drop method, we introduce the following parameter settings:
&#8226; R-Drop hyperparameter &#945;: this hyperparameter controls the weight of the KL divergence loss.
&#8226; Dropout rate: for the R-Drop method, we no longer use a traditional fixed dropout rate but
employ the R-Drop method&#8217;s dynamic dropout strategy, which adjusts the dropout rate during
training. Specifically, we set the initial dropout rate to 0.5 to enable the R-Drop method.</p>
        <p>By utilizing the above formula and incorporating the KL divergence loss into the loss function, we
aim to reduce the distributional discrepancy between different dropout instances, thereby mitigating
overfitting:</p>
        <p>Loss = BinaryCrossEntropy(y, &#375;) + &#945; &#183; &#8466;_KL</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Result</title>
        <p>The results are shown in Table 6. Our fine-tuned BERT-base and BERT-base with the R-Drop method
achieved ROC-AUC values of 0.891 and 0.91, respectively, demonstrating some competitiveness. It is
worth noting that the fusion of the BERT model and the BERT model with R-Drop outperformed
both individually, scoring higher across all metrics, with an average score of 0.913. In particular, the
ROC-AUC value reached 0.932, which falls between the 75th percentile and the median of all participants,
surpassing the Fast-DetectGPT (Mistral) baseline but falling short of the Binoculars baseline.</p>
        <p>From the perspective of the fused model, there was some improvement, though not an entirely
satisfactory one. Upon reviewing our scores across the various datasets, we notice particularly low scores
on two German-related datasets, which drag down our overall performance. We speculate this might be
due to the limited size of our dataset, or perhaps to flaws introduced during training with text pairs;
the exact reasons remain uncertain. Moving forward, we plan to conduct further experiments to validate
these hypotheses and identify the underlying causes.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The results demonstrate that fine-tuning the BERT-base alone yields lower performance compared to
the model enhanced with the R-Drop method. Furthermore, the ensemble voting approach, combining
the outputs of both models, surpasses the performance of either individual model, highlighting the
effectiveness of our proposed method. The integration of the R-Drop method with the BERT-base
and the subsequent ensemble learning strategy improves classification performance. This approach
demonstrates the potential for combining regularization techniques and ensemble methods to enhance
model robustness and accuracy in text classification tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (No. 62276064).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-N.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>The science of detecting llm-generated text</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>67</volume>
          (
          <year>2024</year>
          )
          <fpage>50</fpage>
          -
          <lpage>59</lpage>
          . URL: https://doi.org/10.1145/3624725. doi:10.1145/3624725.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sadasivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , et al.,
          <article-title>Can ai-generated text be reliably detected?</article-title>
          ,
          <source>arXiv preprint arXiv:2303.11156</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the &#8220;Voight-Kampff&#8221; Generative AI Authorship Verification Task at PAN and ELOQUENT 2024</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          , et al.,
          <article-title>Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification</article-title>
          , in: L. Goeuriot, P. Mulhem, G. Qu&#233;not, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galu&#353;&#269;&#225;kov&#225;, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the Voight-Kampff Generative AI Authorship Verification Task at PAN 2024</article-title>
          , in:
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          , et al.,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          , in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.),
          <source>Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023)</source>
          , Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236&#8211;241. doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] X. Dong, Z. Yu, W. Cao, et al., A survey on ensemble learning, Frontiers of Computer Science 14 (2020) 241&#8211;258.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Kim, J. Oh, N. Kim, et al., Comparing Kullback-Leibler divergence and mean squared error loss in knowledge distillation, arXiv preprint arXiv:2105.08919 (2021).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] X. Liang, J. Li, Y. Wang, et al., R-Drop: Regularized dropout for neural networks, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, et al. (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 10890&#8211;10905. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/5a66b9200f29ac3fa0ae244cc2a51b39-Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] R. Polikar, Ensemble learning, Ensemble Machine Learning: Methods and Applications (2012) 1&#8211;34.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. G. Dietterich, et al., Ensemble learning, The Handbook of Brain Theory and Neural Networks 2 (2002) 110&#8211;125.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>