Team Chen at PAN: Integrating R-Drop and Pre-trained Language Model for Multi-author Writing Style Analysis

Notebook for the PAN Lab at CLEF 2024

Zhaotian Chen, Yong Han* and Yusheng Yi
Foshan University, Foshan, China

Abstract
This paper presents our experiments in the PAN Multi-Author Writing Style Analysis task at CLEF 2024. The task is divided into three increasingly difficult subtasks according to the topical consistency between paragraphs: an easy level, where style changes are detected between paragraphs covering a variety of topics; a medium level, where topical diversity is small, forcing methods to focus more on style; and a hard level, where subtle style differences must be identified between paragraphs on the same topic. The task therefore requires not only distinguishing different topics but also capturing changes in writing style within a single topic. To address it, we select the pre-trained language model RoBERTa as our foundation model and fine-tune it to detect the styles and topics of texts. Additionally, we employ R-Drop regularization to reduce overfitting during fine-tuning, thereby enhancing the model's generalization to unseen texts. Experimental results show that our model achieved F1 scores of 0.968, 0.822, and 0.807 on the test sets of the three difficulty levels, respectively.

Keywords
PAN 2024, Multi-Author Writing Style Analysis, Regularization, R-Drop, Pre-trained Language Model

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
Email: 1353663548z@gmail.com (Z. Chen); hanyong2005@fosu.edu.cn (Y. Han); yiys@fosu.edu.cn (Y. Yi)
ORCID: 0009-0008-3734-7442 (Z. Chen); 0000-0002-9416-2398 (Y. Han); 0009-0006-7098-3681 (Y. Yi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The task of multi-author writing style analysis aims to find all positions of writing style change at the paragraph level in a given multi-author document [1]. Detecting style changes can thus assist in identifying the current author, verifying claimed authorship, and detecting the risk of plagiarism in documents. In particular, when no comparison text is available, detecting style changes becomes the sole method for identifying plagiarism in a document. In recent years, PAN [2] has organized a series of tasks on detecting writing style changes in text, ranging from determining the actual number of authors [3] and identifying style changes between two consecutive paragraphs [4, 5] to detecting style changes at the sentence level [6]. This year, the Style Change Detection task focuses on paragraphs: writing style changes are to be detected at every pair of consecutive paragraphs in a given text.

2. Related Work

Large-scale pre-trained models such as BERT [7] and RoBERTa [8] often contain millions or even billions of parameters. Although larger models tend to exhibit better performance, they are highly susceptible to overfitting. While fine-tuning the pre-trained model for the style change detection task, we observed that although the training loss decreased continuously, the F1 score on the validation set remained unsatisfactory: the training loss was steadily declining while the validation loss was progressively increasing. To address this issue, researchers have proposed various regularization methods, including weight decay [9, 10, 11], dropout [12, 13, 14, 15], normalization [16, 17, 18], noise injection [19], layer-wise pre-training and initialization [20, 21], and label smoothing [22].
Among these methods, dropout and its variants have garnered significant attention due to their effectiveness and compatibility with other regularization techniques. Dropout enhances generalization by inhibiting the co-adaptation of neurons and implicitly creating an ensemble of multiple sub-models. R-Drop regularization [23], a variant of dropout, differs from the conventional dropout strategy in that it enforces consistent predictions from the differently dropout-masked sub-models that arise during training. To this end, R-Drop minimizes the Kullback-Leibler divergence between the outputs of any two sub-models sampled with different dropout masks. By effectively reducing the freedom of the model parameters, the method improves the model's generalization ability and lowers the risk of overfitting, thereby enhancing the stability and consistency of inference.

3. Dataset

The task provides three datasets of varying difficulty, categorized by the diversity and consistency of topics within the documents. Each dataset corresponds to one subtask.

• Easy: The paragraphs of a document cover a variety of topics, allowing approaches to make use of topic information to detect authorship changes.
• Medium: The topical variety in a document is small (though still present), forcing the approaches to focus more on style to effectively solve the detection task.
• Hard: All paragraphs in a document are on the same topic.

Each dataset is divided into three parts: a training set, a validation set, and a test set. The training and validation sets include ground truth data, while the test set does not. Table 1 provides statistics for the datasets. Note that "Samples" refers to data units composed of two consecutive paragraphs of a document, used to analyze whether there is a style change between the two paragraphs. For details on how the samples were constructed, refer to Section 4.1 and the sketch below.

Table 1
Dataset statistics. "Dataset1", "Dataset2", and "Dataset3" correspond to the easy, medium, and hard subtasks, respectively.

                  Dataset1              Dataset2              Dataset3
Datasets          Documents  Samples    Documents  Samples    Documents  Samples
Training set      4200       11065      4200       21914      4200       19014
Validation set    900        2468       900        4590       900        4132
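As an illustration of what one such sample unit looks like, the following minimal sketch pairs consecutive paragraphs with binary change labels. It assumes a PAN-style data layout, i.e., one paragraph per line in each document file and a ground-truth JSON file whose "changes" list marks, for each pair of consecutive paragraphs, whether the author changes; the file names and the helper function are hypothetical.

```python
import json
from pathlib import Path

def build_samples(doc_path: str, truth_path: str):
    """Pair consecutive paragraphs with a binary style-change label.

    Assumes one paragraph per line and a truth file whose "changes"
    list's i-th entry refers to paragraphs i and i+1.
    """
    text = Path(doc_path).read_text(encoding="utf-8")
    paragraphs = [p for p in text.splitlines() if p.strip()]
    changes = json.loads(Path(truth_path).read_text(encoding="utf-8"))["changes"]
    # One sample = (paragraph i, paragraph i+1, change label in {0, 1}).
    return [(paragraphs[i], paragraphs[i + 1], int(c)) for i, c in enumerate(changes)]

# Example (hypothetical file names): build_samples("problem-1.txt", "truth-problem-1.json")
```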
4. Methodology

The methodology presented in this paper encompasses three primary steps: 1) data preparation, 2) R-Drop regularization, and 3) model fine-tuning. It is built around the goal of attaining high precision and recall when classifying unseen data. To accomplish this, a pre-trained language model is fine-tuned for the specific downstream task, and R-Drop regularization is employed to enhance the model's generalization capabilities.

4.1. Data Preparation

To create the samples, we first marked the junction between each two consecutive paragraphs in a document with a delimiter. We then assigned a binary label indicating whether there is a style change between the two paragraphs. This transforms the task into a binary classification problem. To prepare the samples for fine-tuning the pre-trained RoBERTa model, we adopted the corresponding RoBERTa tokenizer. RoBERTa limits the maximum input sequence length, typically to 512 tokens. Upon analyzing the dataset, we found that only a few samples exceed this limit, so we opted for a truncation strategy for samples longer than the maximum input sequence length.
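The sketch below shows how a paragraph pair could be encoded along these lines, assuming the Hugging Face transformers tokenizer API; the tokenizer places its separator tokens at the junction between the two paragraphs and truncates to the 512-token limit. The roberta-base checkpoint is an assumption matching the 12-layer configuration of Section 5.1, as the paper does not name the exact checkpoint.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def encode_pair(paragraph_a: str, paragraph_b: str):
    # Encode the two paragraphs as a single sequence pair; the tokenizer's
    # special tokens delimit the junction, and inputs longer than RoBERTa's
    # 512-token limit are truncated.
    return tokenizer(
        paragraph_a,
        paragraph_b,
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors="pt",
    )
```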
4.2. R-Drop Regularization

Dropout randomly drops a fraction of the units in each layer of a neural network to avoid co-adaptation and overfitting. In addition, dropout approximately and efficiently combines exponentially many different neural network architectures, and such model combination generally improves performance. Despite its simplicity and efficacy, however, dropout introduces a significant inconsistency between the training and inference phases, which can impede model performance. To address this issue, the R-Drop regularization term is incorporated into training to keep the model's predictions for identical inputs consistent across different dropout masks, thereby regulating the inconsistency that dropout introduces. Specifically, for each training batch, two forward passes with distinct dropout masks are performed on the same data, and the Kullback-Leibler (KL) divergence, a widely used measure of the disparity between two probability distributions, is computed between the two predictions.

Given an input $x_i$, let $P_{w_1}(y_i \mid x_i)$ and $P_{w_2}(y_i \mid x_i)$ denote the probability distributions over $y_i$ predicted by the model under the two parameter settings $w_1$ and $w_2$ induced by different dropout masks. The KL divergence between $P_{w_1}(y_i \mid x_i)$ and $P_{w_2}(y_i \mid x_i)$ is given by:

$$D_{\mathrm{KL}}\big(P_{w_1}(y_i \mid x_i) \,\|\, P_{w_2}(y_i \mid x_i)\big) = \sum_{y_i} P_{w_1}(y_i \mid x_i) \log \frac{P_{w_1}(y_i \mid x_i)}{P_{w_2}(y_i \mid x_i)} \tag{1}$$

To incorporate this discrepancy into the training process, R-Drop adds the computed KL divergence to the loss function as a regularization term, with a coefficient $\alpha$ controlling its weight. The resulting loss function is given in Equation (2), where $-\log P_\theta^{(1)}(y_i \mid x_i)$ and $-\log P_\theta^{(2)}(y_i \mid x_i)$ are the negative log probabilities of correctly predicting the label $y_i$ given the input $x_i$, obtained from the two sub-models generated by applying different dropout masks to the same network:

$$\mathcal{L} = -\log P_\theta^{(1)}(y_i \mid x_i) - \log P_\theta^{(2)}(y_i \mid x_i) + \alpha\, D_{\mathrm{KL}}\big(P_\theta^{(1)}(y_i \mid x_i) \,\|\, P_\theta^{(2)}(y_i \mid x_i)\big) \tag{2}$$

4.3. Model Fine-Tuning

For this task, the pre-trained RoBERTa model was chosen as the base model in order to leverage the rich linguistic representations it has learned from a large corpus. To reduce the risk of overfitting, a dropout layer was introduced after the output of the RoBERTa model. Dropout is a widely used regularization technique that randomly discards neurons to reduce model complexity and enhance generalization. During fine-tuning, R-Drop regularization was employed to further improve the model's generalization capability. To adapt the pre-trained model to the style change detection task, a fully connected linear output layer was added on top of the model. This output layer uses the softmax activation function to produce a probability distribution over the two classes, enabling the model to learn to classify whether there is a style change between consecutive paragraphs.

Algorithm 1 details how R-Drop is integrated into the fine-tuning process; a code sketch of the core step follows the algorithm. For each training batch, two forward passes with different dropout masks are performed on the same data, and the KL divergence between the two predictions is computed. In this way, the model takes full advantage of the language representation capabilities of RoBERTa while mitigating overfitting, enabling it to identify and classify style changes in unseen texts and enhancing overall task performance.

Algorithm 1: Fine-tuning with Integrated R-Drop
Require: Training dataset D = {(x_i, y_i)}
Require: Neural network model M with parameters θ
Require: Number of training epochs E
Require: Learning rate η
Require: Balance factor α for the KL divergence loss
Ensure: Trained model M with optimized parameters θ*
 1: Initialize model parameters θ randomly or with pre-trained weights
 2: Initialize optimizer with learning rate η
 3: for epoch e = 1 to E do
 4:   for each batch (X_b, Y_b) in D do
 5:     Forward pass the model M twice with X_b to obtain two sets of outputs O_1 and O_2
 6:     Compute cross-entropy losses: CE_1 = CE(Y_b, O_1), CE_2 = CE(Y_b, O_2)
 7:     Compute KL divergence loss: KL = α · KL(O_1 ‖ O_2) + α · KL(O_2 ‖ O_1)
 8:     Compute total loss: L = CE_1 + CE_2 + KL
 9:     Backpropagate the total loss L to update model parameters θ
10:   end for
11: end for
12: return trained model M with optimized parameters θ*
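A minimal PyTorch sketch of one training step (lines 5–8 of Algorithm 1) is given below. It assumes a Hugging Face-style classifier that returns logits (e.g., a RoBERTa encoder with the linear head described above); the function name and batch format are illustrative, not from the paper. The model must be in training mode so that the two forward passes use different dropout masks.

```python
import torch
import torch.nn.functional as F

def rdrop_step(model, batch, labels, alpha=5.0):
    """One R-Drop training step; alpha=5 follows Section 5.1."""
    model.train()  # dropout must be active for the two passes to differ
    logits1 = model(**batch).logits  # first forward pass (dropout mask 1)
    logits2 = model(**batch).logits  # second forward pass (dropout mask 2)

    # Cross-entropy on both passes (line 6 of Algorithm 1).
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)

    # Symmetric KL between the two predictive distributions (line 7).
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = F.kl_div(logp1, logp2, reduction="batchmean", log_target=True) \
       + F.kl_div(logp2, logp1, reduction="batchmean", log_target=True)

    return ce + alpha * kl  # total loss (line 8)
```

The returned loss is then backpropagated and the optimizer stepped, as in line 9 of the algorithm.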
5. Experiments

5.1. Experimental settings

We used the RoBERTa model comprising 12 transformer layers, 768 hidden units, and 12 attention heads. The hyperparameters are set as follows: maximum sequence length 512, learning rate 0.00001, batch size 32, 7 training epochs, and dropout rate 0.5. The coefficient weight α for R-Drop is set to 5. To evaluate the effectiveness of the model on each subtask, performance is assessed by computing the F1 score on the provided validation set; after running the experiments, the best-performing model for each subtask is selected.

5.2. Result

The best-performing model for each subtask was submitted to TIRA [24] for execution, yielding the final performance figures. Table 2 reports the F1 scores achieved by the model on the official test sets. Compared to the baseline approaches provided by PAN, which predict either no style change anywhere (all-0) or a style change everywhere (all-1), our method achieves an improvement of at least one-fold and of up to seven-fold.

Table 2
F1 scores for the multi-author writing style analysis task, i.e., detecting the positions at which the author changes, on Task 1, Task 2, and Task 3.

Approach             Task 1   Task 2   Task 3
Our Method           0.968    0.822    0.807
Baseline Predict 1   0.466    0.343    0.320
Baseline Predict 0   0.112    0.323    0.346

5.3. Ablation experiments

To demonstrate the effectiveness of R-Drop in this task, we conducted experiments with all other parts of the model unchanged and observed how the model's performance on the validation sets changes with and without the R-Drop method. Table 3 presents the experimental results.

Table 3
F1 score comparison on the validation datasets. The dropout rate is set to 0.5 wherever a dropout parameter is applicable.

                          Task 1   Task 2   Task 3
RoBERTa with R-Drop       0.975    0.833    0.815
RoBERTa with dropout      0.951    0.830    0.810
RoBERTa without dropout   0.948    0.822    0.782

[Figure 1: Effect of the dropout rate on Task 3 (F1 score as a function of the dropout rate).]

For Figure 1, we varied the dropout rate while keeping the other parameters unchanged, in order to optimize the model's performance on the most difficult dataset.

6. Conclusion

This paper has presented our work on the PAN 2024 multi-author writing style analysis task. We fine-tuned the RoBERTa model with the R-Drop regularization method to obtain the final results. This approach achieved promising outcomes across the three subtasks of varying difficulty, demonstrating its effectiveness for complex writing style analysis. An ablation study further validated the contribution of R-Drop regularization to preventing overfitting and improving model performance. However, the lack of an analysis of error cases limits our understanding of the model's limitations and potential for improvement. Future research should delve deeper into error-case analyses to identify the model's weaknesses and devise targeted solutions.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 62276064).

References

[1] E. Zangerle, M. Mayerl, M. Potthast, et al., Overview of the Multi-Author Writing Style Analysis Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[2] J. Bevendorff, X. B. Casals, B. Chulvi, et al., Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[3] E. Zangerle, M. Tschuggnall, G. Specht, et al., Overview of the Style Change Detection Task at PAN 2019, in: L. Cappellato, N. Ferro, D. Losada, H. Müller (Eds.), CLEF 2019 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2019. URL: http://ceur-ws.org/Vol-2380/.
[4] E. Zangerle, M. Mayerl, G. Specht, et al., Overview of the Style Change Detection Task at PAN 2020, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/.
[5] E. Zangerle, M. Mayerl, M. Potthast, et al., Overview of the Style Change Detection Task at PAN 2021, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[6] E. Zangerle, M. Mayerl, M. Potthast, et al., Overview of the Style Change Detection Task at PAN 2022, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), CLEF 2022 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2022. URL: http://ceur-ws.org/Vol-3180/paper-186.pdf.
[7] J. Devlin, M.-W. Chang, K. Lee, et al., BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019. URL: http://dx.doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
[8] Y. Liu, M. Ott, N. Goyal, et al., RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[9] A. Krogh, J. Hertz, A simple weight decay can improve generalization, in: Advances in Neural Information Processing Systems, 1991.
[10] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM (2017) 84–90. URL: http://dx.doi.org/10.1145/3065386. doi:10.1145/3065386.
[11] W. Wen, C. Wu, Y. Wang, et al., Learning structured sparsity in deep neural networks, in: Advances in Neural Information Processing Systems, 2016.
[12] G. Hinton, N. Srivastava, A. Krizhevsky, et al., Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580 (2012).
[13] L. Wan, M. Zeiler, S. Zhang, et al., Regularization of neural networks using DropConnect, in: International Conference on Machine Learning, 2013.
[14] J. Ba, B. Frey, Adaptive dropout for training deep neural networks, in: Advances in Neural Information Processing Systems, 2013.
[15] S. Wang, C. Manning, Fast dropout training, in: International Conference on Machine Learning, 2013.
[16] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).
[17] L. Huang, X. Liu, B. Liu, et al., Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks, in: National Conference on Artificial Intelligence, 2017.
[18] Y. Wu, K. He, Group normalization, International Journal of Computer Vision (2020) 742–755. URL: http://dx.doi.org/10.1007/s11263-019-01198-w. doi:10.1007/s11263-019-01198-w.
[19] S. Hochreiter, J. Schmidhuber, Simplifying neural nets by discovering flat minima, in: Advances in Neural Information Processing Systems, 1994.
[20] D. Erhan, P.-A. Manzagol, Y. Bengio, et al., The difficulty of training deep architectures and the effect of unsupervised pre-training, in: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, 2009.
[21] K. He, X. Zhang, S. Ren, et al., Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015. URL: http://dx.doi.org/10.1109/iccv.2015.123. doi:10.1109/iccv.2015.123.
[22] C. Szegedy, V. Vanhoucke, S. Ioffe, et al., Rethinking the inception architecture for computer vision, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. URL: http://dx.doi.org/10.1109/cvpr.2016.308. doi:10.1109/cvpr.2016.308.
[23] L. Wu, J. Li, Y. Wang, et al., R-Drop: Regularized dropout for neural networks, Advances in Neural Information Processing Systems 34 (2021) 10890–10905.
[24] M. Fröbe, M. Wiegmann, N. Kolyada, et al., Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.