Team Text Understanding and Analysis at PAN: Utilizing BERT Series Pre-training Model for Multi-Author Writing Style Analysis
Notebook for the PAN Lab at CLEF 2024

Yingzhou Huang, Leilei Kong
Foshan University, Foshan, China

Abstract
We propose a training method based on BERT-series pre-trained models. The method uses a sliding-window technique to preprocess the datasets for the multi-author writing style analysis task. Adjacent paragraphs are combined into a single training sample, which allows the model to effectively capture the characteristics of style changes between paragraphs. We conducted systematic training and evaluation on the three multi-author writing style analysis datasets of different difficulty levels (easy, medium, and hard) provided by the PAN organizers. We obtained F1 scores of 0.993, 0.831, and 0.825 on the respective test sets, demonstrating the effectiveness and robustness of the proposed method.

Keywords
Multi-author writing style analysis, BERT series, training sample

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
3545817606@daytoy.freeqiye.com (Y. Huang); kongleilei@fosu.edu.cn (L. Kong)
ORCID: 0009-0001-9239-3344 (Y. Huang); 0000-0002-4636-3507 (L. Kong)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction
The Multi-Author Writing Style Analysis task is designed to identify the positions at which authorship changes in multi-author documents. In practical applications, analyzing writing style can help determine an author's identity, verify whether a document has been tampered with, and detect suspected plagiarism [1][2][3].

Multi-author recognition research is mainly divided into two directions: traditional methods and deep learning-based methods [4]. Traditional methods usually rely on hand-crafted features, such as word frequency, sentence length, and punctuation, to measure text similarity. These methods work well in simple scenarios but struggle to handle more complex situations. Deep learning-based approaches use a neural network model to extract text representations and compute the similarity between them. These methods typically include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and attention mechanisms. In addition, some researchers use pre-trained language models [5], such as BERT [6] and GPT [7], to improve model performance. In general, deep learning-based approaches [4] have performed well in multi-author recognition studies and are receiving increasing attention.

In this work, we created data samples using a sliding window, and we encoded and classified the data using BERT-series models [6], such as BERT-base, DeBERTa, ALBERT, and RoBERTa.

2. Method
In this study, we utilized four different pre-trained language models [5]: DeBERTa-base [8], DeBERTa-v3-large [8], ALBERT-large-v2 [9], and RoBERTa-base [10]. Each of them was applied to training tasks of a specific difficulty level (easy, medium, hard). Our goal is to enhance the performance of the multi-author style detection task by leveraging the specific architectural advantages of these models. The reasons for choosing these models are as follows.

DeBERTa-base and DeBERTa-v3-large introduce efficient attention mechanisms and improved sentence encoding strategies, making them particularly suitable for tasks that require complex language understanding [8]. ALBERT-large-v2, through parameter-sharing techniques, maintains a large model capacity while reducing the number of parameters, making it suitable for processing large datasets and balancing performance and efficiency [9]. RoBERTa-base has shown excellent performance in a variety of natural language processing tasks thanks to dynamic masked language model training, and its robustness makes it a reliable choice for our experiments [10].

We designed targeted training strategies for each model to adapt to tasks of different difficulty levels. During training, we used precision, recall, and F1 score as the main evaluation metrics to ensure that the models achieve optimal performance in detecting changes in author style.
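To make the model setup concrete, the following is a minimal sketch (not the authors' released code) of how one of the BERT-series checkpoints could be loaded as a paragraph-pair classifier with the Hugging Face Transformers library. The checkpoint name microsoft/deberta-base and the two-label head are assumptions based on the description above.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed Hugging Face checkpoint name for DeBERTa-base; the other checkpoints
# (microsoft/deberta-v3-large, albert-large-v2, roberta-base) can be swapped in the same way.
MODEL_NAME = "microsoft/deberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Binary classification head: label 1 = style change between the paragraphs, label 0 = no change.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# A text pair is encoded as one sequence with model-specific separator tokens;
# truncation keeps the pair within the encoder length discussed in Section 3.3.
encoding = tokenizer(
    "First paragraph of the pair.",
    "Second, adjacent paragraph of the pair.",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
logits = model(**encoding).logits  # shape (1, 2): scores for "no change" vs. "change"
```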
3. Experiment
3.1. Dataset
The writing style change detection dataset provided by PAN@CLEF [1] includes three levels of difficulty:
1. Easy: The paragraphs of a document cover various topics, allowing methods that utilize topic information to detect changes in authorship.
2. Medium: There is little topical diversity within a document (although some is still present), forcing methods to focus more on style to effectively address the detection task.
3. Hard: All paragraphs in a document are related to the same topic.

In the dataset provided by PAN, the label information available to participants includes the number of authors in each document and labels indicating whether the writing style changes between consecutive paragraphs. We segmented the documents into natural paragraphs according to these labels and recalculated the dataset statistics. The results are shown in Table 1.

Table 1
Dataset size of the three tasks.

Task               | Easy Train | Easy Validation | Medium Train | Medium Validation | Hard Train | Hard Validation
num of documents   | 4200       | 900             | 4200         | 900               | 4200       | 900
num of text pairs  | 11065      | 2471            | 21913        | 4592              | 19015      | 4135

3.2. Data Processing
In this study, we first preprocessed the dataset. Initially, we performed paragraph segmentation: each document was split into natural paragraphs, and each paragraph was treated as an independent data item. Subsequently, we performed text pair extraction: using the sliding-window method, we constructed a text pair from every two adjacent paragraphs. Finally, we read the corresponding label information for each text pair from the JSON file, which is used for subsequent training and evaluation.
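As an illustration of this preprocessing step, the following is a minimal sketch. It assumes one PAN-style document file with one paragraph per line and a companion truth file whose "changes" field lists a 0/1 label for each pair of consecutive paragraphs; the file names and field name are assumptions for illustration, not taken from the paper.

```python
import json
from pathlib import Path

def build_text_pairs(doc_path: Path, truth_path: Path):
    """Slide a window of size two over the paragraphs of one document and attach
    the style-change label of each adjacent pair (1 = author change, 0 = no change)."""
    paragraphs = [p for p in doc_path.read_text(encoding="utf-8").split("\n") if p.strip()]
    truth = json.loads(truth_path.read_text(encoding="utf-8"))
    labels = truth["changes"]  # assumed field name for the per-boundary labels
    return [
        (paragraphs[i], paragraphs[i + 1], labels[i])
        for i in range(len(paragraphs) - 1)
    ]

# Hypothetical file names for one training problem of the "easy" task.
pairs = build_text_pairs(Path("easy/train/problem-1.txt"),
                         Path("easy/train/truth-problem-1.json"))
```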
3.3. Experiment setting
In this experiment, we chose CrossEntropyLoss as the loss function; it measures the error between the predicted and target probability distributions and, used together with the softmax activation function, is well suited to classification tasks. The optimizer was AdamW, an improved version of the Adam optimizer that handles weight decay properly, which helps reduce overfitting. The learning rate was set to 1e-5; a smaller learning rate helps the model train stably and converge well. The batch size was set to 2 and the number of iterations to 10. Smaller batch sizes and numbers of iterations help enhance the model's generalization ability and reduce overfitting. We set the maximum encoding length to 256 tokens per paragraph, so a text pair has a total length of up to 512 tokens; most text pairs contain fewer than 512 tokens. We set the dropout ratio to 0.1 to prevent overfitting during fine-tuning [11].

We then successively replaced the pre-trained models and their corresponding neural network frameworks to compare the performance of the trained models. By comparing accuracy on the validation set, we found that the model trained with DeBERTa [8] had the highest accuracy on the easy and hard datasets, while the model trained with RoBERTa [10] had the highest accuracy on the medium dataset.

In machine learning, the batch size is a crucial hyperparameter that can significantly affect the model's convergence rate and final performance. A larger batch size usually provides a more stable gradient estimate, but it also increases memory consumption and may trap the model in local optima. Based on these considerations, we increased the batch size tenfold, hoping to improve model performance through more stable gradient estimation. However, after a series of experiments, we found that increasing the batch size did not lead to the expected improvement. Specifically, after increasing the batch size, the model's accuracy on the validation set decreased by 10%, and the training time also increased. This indicates that, for the current task and the adopted model architecture, a larger batch size may not be the optimal choice. We therefore ultimately set the batch size to 2.
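To make this training configuration concrete, the following is a minimal fine-tuning sketch under the same assumptions as the earlier snippets (the microsoft/deberta-base checkpoint and the list of (paragraph_a, paragraph_b, label) tuples named pairs from the Section 3.2 sketch). It illustrates the loss, optimizer, and hyperparameters described above; it is not the authors' exact implementation.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same assumed checkpoint as in the Section 2 sketch; `pairs` comes from the Section 3.2 sketch.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-base", num_labels=2)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # learning rate from Section 3.3
loss_fn = torch.nn.CrossEntropyLoss()

def collate(batch):
    # batch: list of (paragraph_a, paragraph_b, label) tuples
    texts_a, texts_b, labels = zip(*batch)
    enc = tokenizer(list(texts_a), list(texts_b), truncation=True,
                    max_length=512, padding=True, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(10):  # 10 iterations over the data, batch size 2, as described above
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        labels = batch.pop("labels")
        logits = model(**batch).logits
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```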
3.4. Experiment results
We carried out four experiments: the BERT-series pre-trained models (DeBERTa-base, DeBERTa-v3-large, ALBERT-large-v2, and RoBERTa-base) were trained on the training set, and the model with the highest accuracy on the validation set was selected for hyperparameter fine-tuning. We then submitted, via the TIRA platform [3], the model with the highest F1 score on the validation set of each difficulty level. The main experiment results are shown in Table 2.

Table 2
Overview of the F1 scores for the multi-author writing style analysis task, i.e., detecting at which positions the author changes, for Task 1, Task 2, and Task 3.

Approach            | Task 1 | Task 2 | Task 3
cloying-tournament  | 0.991  | 0.815  | 0.818
Baseline Predict 1  | 0.466  | 0.343  | 0.320
Baseline Predict 0  | 0.112  | 0.323  | 0.346

We also compared the performance of the different BERT-based models, namely DeBERTa-base, DeBERTa-v3-large, ALBERT-large-v2, and RoBERTa-base. The accuracy results are shown in Table 3, where Easy, Medium, and Hard denote the datasets of different difficulty levels.

Table 3
Validation accuracy of the different BERT-based models.

Model             | Easy  | Medium | Hard
deberta-base      | 99.8% | 85.4%  | 82.1%
deberta-v3-large  | 95.0% | 78.7%  | 62.7%
albert-large-v2   | 99.3% | 78.6%  | 82.1%
roberta-base      | 99.7% | 86.2%  | 74.7%

4. Conclusion
In this study, we developed a system based on the DeBERTa-base and RoBERTa-base models that effectively handles the multi-author writing style analysis task using a sliding-window technique. By applying sliding windows to text sequences, our method captures local context information and extracts stylistic changes between paragraphs. On the three datasets of different difficulty levels provided by the PAN organizers, our approach achieves strong F1 scores, demonstrating its effectiveness and robustness.

Despite these positive results, some limitations remain to be explored. In our experiments, increasing the batch size did not improve performance but instead led to a decrease in accuracy. This suggests that batch size has a significant impact on model performance, but its optimal value may depend on the specific task and model architecture and requires more in-depth research to determine. We are also interested in experimenting with emerging pre-trained model architectures to explore whether they can bring further performance improvements for multi-author style detection tasks.

Acknowledgments
This research was supported by the National Social Science Foundation of China (22BTQ101).

References
[1] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[2] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Multi-Author Writing Style Analysis Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[3] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.
[4] I. H. Sarker, Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions, SN Computer Science 2 (2021) 420.
[5] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen, J. Yi, W. Zhao, X. Wang, Z. Liu, H.-T. Zheng, J. Chen, Y. Liu, J. Tang, J. Li, M. Sun, Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models, 2022. URL: https://arxiv.org/abs/2203.06904. arXiv:2203.06904.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[7] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, J. Tang, GPT understands, too, 2023. URL: https://arxiv.org/abs/2103.10385. arXiv:2103.10385.
[8] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, 2021. URL: https://arxiv.org/abs/2006.03654. arXiv:2006.03654.
[9] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, 2020. URL: https://arxiv.org/abs/1909.11942. arXiv:1909.11942.
[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL: https://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[11] X. Liang, L. Wu, J. Li, Y. Wang, Q. Meng, T. Qin, W. Chen, M. Zhang, T.-Y. Liu, R-drop: Regularized dropout for neural networks, 2021. URL: https://arxiv.org/abs/2106.14448. arXiv:2106.14448.