Team baker at PAN: Enhancing Writing Style Change Detection with Virtual Softmax
Notebook for the PAN Lab at CLEF 2024

Bingpei Wu, Yong Han†, Kai Yan and Haoliang Qi
Foshan University, Foshan, China

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† Corresponding author
Email: wubingpei0819@gmail.com (B. Wu); hanyong2005@fosu.edu.cn (Y. Han); yankai@fosu.edu.cn (K. Yan); qihaoliang@fosu.edu.cn (H. Qi)
ORCID: 0009-0004-6281-4322 (B. Wu); 0000-0002-9416-2398 (Y. Han); 0000-0002-4960-7108 (K. Yan); 0000-0003-1321-5820 (H. Qi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
This paper introduces the application of Virtual Softmax to the PAN 2024 multi-author writing style analysis task. We found that the tasks in which paragraphs share the same topic are particularly challenging, because many samples lie near the classification boundary. To address this problem, we integrated Virtual Softmax into a Transformer architecture to provide additional feature supervision, enhancing the discriminative ability of the model. Finally, we achieved F1 scores higher than the baseline on all three tasks of the official test set.

Keywords
Style Change Detection, Pre-trained Model, Virtual Softmax

1. Introduction

Multi-author writing style change detection involves identifying whether different paragraphs within the same document are written by different authors [1]. This technique holds significant importance in academic research and has extensive practical applications. For instance, in academia, style change detection can be used to detect plagiarism in scholarly papers; in the publishing industry, it can help identify ghostwriting; and in the legal field, it can assist in verifying the authenticity of documents. Therefore, further research on and enhancement of style change detection technology can not only advance academic research but also provide robust technical support across various domains.

In the edition of this task organized by PAN [2], multi-author writing style change detection is divided into three tasks:

· Task 1: The paragraphs of a document cover different topics.
· Task 2: The paragraphs of a document may cover different topics or the same topic.
· Task 3: All paragraphs in a document are on the same topic.

In this paper, we find that author changes are hardest to detect when paragraphs share the same topic, because some samples lie near the classification boundary and are difficult to separate. We utilize Virtual Softmax [3] to enlarge the inter-class margin and compress the intra-class distribution. Concretely, we integrate Virtual Softmax into a RoBERTa [4] architecture to provide feature supervision during training, thereby enhancing the discriminative ability of the learned features.

2. Related work

With the rise and improvement of large language model technology, the methods for handling complex tasks have undergone significant changes. Traditional work, such as the research by Gómez-Adorno et al. [5], relied on the design of stylometric features and the use of machine learning methods for prediction. The focus has now shifted towards fine-tuning large language models using various techniques. For instance, at PAN 2023, the work of Ye et al. [6] employed contrastive learning for supervised fine-tuning, achieving remarkable results.

In our work, we compared the performance of three pre-trained Transformer-based models on Task 1 to select the most suitable base model. Additionally, we applied data augmentation techniques and used Virtual Softmax to improve the model's handling of samples near the classification boundary.

3. Method

3.1. Network Architecture

The Transformer [7] architecture is technologically mature and includes well-established models such as BERT [8], RoBERTa [4], and DistilBERT [9]. Pre-trained on large corpora, these models have strong contextual understanding capabilities. We compared the F1 performance of the three models under the same settings on Task 1 and chose RoBERTa as the base model, as shown in Table 1.

Table 1
Task 1 performance comparison

Model        Task 1 F1
BERT         0.9155
DistilBERT   0.9164
RoBERTa      0.9741

In our work, we use the RoBERTa-base model as the encoder for the input text. The input paragraphs are first tokenized and then fed into the RoBERTa model for encoding. The pooled output of the [CLS] token represents the contextual features of the entire paragraph pair. This output is then passed through a Virtual Softmax layer, and the model is trained with a cross-entropy loss function to perform our classification task. During training, the extracted paragraph features are fed into the Virtual Softmax layer, turning the binary problem into a three-class classification task: an additional virtual class is introduced to provide feature supervision, which compresses the feature regions of the two real classes and thereby enforces stricter boundary constraints.

Figure 1: Model architecture.
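As a rough illustration of this pipeline, the following PyTorch sketch (assuming the HuggingFace transformers library; the class and variable names are our own and not the authors' code) shows a RoBERTa-base encoder whose pooled [CLS] feature feeds a bias-free linear classifier over the two real classes. The virtual class enters only through the training loss, sketched in Section 3.2.

```python
# Minimal sketch of the described architecture, assuming the HuggingFace
# `transformers` library; names are illustrative, not the authors' code.
import torch.nn as nn
from transformers import RobertaModel

class StyleChangeClassifier(nn.Module):
    """RoBERTa encoder followed by a linear classifier over the two real
    classes (author change / no change); the virtual class is added later,
    inside the training loss (Section 3.2)."""

    def __init__(self, num_classes: int = 2, hidden_size: int = 768):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        # Bias-free weight matrix W, as in the Virtual Softmax formulation,
        # so that the logits are plain inner products W^T x.
        self.classifier = nn.Linear(hidden_size, num_classes, bias=False)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled representation of the first ([CLS]) token serves as the
        # feature for the whole paragraph pair.
        feature = out.pooler_output            # (batch, hidden_size)
        logits = self.classifier(feature)      # (batch, num_classes)
        return feature, logits
```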
3.2. Virtual Softmax

To enhance the model's discriminative power, we integrate a Virtual Softmax layer. During the training phase, no additional processing is performed on the input data; a virtual class is directly added alongside the real classes. During the evaluation phase, we choose the category with the highest probability among the non-virtual classes. The virtual class does not correspond to an actual category but is used to increase the difficulty of the training objective, thereby strengthening the model's discriminative abilities. The core idea is to inject the virtual class as a form of noise, forcing the model to generalize better when faced with real data. Specifically, for a given classification task, the injected virtual class introduces a new and tighter decision boundary for the original classes, compressing their intra-class distribution.
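A minimal sketch of this loss is shown below, following the original Virtual Softmax formulation of Chen et al. [3], in which the virtual logit is ||W_y|| · ||x|| for the true-class weight W_y and feature x; the authors' exact implementation may differ, and the function name is ours.

```python
# Sketch of the Virtual Softmax loss (Chen et al. [3]): during training,
# a virtual logit ||W_y|| * ||x|| is appended before cross-entropy.
import torch
import torch.nn.functional as F

def virtual_softmax_loss(feature, logits, weight, labels):
    """feature: (B, D) pooled [CLS] features
    logits:  (B, C) real-class logits W^T x (no bias)
    weight:  (C, D) classifier weight matrix W
    labels:  (B,)   ground-truth class indices"""
    w_y = weight[labels]                                # (B, D) true-class weights
    virt_logit = w_y.norm(dim=1) * feature.norm(dim=1)  # ||W_y|| * ||x||, (B,)
    # Append the virtual class as an extra column -> (B, C + 1).
    all_logits = torch.cat([logits, virt_logit.unsqueeze(1)], dim=1)
    # Standard cross-entropy against the true (real) label; the virtual
    # class only tightens the decision boundary and is never the target.
    return F.cross_entropy(all_logits, labels)
```

In a training loop this would be called as `loss = virtual_softmax_loss(feature, logits, model.classifier.weight, labels)`; at evaluation time the virtual logit is simply dropped and the prediction is the argmax over the real-class logits, matching the procedure described above.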
4. Experiments

4.1. Experimental settings

In this work, we select the RoBERTa-base model [4] for classification. The model consists of 12 layers and 12 attention heads, with a hidden size of 768. The maximum sequence length is set to 256, the learning rate to 1e-5, and the batch size to 32. These settings are identical across the comparative experiments with BERT, RoBERTa, and DistilBERT.

4.2. Data preparation

In the data provided by PAN, the three tasks, categorized by difficulty (Task 1 easy, Task 2 medium, Task 3 hard), were each divided into a training set (70%), a validation set (15%), and a test set (15%). The training set for each difficulty consists of 4200 documents, each composed of multiple paragraphs. We join two adjacent paragraphs of the same document with a separator token ([CLS]) to form one sample, and label it with whether the author changes between the two paragraphs. Samples constructed in this way consist only of adjacent paragraphs.

We further extended the dataset by linking non-adjacent paragraphs, using the number of authors and the ground-truth change annotations of a document to decide which paragraphs may be joined. For example, if there is no author change across three consecutive paragraphs, the first and third paragraphs can form a new (no-change) sample. Table 2 shows the resulting training set sizes; a sketch of the construction follows the table.

Table 2
Data processing and augmentation (number of samples)

                     Task 1   Task 2   Task 3
train-set            11,061   21,906   19,009
train-set-augment    13,504   26,695   24,548
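The following sketch illustrates both the adjacent-pair construction and the augmentation rule. The helper names and the exact input format are our assumptions: we mirror the PAN ground truth as a per-boundary change list (1 = author change between consecutive paragraphs), and the separator string follows the paper's description.

```python
# Sketch of sample construction and augmentation; input format and helper
# names are assumptions, mirroring the PAN per-boundary `changes` labels.
from typing import List, Tuple

SEP = " [CLS] "  # separator described in the paper; exact token is an assumption

def build_pairs(paragraphs: List[str], changes: List[int]) -> List[Tuple[str, int]]:
    """One sample per adjacent paragraph pair, labelled with whether the
    author changes between the two paragraphs."""
    return [(paragraphs[i] + SEP + paragraphs[i + 1], changes[i])
            for i in range(len(paragraphs) - 1)]

def augment_pairs(paragraphs: List[str], changes: List[int]) -> List[Tuple[str, int]]:
    """Extra no-change samples from non-adjacent paragraphs: if there is no
    author change across three consecutive paragraphs, the first and third
    paragraphs also form a (no-change) pair."""
    extra = []
    for i in range(len(paragraphs) - 2):
        if changes[i] == 0 and changes[i + 1] == 0:
            extra.append((paragraphs[i] + SEP + paragraphs[i + 2], 0))
    return extra
```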
4.3. Results

We submitted the model to TIRA [10] for execution to obtain the final metrics. Table 3 shows the F1 scores obtained by our model on the official validation set, together with two methods from PAN 2023, Chen et al. [11] and Jacobo et al. [12], and the two trivial baselines. In Task 1 and Task 2, authors can be distinguished by capturing characteristics of the topic. Due to the architectural limitations of the Transformer, style changes and long-range context dependencies cannot be fully captured; this results in a lower F1 score on Task 3, where each pair of paragraphs shares the same topic and keywords.

Table 3
Validation set results (F1)

Approach               Task 1   Task 2   Task 3
Our method             0.976    0.816    0.770
Chen et al. [11]       0.914    0.820    0.676
Jacobo et al. [12]     0.793    0.591    0.498
Baseline (predict 1)   0.466    0.343    0.320
Baseline (predict 0)   0.112    0.323    0.346

5. Conclusion

In this paper, we presented a RoBERTa-based model enhanced with Virtual Softmax for detecting style changes in multi-author documents. Our approach showed particular promise in the more challenging scenario where all paragraphs of a document share the same topic. By injecting an additional virtual class, we improved the model's ability to distinguish between different authors, thereby enhancing the robustness and accuracy of style change detection.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 62276064).

References

[1] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Multi-Author Writing Style Analysis Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[2] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[3] B. Chen, W. Deng, H. Shen, Virtual class enhanced discriminative embedding learning, Advances in Neural Information Processing Systems 31 (2018).
[4] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[5] H. Gómez-Adorno, J.-P. Posadas-Duran, G. Ríos-Toledo, G. Sidorov, G. Sierra, Stylometry-based approach for detecting writing style changes in literary texts, Computación y Sistemas 22 (2018) 47–53.
[6] Z. Ye, C. Zhong, H. Qi, Y. Han, Supervised contrastive learning for multi-author writing style analysis, in: Conference and Labs of the Evaluation Forum (CLEF), 2023.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[10] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.
[11] H. Chen, Z. Han, Z. Li, Y. Han, A writing style embedding based on contrastive learning for multi-author writing style analysis, in: Conference and Labs of the Evaluation Forum (CLEF), 2023.
[12] G. X. Jacobo, V. Dehesa-Corona, A. D. Rojas-Reyes, H. Gómez-Adorno, Authorship verification machine learning methods for style change detection in texts, in: Conference and Labs of the Evaluation Forum (CLEF), 2023.