                         Team fosu-stu at PAN: Supervised Fine-Tuning of Large
                         Language Models for Multi Author Writing Style Analysis
                         Notebook for the PAN Lab at CLEF 2024

                         Jiajun Lv, Yusheng Yi* and Haoliang Qi
                         Foshan University, Foshan, China


                                      Abstract
                                       This paper applies large language models with label-supervised classification to the Multi-Author
                                       Writing Style Analysis task. Large-scale pre-training and growing parameter counts have endowed
                                       large language models with remarkable emergent capabilities, yet their performance on specific
                                       downstream tasks still leaves room for improvement. Our motivation is to exploit these capabilities
                                       in natural language processing and to further improve performance on this specific task through
                                       label-supervised classification training.

                                      Keywords
                                      Multi-Author Writing Style Analysis, Large language models, Low-Rank Adaptation




                         1. Introduction
                         The rapid development of the internet has made plagiarism increasingly easy. Without reference corpora,
                         multi-author writing style analysis is an effective method for detecting plagiarism[1]. Multi-author
                         style analysis aims to identify changes in writing style within a document attributed to different authors.
                         Research indicates that by analyzing an author’s writing style, a document can be segmented into parts
                         written by different authors, essentially performing an intrinsic style analysis task[1].
                            Since 2016, the PAN committee has organized an annual multi-author writing style analysis task.
                         Participants must identify the positions of writing style changes, using variations in style and similarities
                         in paragraph topics as indicators. In the PAN24: Multi-Author Writing Style Analysis task, participants
                         need to address the following intrinsic style change detection task: identify all paragraph-level positions
                         in a given text where there are changes in writing style[2].
                             In this study, we employ low-rank adaptation for efficient, label-supervised fine-tuning of large
                          language models to address the PAN Multi-Author Writing Style Analysis task within the CLEF 2024
                          challenge. The task is run on three datasets whose difficulty increases with the similarity between
                          paragraph topics.


                         2. Related Work
                          Among recent approaches to the Multi-Author Writing Style Analysis task[3][4], Ye et al. [5] used
                          supervised contrastive learning with p-tuning to enhance performance, Hashemi et al. [6] adopted
                          data augmentation and multi-model fusion to improve model performance, and Huang et al. [7]
                          employed knowledge distillation to compress the teacher model mT0-large, leveraging the
                          generalization capabilities of large language models to improve performance metrics. These methods
                          indicate that models with more base parameters and more sophisticated techniques generally perform better.
                            Since the rise of large language models (LLMs) represented by ChatGPT, LLMs have shown great
                         potential in natural language processing [8]. Previous studies [9][10][11] have utilized LLMs’ in-context


                          CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                          * Corresponding author.
                           lvjiajun.96@gmail.com (J. Lv); yiys@fosu.edu.cn (Y. Yi); qihaoliang@fosu.edu.cn (H. Qi)
                            0000-0002-8755-5310 (J. Lv); 0009-0006-7098-3681 (Y. Yi); 0000-0003-1321-5820 (H. Qi)
                                   © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


learning capabilities for text classification and achieved significant results. However, generation-centered
architectures may not capture task-specific patterns as effectively as label-supervised BERT[12] models.
Inspired by fine-tuned BERT-family models on classification tasks, this study explores label-supervised
fine-tuning of LLMs, aiming to leverage their advantages in multi-author writing style analysis. We
compress the model using quantization and low-rank adaptation to reduce the cost of model training
and system deployment.


3. Data processing
In the PAN24 task of writing style analysis[2], participants are required to identify changes in writing
style at the paragraph level and find all the locations where these changes occur. The organizers
have strictly controlled the changes in author identity and topic, and provided datasets with three
levels of difficulty. To this end, given a document D, we split it into multiple text segments at line
breaks, represented as the set {p_1, p_2, p_3, ..., p_n}. We then combine each segment with its adjacent
segment to form n − 1 text pairs, represented as the set {(p_1, p_2), (p_2, p_3), (p_3, p_4), ..., (p_{n-1}, p_n)}.
For text pairs whose combined length exceeds 512 characters, we truncate both segments evenly so that
the pair fits within 512 characters.
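
For illustration, the snippet below sketches this pairing step. The function name is ours, and the
50/50 split of the 512-character budget is one possible reading of "truncate evenly", not necessarily
the exact rule used in our pipeline:

    # Sketch of the paragraph-pairing step; names and the 50/50 truncation split are
    # illustrative assumptions, not the exact implementation used for our submission.
    def build_pairs(document: str, max_len: int = 512):
        """Split a document at line breaks and pair each segment with its successor."""
        segments = [p for p in document.split("\n") if p.strip()]
        pairs = list(zip(segments[:-1], segments[1:]))  # (p1, p2), (p2, p3), ..., (p_{n-1}, p_n)
        truncated = []
        for a, b in pairs:
            if len(a) + len(b) > max_len:
                # Share the length budget evenly between the two segments.
                a, b = a[: max_len // 2], b[: max_len // 2]
            truncated.append((a, b))
        return truncated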


4. Method
Our approach is illustrated in Figure 1. We use the LLaMA-3-8B decoder [13] and obtain vector
representations from its last hidden layer. These representations are mapped to the label space through
a feedforward layer, producing the probabilities used for label classification. The model is updated by
computing the cross-entropy loss, with low-rank adaptation used for parameter-efficient fine-tuning.


[Figure 1 diagram: stacked LLaMA decoder blocks (causal-masked multi-head self-attention and feed-forward
layers) whose output feeds a linear layer and a softmax; LoRA matrices A and B are attached to the frozen
pretrained attention projections W_q and W_k.]
Figure 1: Details of the model architecture: the label-supervised fine-tuning architecture for large language models.



4.1. Label-supervised fine-tuning
Given an input text pair (p_i, p_{i+1}), we concatenate the two texts and feed them to the tokenizer,
which applies byte-pair encoding to produce the text encoding x_i. We then pass x_i through the decoder
and extract the hidden-state representation H_i for sequence classification.
                                          x_i = Tokenizer(p_i, p_{i+1})                                 (1)

                                          H_i = LlamaModel(x_i)                                         (2)
   The last token vector of the hidden states H_i is taken as the representation h_i used for sequence
classification.

                                          h_i = last(H_i)                                               (3)
   The sequence representation h_i is fed into a linear layer followed by a softmax layer, which maps it
to the label space and yields an output probability distribution p(y_i). The cross-entropy loss with
respect to the true label y_i is then computed, and the model parameters are updated.

                                          p(y_i) = softmax(f_Linear(h_i))                               (4)

                     L_ce = −(1/N) Σ_{i=1}^{N} [ y_i · log(p(y_i)) + (1 − y_i) · log(1 − p(y_i)) ]       (5)
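
For concreteness, the following is a minimal PyTorch sketch of Equations (3)-(5). The decoder output is
mocked with random tensors, the hidden size of 4096 corresponds to LLaMA-3-8B, and the binary
cross-entropy of Equation (5) is computed via the equivalent two-class cross-entropy loss:

    import torch
    import torch.nn as nn

    # Minimal sketch of the classification head in Eqs. (3)-(5); the decoder itself is
    # mocked here, and dimensions are illustrative (LLaMA-3-8B has hidden size 4096).
    d_model, num_labels = 4096, 2
    classifier = nn.Linear(d_model, num_labels)      # f_Linear, mapping to the label space
    loss_fn = nn.CrossEntropyLoss()                  # two-class cross-entropy, equivalent to Eq. (5)

    H = torch.randn(8, 512, d_model)                 # H_i: decoder hidden states (batch, seq, dim)
    h = H[:, -1, :]                                  # h_i = last(H_i); with padded batches, the last
                                                     # non-padding token would be selected instead
    logits = classifier(h)
    p = torch.softmax(logits, dim=-1)                # p(y_i), Eq. (4)
    y = torch.randint(0, num_labels, (8,))           # true labels y_i (1 = style change)
    loss = loss_fn(logits, y)                        # L_ce, Eq. (5); softmax is applied internally
    loss.backward()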


4.2. Low-Rank Adaptation
The standard full fine-tuning paradigm can require large numbers of GPUs working in parallel, which is
inefficient and unsustainable [14][15]. Parameter-Efficient Fine-Tuning (PEFT) methods have therefore
been proposed; they tune only a small fraction of the parameters [14] while aiming to match the
downstream performance of full fine-tuning.
  We adopt the low-rank decomposition shown in Figure 2, where the original pretrained model weights
are denoted W_0 ∈ R^{d×k}. Through the low-rank decomposition W_0 + ΔW = W_0 + BA, an additional
parameter matrix BA is introduced into the self-attention matrices W_q and W_k, where B ∈ R^{d×r},
A ∈ R^{r×k}, and the rank r ≪ min(d, k). During training, the pretrained weights are kept frozen, and
only the matrices A and B are updated.
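
The update can be illustrated with a short PyTorch sketch. This mirrors the formulation above (with the
usual alpha/r scaling from the LoRA paper [15]) rather than the exact Peft implementation used in our
experiments; shapes and initialization are illustrative:

    import torch
    import torch.nn as nn

    # Illustrative LoRA layer implementing W_0 x + B A x; only A and B are trainable.
    class LoRALinear(nn.Module):
        def __init__(self, d: int, k: int, r: int = 128, alpha: int = 128):
            super().__init__()
            self.W0 = nn.Linear(k, d, bias=False)            # pretrained W_0 in R^{d x k}, frozen
            self.W0.weight.requires_grad_(False)
            self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^{r x k}
            self.B = nn.Parameter(torch.zeros(d, r))         # B in R^{d x r}, zero-init so dW = 0 at start
            self.scaling = alpha / r                         # scaling used by LoRA

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.W0(x) + (x @ self.A.T @ self.B.T) * self.scaling

    layer = LoRALinear(d=4096, k=4096)
    out = layer(torch.randn(2, 4096))                        # output with the low-rank update applied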






Figure 2: Low-rank decomposition. For a pretrained weight matrix W, its updates are restricted to the
low-rank form ΔW = BA.




5. Experiments
5.1. Dataset analysis
We analyze the numbers of positive and negative samples among the text pairs generated by the data
processing step; the results are shown in Table 1.
   The analysis shows that, for each difficulty level, the ratio of positive to negative samples is broadly
similar in the training and validation sets. However, the class distribution of the Task 1 (easy) dataset
is highly imbalanced, with roughly ten positive samples for every negative one.
Table 1
Statistics of the original dataset
                                                   Training set       Validation set
                                      Datasets    #pos.    #neg.     #pos.    #neg.
                                      task 1      10098      967      2219      252
                                      task 2      12493     9420      2603     1989
                                      task 3       8917    10098      1887     2248


5.2. Experimental settings
In this paper, we chose Meta-Llama-3-8B as the pre-trained model and quantized it to int8. The model
was trained separately on the three task datasets, resulting in one model tailored to each task. Our
hyperparameter settings are shown in Table 2:

Table 2
The hyperparameters used for fine-tuning.
                                           Hyperparameter            Value
                                           rank                        128
                                           alpha                       128
                                           dropout                     0.1
                                           batch size                   64
                                           max sequence length         512
                                           initial learning rate      2e-5
                                           epochs                        3
                                           warmup rate                 0.1
                                           target modules          w_q, w_v

  We train and evaluate the models on an A800 80 GB GPU using the PyTorch deep learning framework
and the Peft efficient fine-tuning library[16].
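
As an illustration, the configuration in Table 2 can be expressed with the transformers and Peft APIs
roughly as follows. The module names q_proj and v_proj follow the Hugging Face LLaMA implementation
of w_q and w_v, and the exact arguments in our training scripts may differ:

    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
    from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

    # Load Meta-Llama-3-8B with a 2-label classification head, quantized to int8.
    model = AutoModelForSequenceClassification.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        num_labels=2,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )
    model = prepare_model_for_kbit_training(model)

    # LoRA settings mirroring Table 2.
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=128,
        lora_alpha=128,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],   # w_q, w_v
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()         # only the LoRA matrices are trainable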


6. Results
We use a fully fine-tuned deberta-base[17] model as the baseline for our experiments; the final scores
obtained by our method on the validation set are shown in Table 3.

Table 3
Overview of the F1 scores for multi-author writing style change detection on the validation set.
                                       Approach      Task 1   Task 2   Task 3
                                       our method     0.93     0.884    0.85
                                       baseline       0.987    0.839    0.821


   We finally submitted the models to the TIRA[18] platform for testing and obtained F1 scores for the
three tasks. The results are shown in Table 4: "alternating-vase" denotes the fully fine-tuned deberta-base
method, "quantum-ship" is the fine-tuning method proposed in this paper, and "equilateral-commit"
combines both via voting (a sketch of this combination is given after Table 4). "camel-clef" changes the
target-modules hyperparameter so that the w_q, w_k, w_v, w_o weights are all fine-tuned. Our analysis
shows that supervised fine-tuning of large language models surpasses the baseline on Task 2 and Task 3
but performs poorly on the Task 1 (easy) dataset. This weaker performance may be related to the
imbalanced class distribution of the easy dataset.
Table 4
Overview of the F1 scores for the multi-author writing style task, detecting the positions at which the author
changes, for task 1, task 2, and task 3.
                               Approach              Task 1   Task 2   Task 3
                               presto-branch          0.944    0.887    0.834
                               alternating-vase       0.987    0.826    0.821
                               camel-clef             0.987    0.885    0.852
                               equilateral-commit     0.987    0.887    0.834
                               Baseline Predict 1     0.466    0.343    0.320
                               Baseline Predict 0     0.112    0.323    0.346
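
The voting combination behind "equilateral-commit" can be sketched as follows; only the use of voting is
stated above, so the tie-break toward predicting a style change when the two systems disagree is our
illustrative assumption:

    # Illustrative hard vote over the two systems' binary predictions per paragraph pair
    # (1 = style change). The tie-break toward 1 on disagreement is an assumption.
    def combine_by_vote(pred_deberta, pred_llama):
        return [1 if (a + b) >= 1 else 0 for a, b in zip(pred_deberta, pred_llama)]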


7. Conclusion
This paper proposes a method for detecting changes in writing style based on a large language model
classifier, using label-supervised fine-tuning of the large language model. In addition, we compress the
model with LoRA and quantization to reduce training and inference costs. Experimental results show
that supervised fine-tuning of the large language model is effective for identifying multi-author style
changes.


Acknowledgments
This research was supported by the Natural Science Foundation of Guangdong Province, China
(No. 2022A1515011544).


References
 [1] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Ko-
     renčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova,
     E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024:
     Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking
     Analysis, and Generative AI Authorship Verification, in: L. Goeuriot, P. Mulhem, G. Quénot,
     D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro
     (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of
     the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in
     Computer Science, Springer, Berlin Heidelberg New York, 2024.
 [2] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Multi-Author Writing Style Analysis
     Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. Herrera (Eds.), Working Notes
     of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
 [3] J. Bevendorff, I. Borrego-Obrador, M. Chinea-Ríos, M. Franco-Salvador, M. Fröbe, A. Heini, K. Kre-
     dens, M. Mayerl, P. Pęzik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann,
     M. Wolska, E. Zangerle, Overview of PAN 2023: Authorship Verification, Multi-Author Writ-
     ing Style Analysis, Profiling Cryptocurrency Influencers, and Trigger Detection, in: A. Aram-
     patzis, E. Kanoulas, T. Tsikrika, A. G. S. Vrochidis, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli,
     N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceed-
     ings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), Lecture
     Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 459–481. URL:
     https://doi.org/10.1007/978-3-031-42448-9_29. doi:10.1007/978-3-031-42448-9_29.
 [4] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl, R. Ortega-
     Bueno, P. Pezik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska,
     E. Zangerle, Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype
     Spreaders, Style Change Detection, and Trigger Detection, in: A. Barrón-Cedeños, G. D. S.
     Martino, M. D. Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli,
     N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th
     International Conference of the CLEF Association (CLEF 2022), volume 13186 of Lecture Notes in
     Computer Science, Springer, 2022. URL: https://link.springer.com/book/10.1007/978-3-031-13643-6.
     doi:10.1007/978-3-031-13643-6.
 [5] Z. Ye, C. Zhong, H. Qi, Y. Han, Supervised Contrastive Learning for Multi-Author Writing Style
     Analysis, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of CLEF
     2023 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2023, pp. 2817–2822. URL:
     https://ceur-ws.org/Vol-3497/paper-237.pdf.
 [6] A. Hashemi, W. Shi, Enhancing writing style change detection using transformer-based models
     and data augmentation, Working Notes of CLEF (2023).
 [7] M. Huang, Z. Huang, L. Kong, Encoded classifier using knowledge distillation for multi-author
     writing style analysis, Working Notes of CLEF (2023).
 [8] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A
     survey of large language models, arXiv preprint arXiv:2303.18223 (2023).
 [9] X. Sun, X. Li, J. Li, F. Wu, S. Guo, T. Zhang, G. Wang, Text classification via large language models,
     arXiv preprint arXiv:2305.08377 (2023).
[10] Y. Fei, Y. Hou, Z. Chen, A. Bosselut, Mitigating label biases for in-context learning, arXiv preprint
     arXiv:2305.19148 (2023).
[11] K. Margatina, T. Schick, N. Aletras, J. Dwivedi-Yu, Active learning principles for in-context learning
     with large language models, arXiv preprint arXiv:2305.14264 (2023).
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
     for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[13] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
     E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint
     arXiv:2302.13971 (2023).
[14] Z. Han, C. Gao, J. Liu, S. Q. Zhang, et al., Parameter-efficient fine-tuning for large models: A
     comprehensive survey, arXiv preprint arXiv:2403.14608 (2024).
[15] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank
     adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[16] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, Peft: State-of-the-art parameter-
     efficient fine-tuning methods, https://github.com/huggingface/peft, 2022.
[17] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, 2021.
     URL: https://arxiv.org/abs/2006.03654. arXiv:2006.03654.
[18] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
     Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot,
     F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances
     in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes
     in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/
     978-3-031-28241-6_20.