Continual Transfer Learning With Progress Prompt for Multi-Author Writing Style Analysis

Zhanhong Ye1,†, Yutong Zhong1, Chen Huang2 and Leilei Kong1,†
1 Foshan University, Foshan, Guangdong, China
2 Zhongnan University of Economics and Law, Wuhan, Hubei, China

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† Corresponding author
Email: chinwang.yip@gmail.com (Z. Ye); yutongz115@gmail.com (Y. Zhong); 1141460892@qq.com (C. Huang); kongleilei@fosu.edu.cn (L. Kong)
ORCID: 0009-0001-4094-006X (Z. Ye); 0009-0003-1694-9800 (Y. Zhong); 0009-0008-5942-0006 (C. Huang); 0000-0002-4636-3507 (L. Kong)

Abstract
This paper introduces a method that uses forward knowledge transfer in continual learning to address the Multi-Author Writing Style Analysis task at PAN 2024. The motivation is to transfer knowledge of varying difficulty levels to the current training task. To this end, we train the model on task sequences composed of datasets of different difficulty levels, so that knowledge of varying difficulty is gradually transferred to the task currently being trained. We evaluate the approach on the Multi-Author Writing Style Analysis datasets provided by PAN and select the model weights with the best validation set performance from each sequence. We achieve F1 scores of 0.993, 0.830, and 0.832 on the test sets of the three difficulty levels.

Keywords: PAN 2024, Multi-Author Writing Style Analysis 2024, continual learning, transfer learning

1. Introduction

Multi-author style identification involves determining whether the writing styles of two authors are consistent. Specifically, the style change detection task aims to determine whether the writing style changes between two consecutive paragraphs in a given multi-author document. Multi-author writing style analysis is extensively applied in plagiarism detection and author identification [1]. Furthermore, style change detection can aid in uncovering anonymous authorships, verifying claimed authorships, or developing new technologies for writing support.

Recent studies [2, 3] have employed multi-task learning (MTL), which solves multiple tasks jointly. However, one of the biggest challenges in MTL is balancing the convergence schedule across tasks: differences in task difficulty can cause some tasks to converge faster than others. As a result, even though the three difficulty levels of datasets provided by PAN [4, 5] can all be accessed simultaneously, it is sub-optimal to directly mix them together and train with the MTL method [2].

Progress prompts [6] differ from traditional MTL in that they turn multiple tasks into a sequential learning process, which avoids the sub-optimal outcomes often associated with training all tasks simultaneously. Hence, progress prompt methods are a better solution than simply summing the losses of all tasks, which is typically sub-optimal [2], especially when each dataset is evaluated independently rather than jointly. Progress prompts [6] combine prompt tuning [7] with continual learning [8]: they retain a learnable soft prompt [9] for each incoming task and sequentially concatenate it with previously trained soft prompts. The purpose of this approach is to facilitate forward knowledge transfer [10], focusing on learning multiple tasks sequentially rather than simultaneously.

In this paper, we leverage the progress prompts method [6] to transfer knowledge of varying difficulty levels from previous tasks to the current task using learnable soft prompts. Different from the MTL method, we train a soft prompt for the current task and concatenate it with previously trained soft prompts, which allows knowledge to be transferred across datasets of varying difficulty levels. Regarding the model architecture, the model has three parts. The first part is the soft prompt parameters, which combine the parameters of the soft prompt for the current task with those of the soft prompts for the previous tasks. The second part is the deberta-v3-base [11] model, which handles the current task. The third part is the classifier with the classification loss.
2. Network Architecture

Let $T_i$, $i \in \{1, 2, 3\}$, denote the datasets. Each $T_i$ is a binary classification task for style change: we convert the easy, medium, and hard difficulty datasets of the Multi-Author Writing Style Analysis task [4, 5] into binary classification tasks. In any dataset $T_i$, the input $S_j$ is a paragraph pair and the output is 0 or 1, where 0 indicates that the paragraph pair has no style change and 1 indicates that it does. We then form a task sequence $(T_1, T_2, T_3)$ from the easy, medium, and hard datasets. The goal is to use the DeBERTa-v3 model to solve this binary style change classification sequentially over the task sequence: after training on $T_i$, we record the model's classification performance on $T_i$ and then proceed to the next task.

The core of the method is the progress prompts approach, which learns a distinct soft prompt [9] $P_i$ for each task $T_i$, $i \in \{1, 2, 3\}$. The parameters of the soft prompt $P_i$ are provided by the embedding layer of the pre-trained language model. In addition, we not only learn a soft prompt $P_i$ for each task $T_i$ but also concatenate it with all previously trained soft prompts $P_k$, $k < i \le 3$.

As shown in Figure 1, the model consists of an encoder block, a classifier, and the soft prompt parameters. The encoder block is the deberta-v3 [11] model, which encodes the input, i.e., pairs of paragraphs from the current difficulty dataset. The classifier is a linear layer that classifies the encoded content, making it possible to complete the current downstream task. The soft prompt parameters are initialized with parameters from the embedding layer of the deberta-v3 model. The details of the progress prompts are given in Section 2.1. Overall, the loss function $\mathcal{L}$ for training task $T_i$ is defined as

$$\mathcal{L} = \mathcal{L}_{ce} \qquad (1)$$

where $\mathcal{L}_{ce}$ is a cross-entropy loss used to optimize the encoder block, the classifier, and the soft prompt parameters.
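To make the three components concrete, the following is a minimal sketch of how they could be instantiated with PyTorch and Hugging Face Transformers. This is not the authors' released code: the helper name make_soft_prompt and the constants PROMPT_LEN and NUM_CLASSES are illustrative, and the prompt initialization anticipates the description in Section 2.1 (embeddings of the last m vocabulary tokens).

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/deberta-v3-base"
PROMPT_LEN = 10    # m pseudo tokens per task; the paper uses a prompt length of 10
NUM_CLASSES = 2    # no style change (0) vs. style change (1)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)                    # encoder block
classifier = nn.Linear(encoder.config.hidden_size, NUM_CLASSES)    # classification head
ce_loss = nn.CrossEntropyLoss()                                    # L = L_ce, Eq. (1)

def make_soft_prompt(m: int = PROMPT_LEN) -> nn.Parameter:
    """Initialize a soft prompt P_i from the embeddings of the last m vocabulary tokens."""
    embedding = encoder.get_input_embeddings()
    pseudo_ids = torch.arange(embedding.num_embeddings - m, embedding.num_embeddings)
    with torch.no_grad():
        init = embedding(pseudo_ids)                               # (m, hidden_size)
    return nn.Parameter(init.clone())                              # trainable prompt parameters
```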
2.1. Progress prompt

PAN provides three datasets of different difficulty levels for Multi-Author Writing Style Analysis. Given a batch $B$ drawn from the current training task $T_i$, its contents can be written as $B = \{(S_1, y_1), (S_2, y_2), \ldots, (S_j, y_j)\}$, where $S_j$ is a paragraph pair and $y_j$ is the corresponding label. We retain a learnable soft prompt $P_i$ for each task $T_i$ and sequentially concatenate it with all previously learned soft prompts $P_k$, $k < i \le 3$. The soft prompts are obtained through the embedding layer of the pre-trained language model: we select the last $m$ tokens of the vocabulary $V$ as pseudo tokens and pass them through the embedding layer to obtain the soft prompt parameters. We then combine $P_i$, the previous prompts $P_k$, and the current input embedding $e(S_j)$, where $e(S_j)$ denotes the input $S_j$ encoded by the embedding layer, and feed them to the pre-trained model, which consists of transformer [12] blocks, to obtain the corresponding hidden state $\mathcal{H}_j$.

[Figure 1: Model architecture — input embeddings, a soft prompt for each task, the encoder block, and the classifier; frozen and trainable parameters are indicated in the figure.]

$$\mathcal{H}_j = \mathrm{encoder}(P_i, P_{i-1}, \ldots, P_1, e(S_j)) \qquad (2)$$

After obtaining the hidden state $\mathcal{H}_j$, we use a classifier to generate the soft labels for each category:

$$\phi_j = \phi(S_j) = (y_j^1, y_j^2) = \left( \frac{e^{\varphi(\mathcal{H}_j)^1}}{\sum_{c=1}^{C} e^{\varphi(\mathcal{H}_j)^c}}, \; \frac{e^{\varphi(\mathcal{H}_j)^2}}{\sum_{c=1}^{C} e^{\varphi(\mathcal{H}_j)^c}} \right) \qquad (3)$$

where $\phi(\cdot)$ is the soft label of sample $j$, $\varphi(\cdot)^c$ is the output of the linear layer for category $c$, and $C$ is the total number of categories. We then compute the cross-entropy loss for classification:

$$\mathcal{L}_{ce} = - \sum_{(S_j, y_j) \in T_i} \log p(y_j \mid \phi(S_j), \theta, \theta_{P_i}) \qquad (4)$$

where $\theta$ denotes the parameters of the encoder and the classifier, and $\theta_{P_i}$ denotes the trainable parameters of the soft prompt for the $i$-th task in the embedding layer. Training with Equation (4) yields the final model $M_i$ for the current task $T_i$.

Since the PAN committee provides three datasets of varying difficulty for the Multi-Author Writing Style Analysis task, they can be arranged into six task sequences with different orders. We apply the proposed method to all six task sequences and then select and save, for each of the easy, medium, and hard datasets, the model weights that achieve the highest validation set performance across the six sequences.
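Under the same assumptions as the previous snippet (and reusing its hypothetical tokenizer, encoder, classifier, and ce_loss), the following sketch shows one training step corresponding to Equations (2)–(4). The choice to classify from the hidden state at the first text token ([CLS]) and the use of inputs_embeds to prepend the prompts are implementation assumptions, not details stated in the paper.

```python
def training_step(paragraph_pairs, labels, prompts, optimizer):
    """One optimization step on a batch from task T_i.

    paragraph_pairs: list of (paragraph_1, paragraph_2) strings.
    labels:          list of ints (0 = no style change, 1 = style change).
    prompts:         list of soft prompts [P_1, ..., P_i]; only the parameters passed
                     to the optimizer are updated (per the paper: the encoder, the
                     classifier, and the current task's prompt P_i).
    """
    enc = tokenizer(
        [p[0] for p in paragraph_pairs],
        [p[1] for p in paragraph_pairs],
        padding=True, truncation=True, max_length=512, return_tensors="pt",
    )
    input_emb = encoder.get_input_embeddings()(enc["input_ids"])        # e(S_j)
    batch_size = input_emb.size(0)

    # Concatenate all soft prompts in front of the input embeddings, Eq. (2).
    prompt_emb = torch.cat(list(prompts), dim=0).unsqueeze(0).expand(batch_size, -1, -1)
    inputs_embeds = torch.cat([prompt_emb, input_emb], dim=1)
    prompt_mask = torch.ones(batch_size, prompt_emb.size(1), dtype=enc["attention_mask"].dtype)
    attention_mask = torch.cat([prompt_mask, enc["attention_mask"]], dim=1)

    hidden = encoder(inputs_embeds=inputs_embeds,
                     attention_mask=attention_mask).last_hidden_state   # H_j
    cls_state = hidden[:, prompt_emb.size(1), :]      # hidden state at the [CLS] position
    logits = classifier(cls_state)                    # phi(H_j), Eq. (3) before the softmax
    loss = ce_loss(logits, torch.as_tensor(labels))   # softmax + NLL, i.e. Eqs. (3)-(4)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full run the optimizer would be built over the encoder, the classifier, and the current prompt only, e.g. torch.optim.AdamW(list(encoder.parameters()) + list(classifier.parameters()) + [prompts[-1]], lr=...), which keeps earlier prompts fixed, consistent with Equation (4) listing only the current $\theta_{P_i}$ among the trainable prompt parameters.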
3. Experiments and Results

3.1. Data analysis

The PAN organizers provide all data in three difficulty levels: easy, medium, and hard. Each difficulty level is split into a training set, a validation set, and a test set containing 70%, 15%, and 15% of the data, respectively, and a statistical analysis shows that most entries are shorter than 512 tokens. We organize the data according to the method described in Section 2.1.

In addition, when a document in any of the three difficulty levels is written by only two authors (which is given in the ground truth), it is possible to further infer which author wrote each paragraph of the document. Therefore, besides using consecutive pairs of paragraphs as paragraph pairs, we add non-consecutive pairs of paragraphs to the paragraph pair set and assign them labels based on the inferred authorship: if both paragraphs are attributed to the same author, the style is assumed not to have changed, and vice versa. A sketch of this pairing scheme is given below.
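The following is a minimal sketch of the pair construction just described. It assumes each document is given as a list of paragraphs together with the ground-truth "changes" list (one 0/1 label per consecutive paragraph boundary) and the number of authors; the function names build_pairs and infer_authors and the augment flag are illustrative, and the author-inference step is one straightforward way to realize the analysis described above.

```python
from itertools import combinations

def infer_authors(changes):
    """For a two-author document, assign a pseudo author id to every paragraph by
    flipping the id at each boundary the ground truth marks as a style change."""
    authors = [0]
    for change in changes:
        authors.append(authors[-1] ^ change)
    return authors

def build_pairs(paragraphs, changes, num_authors, augment=True):
    """Return (paragraph_pair, label) tuples; label 1 = style change, 0 = no change."""
    # Consecutive pairs, labelled directly by the ground-truth "changes" list.
    pairs = [((paragraphs[k], paragraphs[k + 1]), changes[k])
             for k in range(len(paragraphs) - 1)]
    # For two-author documents, the authorship of every paragraph can be inferred, so
    # non-consecutive pairs can be added: same author -> 0, different authors -> 1.
    if augment and num_authors == 2:
        authors = infer_authors(changes)
        for a, b in combinations(range(len(paragraphs)), 2):
            if b - a > 1:
                pairs.append(((paragraphs[a], paragraphs[b]),
                              int(authors[a] != authors[b])))
    return pairs
```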
3.2. Experiment setting

In this work, the deberta-v3-base model is selected for classification. It consists of 12 transformer encoder layers with a hidden size of 768. The three difficulty datasets are arranged into six task sequences, as shown in Table 1, and the model is trained sequentially on the datasets of each sequence in the given order. We set the early stopping patience to 10, use a prompt length of 10 tokens for each difficulty dataset, and set the learning rates to 5e-5, 3e-5, and 3e-5 for the three datasets, respectively. All experiments are conducted on an NVIDIA A800 GPU with 80 GB of memory and a batch size of 64.

Table 1
Task sequences.
order   Task sequence
i       medium → hard → easy
ii      hard → medium → easy
iii     easy → hard → medium
iv      hard → easy → medium
v       easy → medium → hard
vi      medium → easy → hard

3.3. Results

We conduct four experiments on the validation sets: the fine-tuning baseline with deberta-v3 together with the best validation performance of our method (Table 2), all sequences with data augmentation (Table 3), all sequences without data augmentation (Table 4), and all sequences in which only part of the datasets use data augmentation (Table 5). We then select the model weights that achieve the highest F1 scores on the validation sets of the corresponding difficulty levels across all sequences and submit these to the TIRA platform [13]. The final test set results are presented in Table 6. A high-level sketch of the sequential training and checkpoint selection procedure is given after Table 6.

Table 2
The fine-tuning baseline with deberta-v3-base and the best validation set performance of our method.
task     fine-tune   best validation score with our method
easy     96.9        99.6
medium   83.7        84.5
hard     83.4        84.1

Table 3
Validation F1 scores of the different datasets across all sequences, with data augmentation.
order   F1-score
i       medium 83.9 → hard 77.3 → easy 97
ii      hard 76 → medium 84.5 → easy 96.7
iii     easy 98.3 → hard 80.7 → medium 82.7
iv      hard 75.2 → easy 96 → medium 83.4
v       easy 98.5 → medium 82.8 → hard 68.6
vi      medium 83.7 → easy 97.7 → hard 77.6

Table 4
Validation F1 scores of the different datasets across all sequences, without data augmentation.
order   F1-score
i       medium 84.1 → hard 83.4 → easy 98.9
ii      hard 83.2 → medium 82.9 → easy 99.4
iii     easy 98.1 → hard 81.4 → medium 82.6
iv      hard 83.9 → easy 99.6 → medium 83.7
v       easy 97.5 → medium 81.1 → hard 82.2
vi      medium 83.8 → easy 98.4 → hard 83.7

Table 5
Validation F1 scores of the different datasets across all sequences when only part of the datasets use data augmentation; entries marked with * are trained with data augmentation.
order   F1-score
i       medium *83.3 → hard 83 → easy 98.4
ii      hard 83.2 → medium *82.9 → easy 99.4
iii     easy 97.3 → hard 81.3 → medium *82.5
iv      hard 83.3 → easy 99.2 → medium *83.7
v       easy 94.6 → medium *80.6 → hard 79.8
vi      medium *83.7 → easy 98 → hard 84.1

Table 6
Overview of the F1 scores for the multi-author writing style analysis task on the test sets. "combine" in an approach name means that validation set data were added to the training set before training; the number (or "full") after "combine-va" specifies how much was added.
Approach                      Task 1   Task 2   Task 3
combine-va-1500-te-hard       —        —        0.828
combine-va-1500-te-medium     —        0.820    —
combine-va-1500-te-easy       0.988    —        —
combine-va-full-te-hard       —        —        0.826
combine-va-full-te-medium     —        0.825    —
combine-va-full-te-easy       0.989    —        —
combine-va-750-te-hard        —        —        0.817
combine-va-750-te-medium      —        0.824    —
combine-va-750-te-easy        0.993    —        —
hard-test-final-score         —        —        0.832
medium-test-final-score       —        0.830    —
easy-test-final-score         0.991    —        —
Baseline Predict 1            0.466    0.343    0.320
Baseline Predict 0            0.112    0.323    0.346
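The overall protocol of Sections 3.2 and 3.3 can be summarized by the following high-level sketch. It reuses the hypothetical helpers from the earlier snippets (make_soft_prompt, training_step, encoder, classifier); the per-task data loaders and the evaluate_f1 routine are assumed to exist, the mapping of the three learning rates onto the easy, medium, and hard datasets is an assumption, and epochs, early stopping (patience 10), and re-initialization of the encoder between sequences are omitted for brevity.

```python
import copy
import torch

# Sequence orders from Table 1.
SEQUENCES = [
    ("medium", "hard", "easy"), ("hard", "medium", "easy"),
    ("easy", "hard", "medium"), ("hard", "easy", "medium"),
    ("easy", "medium", "hard"), ("medium", "easy", "hard"),
]
# Assumed mapping of the learning rates 5e-5, 3e-5, 3e-5 onto the three datasets.
LEARNING_RATES = {"easy": 5e-5, "medium": 3e-5, "hard": 3e-5}

def run_sequence(order, train_loaders, evaluate_f1, best):
    """Train sequentially on one task sequence and track the best checkpoint per difficulty.

    train_loaders: assumed dict mapping a difficulty to batches of (paragraph_pairs, labels).
    evaluate_f1:   assumed callable returning the validation F1 for a given difficulty.
    best:          dict holding the highest validation F1 (and weights) seen so far.
    """
    prompts = []                                    # soft prompts grow along the sequence
    for task in order:
        prompts.append(make_soft_prompt())          # new prompt P_i for the current task
        optimizer = torch.optim.AdamW(
            list(encoder.parameters()) + list(classifier.parameters()) + [prompts[-1]],
            lr=LEARNING_RATES[task],
        )
        for pairs, labels in train_loaders[task]:   # epochs / early stopping omitted
            training_step(pairs, labels, prompts, optimizer)
        f1 = evaluate_f1(task, prompts)
        if f1 > best.get(task, (0.0, None))[0]:     # keep the best weights per difficulty
            best[task] = (f1, copy.deepcopy(encoder.state_dict()))
```

Running run_sequence for all six orders in SEQUENCES and then submitting, for each difficulty, the weights stored in best mirrors the selection procedure described above; in practice the classifier and prompt weights would be saved alongside the encoder state.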
4. Conclusion

In this paper, we employed the progress prompt method to tackle the Multi-Author Writing Style Analysis task set by PAN. Instead of using traditional multi-task learning (MTL) techniques, we use the progress prompt method to transfer knowledge from datasets of varying difficulties to the dataset currently being trained. The proposed method achieves F1 scores of 0.993, 0.830, and 0.832 on the three test datasets. These results validate the effectiveness of the proposed method for the Multi-Author Writing Style Analysis task.

Acknowledgments

This research was supported by the Natural Science Foundation of Guangdong Province, China (No. 2022A1515011544).

References

[1] Z. Ye, C. Zhong, H. Qi, Y. Han, Supervised contrastive learning for multi-author writing style analysis, in: Conference and Labs of the Evaluation Forum, 2023.
[2] W. Liu, S. Rajagopalan, P. Nigam, J. Singh, X. Sun, Y. Xu, B. Zeng, T. Chilimbi, Asynchronous convergence in multi-task learning via knowledge distillation from converged tasks, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, 2022, pp. 149–159.
[3] A. Hashemi, W. Shi, Enhancing writing style change detection using transformer-based models and data augmentation, Working Notes of CLEF (2023).
[4] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Multi-Author Writing Style Analysis Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[5] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[6] A. Razdaibiedina, Y. Mao, R. Hou, M. Khabsa, M. Lewis, A. Almahairi, Progressive prompts: Continual learning for language models, arXiv preprint arXiv:2301.12314 (2023).
[7] B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter-efficient prompt tuning, arXiv preprint arXiv:2104.08691 (2021).
[8] S. Thrun, Lifelong learning algorithms, in: Learning to Learn, Springer, 1998, pp. 181–209.
[9] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, J. Tang, GPT understands, too, AI Open (2023).
[10] Z. Ke, B. Liu, N. Ma, H. Xu, L. Shu, Achieving forgetting prevention and knowledge transfer in continual learning, Advances in Neural Information Processing Systems 34 (2021) 22443–22456.
[11] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[13] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.