<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team cornell-1 at PAN: Ensembling Fine-Tuned Transformer Models for Writing Style Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Deniz Bölöni-Turgut</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dhriti Verma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire Cardie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cornell University</institution>
          ,
          <addr-line>Ithaca, NY 14853</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper describes our system for the Multi-Author Writing Style Analysis shared task of the PAN Lab at CLEF 2025. We design and train an ensemble model from multiple fine-tuned transformer models. Each model in the ensemble follows our custom BertStyleNN architecture, a PyTorch neural network consisting of a fine-tuned encoder model and a feed-forward neural network classification head. We train each BertStyleNN model end-to-end on a combined-difficulty (easy, medium, and hard) training dataset, using five different pre-trained feature extractors. We then conduct an exhaustive search over three ensembling methods and model combinations for each difficulty level. Our final system achieves a macro F1 of 0.8 averaged over the three difficulty levels, significantly outperforming the baseline.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2025</kwd>
        <kwd>multi-author style analysis</kwd>
        <kwd>sentence embeddings</kwd>
        <kwd>ensemble models</kwd>
        <kwd>transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Early techniques for style analysis employed manual feature engineering of lexical or syntactic
features [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. More recent work uses embeddings from pre-trained language models. Since many sentence
embedding models are trained with semantic similarity objectives, fine-tuning the pre-trained models
on data labeled for style change is common and often necessary.
      </p>
      <p>
        The goal of the PAN 2024 Multi-Author Style Analysis task was to identify style changes between
paragraphs, as opposed to between sentences as in the 2025 task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Of the top two submissions to the 2024
version of this task, one fine-tuned the open-source large language model Llama-3-8b [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with low-rank
adaptation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the other ensembled three pre-trained transformer models with additional semantic
similarity checks applied for the easy and medium difficulty levels [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Document-level authorship attribution approaches include using static embeddings as input to
Siamese networks trained with contrastive loss to perform classification [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Exploration</title>
      <p>
        The most notable observation from our data exploration is the class imbalance. Only 19.9% and 20.4%
of sentence pairs in the combined-difficulty training and validation sets, respectively, are instances of
a style change. To investigate the significance of this class imbalance, we constructed a 50/50 class-
balanced training set. This balanced training set was augmented with problems randomly chosen from
the PAN 2024 Multi-Author Style Analysis task dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For both the balanced and the original
imbalanced training sets, we extract sentence embeddings with the pre-trained all-MiniLM-L12-v2
model and train a feed-forward neural network (FFNN) as a binary classifier. We do not fine-tune the
embedding model at all, only the FFNN. The validation set metrics for both training runs are shown in
Table 1; we evaluate each run on the original imbalanced validation set.
      </p>
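      <p>As a rough sketch of this frozen-encoder probe (the layer sizes, learning rate, and helper names here are illustrative assumptions, not our tuned configuration), the setup looks like the following:</p>
      <preformat>
# Sketch: frozen sentence embeddings feeding an FFNN binary classifier.
# Layer sizes and the learning rate are illustrative assumptions.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

def embed_pairs(sents1, sents2):
    # The encoder stays frozen; we only pre-compute 384-d embeddings.
    e1 = encoder.encode(sents1, convert_to_tensor=True)
    e2 = encoder.encode(sents2, convert_to_tensor=True)
    return torch.cat([e1, e2], dim=1)  # (N, 768) pair features

ffnn = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),  # single logit: style change vs. no change
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(ffnn.parameters(), lr=1e-3)
      </preformat>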
    </sec>
    <sec id="sec-4">
      <title>4. The BertStyleNN</title>
      <p>We introduce the BertStyleNN, our custom neural-network-based model, which contains a binary
sequence classification head and is implemented in PyTorch. The code and links to download our
trained models from HuggingFace can be found at https://github.com/denizbt/pan-styleAnalysis25.</p>
      <p>In this section, we describe the architecture and training process for BertStyleNN models.</p>
      <sec id="sec-4-1">
        <title>4.1. Model Architecture</title>
        <p>A BertStyleNN has two parts: a transformer encoder for feature extraction and a FFNN for binary
classification. BertStyleNN supports a variety of pre-trained SentenceTransformers models and
general feature extractors as its encoder. No architectural changes are made to any pre-trained encoder;
it is only fine-tuned.</p>
        <p>The architecture of the FFNN is relatively straightforward and is the same for every encoder model.
It consists of 4 hidden layers with ReLU activation functions, a 1D BatchNorm layer, and a Dropout
layer with p = 0.4. The details of the architecture were determined from experimentation with the
all-MiniLM-L12-v2 sentence embedding model; each sentence pair in the PAN dataset (all difficulties
combined) was embedded using the all-MiniLM-L12-v2 model out-of-the-box and then used to train
the FFNN. The architecture that resulted in the highest validation macro F1 was chosen.</p>
        <p>The forward pass of a BertStyleNN proceeds as follows. The pair of sentences to check for a style
change is passed in as input. Then, BertStyleNN extracts embeddings independently for each
sentence using its encoder, concatenates the embeddings, and finally applies the FFNN to get a
one-dimensional output for the binary classification. The complete architecture for BertStyleNN is shown
in Figure 1.</p>
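        <p>A minimal PyTorch sketch of this forward pass follows; the hidden size, the generic AutoModel wrapper, and the reduced layer count are illustrative assumptions (the full implementation is in our repository):</p>
        <preformat>
# Sketch of the BertStyleNN forward pass: encode each sentence, mean-pool,
# concatenate, classify. Dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertStyleNNSketch(nn.Module):
    def __init__(self, encoder_name, hidden=256, p_drop=0.4):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)  # fine-tuned end-to-end
        dim = self.encoder.config.hidden_size
        self.ffnn = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.BatchNorm1d(hidden), nn.Dropout(p=p_drop),
            nn.Linear(hidden, 1),  # output projection to a single logit
        )

    def embed(self, sentences):
        batch = self.tokenizer(sentences, padding=True, truncation=True,
                               return_tensors="pt")
        out = self.encoder(**batch).last_hidden_state  # (N, T, dim)
        mask = batch["attention_mask"].unsqueeze(-1)   # (N, T, 1)
        return (out * mask).sum(1) / mask.sum(1)       # masked mean pooling

    def forward(self, sents1, sents2):
        pair = torch.cat([self.embed(sents1), self.embed(sents2)], dim=1)
        return self.ffnn(pair).squeeze(-1)             # one logit per pair
        </preformat>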
        <sec id="sec-4-1-1">
          <title>Sentence 1</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Sentence 2</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>Encoder</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>Mean Pooling</title>
        </sec>
        <sec id="sec-4-1-5">
          <title>Embedding 2</title>
        </sec>
        <sec id="sec-4-1-6">
          <title>Mean Pooling</title>
        </sec>
        <sec id="sec-4-1-7">
          <title>Embedding 1</title>
        </sec>
        <sec id="sec-4-1-8">
          <title>FFNN</title>
        </sec>
        <sec id="sec-4-1-9">
          <title>Concatenation</title>
        </sec>
        <sec id="sec-4-1-10">
          <title>Linear</title>
        </sec>
        <sec id="sec-4-1-11">
          <title>ReLU</title>
        </sec>
        <sec id="sec-4-1-12">
          <title>BatchNorm Dropout (p=0.4)</title>
        </sec>
        <sec id="sec-4-1-13">
          <title>Output Projection (logits)</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training</title>
        <p>Training a BertStyleNN involves simultaneously fine-tuning a pre-trained encoder model and training
a FFNN for classification (i.e. end-to-end training).</p>
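        <p>Concretely, "end-to-end" means a single optimizer updates the encoder and the FFNN head together. A sketch under assumed names (the BertStyleNNSketch above, a hypothetical train_loader, and a placeholder learning rate; Appendix A lists the actual hyperparameters):</p>
        <preformat>
# End-to-end training sketch: one optimizer covers encoder and FFNN parameters.
model = BertStyleNNSketch("sentence-transformers/all-MiniLM-L12-v2")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # placeholder lr
loss_fn = torch.nn.BCEWithLogitsLoss()

for sents1, sents2, labels in train_loader:  # labels: 1 = style change
    optimizer.zero_grad()
    logits = model(sents1, sents2)
    loss = loss_fn(logits, labels.float())
    loss.backward()  # gradients flow through the FFNN and the encoder alike
    optimizer.step()
        </preformat>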
        <p>
          We select and fine-tune five different pre-trained encoder/sentence embedding models as the encoders
for the BertStyleNN, listed below. All models are downloaded from HuggingFace.
• roberta-base [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] improves upon BERT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] by training it on more data, using dynamic masking,
and removing the next sentence prediction task. It was chosen due to its popularity and high
performance as a general feature extractor.
• microsoft/deberta-base [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] achieves higher performance compared to BERT and RoBERTa
by using disentangled attention, which uses two separate vectors for position and content, and by
improving the decoding for the masked LM task. This model was also chosen for its popularity
and high performance on natural language understanding tasks.
• sentence-transformers/all-MiniLM-L12-v2 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] was fine-tuned with a contrastive similarity
objective from the pre-trained microsoft/MiniLM-L12-H384-uncased model [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. As of the
writing of this paper, it is the fourth highest performing model for sentence embeddings in the
SentenceTransformers library [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
• sentence-transformers/all-mpnet-base-v2 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] was fine-tuned with a self-supervised
contrastive learning objective from the microsoft/mpnet-base model [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. It is currently the highest
performing model for sentence embeddings in the SentenceTransformers library [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
• sentence-transformers/sentence-t5-base [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is a PyTorch version of the encoder of a T5-base
model [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. It was chosen to add to the diversity of our set of models.
        </p>
        <p>Our training and validation sets are a combination of the data from all three difficulty levels. We
make no other alterations or augmentations to the data. We choose to use a combined training set since
each difficulty-level subset is too small on its own.</p>
        <p>We holistically select different hyperparameters and learning schedules for every encoder model
(see Appendix A for the choices). We also conduct a linear search for the best probability prediction
threshold to apply to the output and choose the best epoch for each model based on the macro F1.
It is important to note that while the training hyperparameters differ, the architecture of the FFNN
(including hidden layer dimensions) remains the same for all encoder models. Table 2 displays
the validation performance for each fine-tuned model.</p>
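        <p>The threshold search itself is a plain linear sweep over candidate cut-offs, keeping the macro-F1 maximizer; a sketch (the grid granularity here is our assumption):</p>
        <preformat>
# Sketch: linear search for the prediction threshold maximizing macro F1.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(probs, labels, grid=np.linspace(0.1, 0.9, 81)):
    # grid granularity is an assumption, not fixed by the method
    scores = [f1_score(labels, np.greater_equal(probs, t).astype(int),
                       average="macro") for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]
        </preformat>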
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ensembling</title>
        <p>At this point, we have trained several BertStyleNN models on the combined-difficulty dataset. We
now turn our attention to finding the best ensemble model for each difficulty level.</p>
        <p>We experiment with three ensembling methods: majority voting, unweighted average of output
probabilities, and unweighted average of output logits. For each difficulty level, we test all three methods
on the validation set for every subset of trained models of size three or more. We report the metrics for
the highest performing subset and method for each difficulty level in Table 3. Figure 2 illustrates our
complete system pipeline, including ensembling: the Ensemble-BertStyleNN.</p>
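        <p>A sketch of the three ensembling methods and the exhaustive subset search (tensor shapes and function names are illustrative assumptions):</p>
        <preformat>
# logits: tensor of shape (k, N), stacked outputs of k BertStyleNN models.
import itertools
import torch

def ensemble_predict(logits, method, thr=0.5):
    probs = torch.sigmoid(logits)                     # (k, N)
    if method == "majority":
        votes = torch.gt(probs, thr).float().mean(0)  # fraction of 1-votes
        return torch.gt(votes, 0.5).long()
    if method == "avg-probs":
        return torch.gt(probs.mean(0), thr).long()
    if method == "avg-logits":
        return torch.gt(torch.sigmoid(logits.mean(0)), thr).long()
    raise ValueError(method)

# Exhaustive search: every model subset of size 3 or more, times 3 methods.
def all_candidates(model_ids):
    for r in range(3, len(model_ids) + 1):
        for subset in itertools.combinations(model_ids, r):
            for method in ("majority", "avg-probs", "avg-logits"):
                yield subset, method
        </preformat>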
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>For the final system submission, we use the ensemble models along with the prediction thresholds that
performed best on the validation set; details for the ensemble used for each difficulty level are given in
Table 3. The results of our Ensembled-BertStyleNN approach on the hidden test set are in Table 4.
Our system significantly outperforms the naive baseline of predicting the majority class (0).</p>
      <p>[Figure 2: The Ensemble-BertStyleNN pipeline. Multiple BertStyleNNs (each encoding Sentence 1 and Sentence 2 via the encoder, mean pooling, concatenation, and the FFNN of Linear, ReLU, BatchNorm, Dropout (p=0.4), and an output projection to logits) feed an ensemble model (avg-probs or avg-logits), followed by the validation prediction threshold.]</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper describes an ensemble model system for the Multi-Author Style Analysis task. We fine-tune
and ensemble new BertStyleNN models with five distinct pre-trained encoder models and a FFNN for
binary classification. Our final system, Ensembled-BertStyleNN, achieves 0.8 macro F1 averaged over
the three difficulty levels, indicating promise for ensemble transformer model approaches to the style
analysis task.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Training Details</title>
      <p>In this section, we provide more details about our training parameters. For all training runs, we use
nn.BCEWithLogitsLoss with the pos_weight parameter set to 0.8/0.2, i.e. the approximate imbalance
between positive (1) and negative (0) labels in the train and validation sets. The pos_weight parameter
penalizes false negatives (predicting 0 on true label 1) more harshly than false positives (predicting 1
on true label 0), encouraging the model to predict more 1s. This mitigates some of the negative effects
of the imbalanced training data.</p>
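      <p>In PyTorch terms, this weighting corresponds to a one-line sketch (the tensor value is simply the 0.8/0.2 ratio described above):</p>
      <preformat>
import torch

# pos_weight upweights the positive (style change) class by roughly 4x,
# matching the approximate 80/20 negative/positive label imbalance.
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.8 / 0.2]))
      </preformat>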
      <p>Table 5 shows the complete list of hyperparameters used in training all models. Additionally, we
used the AdamW optimizer, a consistent batch size of 16, and mean pooling of the encoder output for
all models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Bevendorff, D. Dementieva, M. Fröbe, B. Gipp, A. Greiner-Petter, J. Karlgren, M. Mayerl, P. Nakov, A. Panchenko, M. Potthast, A. Shelmanov, E. Stamatatos, B. Stein, Y. Wang, M. Wiegmann, E. Zangerle, Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection, in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2025.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Multi-Author Writing Style Analysis Task at PAN 2025, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236-241.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Dubey, Capturing Style Through Large Language Models - An Authorship Perspective, 2024. URL: https://hammer.purdue.edu/articles/thesis/Capturing_Style_Through_Large_Language_Models_-_An_Authorship_Perspective/27947904. doi:10.25394/PGS.27947904.v1.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Multi-Author Writing Style Analysis Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes Papers of the CLEF 2024 Evaluation Labs, CEUR-WS.org, 2024, pp. 2513-2522. URL: http://ceur-ws.org/Vol-3740/paper-222.pdf.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] J. Lv, Y. Yi, H. Qi, Team Fosu-stu at PAN: Supervised fine-tuning of large language models for Multi-Author Writing Style Analysis, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes Papers of the CLEF 2024 Evaluation Labs, CEUR-WS.org, 2024, pp. 2781-2786. URL: http://ceur-ws.org/Vol-3740/paper-265.pdf.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, 2021. URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Lin, Y. Wu, L. Lee, Team NYCU-NLP at PAN 2024: Integrating Transformers with Similarity Adjustments for Multi-Author Writing Style Analysis, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes Papers of the CLEF 2024 Evaluation Labs, CEUR-WS.org, 2024, pp. 2716-2721. URL: http://ceur-ws.org/Vol-3740/paper-255.pdf.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, PAN24 multi-author writing style analysis, 2024. URL: https://doi.org/10.5281/zenodo.10677876. doi:10.5281/zenodo.10677876.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL: https://arxiv.org/abs/1907.11692. arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, 2021. URL: https://arxiv.org/abs/2006.03654. arXiv:2006.03654.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Sentence-Transformers, all-MiniLM-L12-v2, https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2, 2024. Accessed: 2024-05-30.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020. arXiv:2002.10957.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Sentence-Transformers, Pretrained models documentation, http://sbert.net/docs/sentence_transformer/pretrained_models.html, 2024. Accessed: 2024-05-30.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Sentence-Transformers, all-mpnet-base-v2, https://huggingface.co/sentence-transformers/all-mpnet-base-v2, 2024. Accessed: 2024-05-30.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Microsoft, microsoft/mpnet-base, https://huggingface.co/microsoft/mpnet-base, 2024. Accessed: 2024-05-30.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Ni, G. H. Ábrego, N. Constant, J. Ma, K. B. Hall, D. Cer, Y. Yang, Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models, 2021. URL: https://arxiv.org/abs/2108.08877. arXiv:2108.08877.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL: https://arxiv.org/abs/1910.10683. arXiv:1910.10683.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>