<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Madrid, Spain
* Corresponding author.
† These authors contributed equally.
Email: ewelina.ksiezniak@ue.poznan.pl (E. Księżniak); krzysztof.wecel@ue.poznan.pl (K. Węcel); marcin.sawinski@ue.poznan.pl (M. Sawiński)
URL: https://kie.ue.poznan.pl/en/ (E. Księżniak)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>OpenFact at PAN 2025: Punctuation-Guided Pretraining for Sentence-Level Style Change Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ewelina Księżniak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Węcel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcin Sawiński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Systems, Poznań University of Economics and Business</institution>
          ,
          <addr-line>Al. Niepodległości 10, 61-875 Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents our approach to the PAN 2025 shared task on multi-author style change detection. The task involves identifying sentence-level boundaries where the writing style changes, presumably due to a switch in authorship. Motivated by the stylistic nature of the task, we propose a method based on intermediate-task learning. Specifically, we first perform contrastive pretraining of encoders using auxiliary tasks focused on detecting the presence of stylistic punctuation features, such as question marks and quotation marks, in order to enhance the encoder's sensitivity to fine-grained stylistic variation. These pretrained models are then fine-tuned on the main style change detection task. Additionally, we conducted error analysis and probing experiments to assess the stylistic awareness of the learned representations.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2025</kwd>
        <kwd>style change detection</kwd>
        <kwd>intermediate-task learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Intermediate-task learning is a transfer learning strategy in which a pretrained language model is
further fine-tuned on an auxiliary task before being adapted to the target task. Two main paradigms
are typically distinguished: sequential learning, where the model is trained on the intermediate task and
then fine-tuned on the target task, and multitask learning, where both tasks are learned jointly, often
with task-specific heads or shared representations. Sequential learning aims to transfer task-relevant
skills in stages, while multitask learning promotes shared generalization across tasks [3, 4].
Contrastive learning is an approach for learning text representations by encouraging similar samples
to be closer in embedding space while pushing dissimilar ones apart. Supervised contrastive learning
(SCL) extends this idea by leveraging label information to define positive pairs. It encourages the model
to group together representations of samples from the same class [5].</p>
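<p>As an illustration, the supervised contrastive objective of [5] can be sketched in plain Python (a minimal batch version for exposition; variable names and the toy setup are ours, not taken from any reference implementation):</p>

```python
import math

def normalize(v):
    # Project onto the unit sphere; contrastive losses compare directions.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def supcon_loss(embeddings, labels, temperature=0.2):
    """Supervised contrastive loss (Khosla et al., 2020), batch version.

    Positives for anchor i are all other samples sharing its label;
    every other sample in the batch appears in the denominator.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    sim = [[sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)] for i in range(n)]
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(sim[i][j]) for j in range(n) if j != i)
        total += -sum(math.log(math.exp(sim[i][p]) / denom)
                      for p in positives) / len(positives)
        anchors += 1
    return total / max(anchors, 1)
```

<p>Pairs sharing a label are pulled together on the unit sphere while all other pairs are pushed apart; the temperature parameter sharpens the similarity distribution.</p>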
      <p>Probing is a widely used technique for analyzing the internal representations of pretrained language
models, aiming to assess whether specific linguistic properties are encoded in their hidden states. The
basic method involves training lightweight classifiers, known as probes, on top of frozen model layers
to predict linguistic features such as part-of-speech tags, syntactic structures, or semantic roles [6].
To validate that the probe truly reflects model knowledge rather than overfitting to surface patterns,
control tasks and diagnostic classifiers are often employed [ 7]. Probing has been extensively applied
to examine a range of linguistic phenomena, including morphology, coreference, syntactic depth, and
increasingly, stylistic aspects such as punctuation, capitalization, and sentence length.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Analysis and Preparation</title>
      <p>To prepare the data for our experiments, we utilized the official datasets provided for the 2025 edition
of the style change detection task. Each dataset contains comments sourced from the Reddit platform
and corresponds to one of three difficulty levels—Easy, Medium, and Hard—each divided into training
and validation subsets. In the Easy subset, documents span a wide range of topics, allowing models to
leverage topical cues for authorship change detection. The Medium subset includes documents with limited
topical diversity, requiring greater reliance on stylistic features. In the Hard subset, all sentences share
the same topic, making style the primary discriminative signal [1]. The task is formulated as a binary
classification problem, where the label 1 denotes a sentence boundary at which the author changes, and
0 indicates no change in authorship. The label distributions for each subset are summarized in Table 1.</p>
      <p>Due to the significant class imbalance—where negative examples (label 0) greatly outnumber positive
ones (label 1)—we applied random undersampling to the training datasets to create more balanced
class distributions during model training. Additionally, to enable internal model evaluation during
development, we randomly sampled 20% of each validation set to construct an internal test set.</p>
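<p>These two preparation steps can be sketched as follows (function names and the fixed seed are ours, for illustration only):</p>

```python
import random

def undersample(pairs, labels, seed=42):
    """Randomly drop majority-class examples until both labels are equally frequent."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [pairs[i] for i in kept], [labels[i] for i in kept]

def split_internal_test(items, frac=0.2, seed=42):
    """Hold out a random fraction of a validation set as an internal test set."""
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)
    cut = int(len(items) * frac)
    return [items[i] for i in idx[cut:]], [items[i] for i in idx[:cut]]
```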
      <p>To justify our approach, we investigated the extent to which authorship change correlates with shifts
in specific punctuation patterns. For each pair of consecutive sentences, we examined whether both
sentences contained any of the following features: ellipses (...), question marks (?), exclamation marks
(!), quotation marks ("), or fully capitalized words (e.g., IMPORTANT). We applied the chi-squared test
of independence to evaluate whether the distribution of stylistic punctuation changes is associated
with authorship transitions. The null hypothesis (H0) stated that the occurrence of a given punctuation
feature change is independent of whether an authorship change occurred. The alternative hypothesis
(H1) posited a dependency between these variables [8].</p>
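<p>A minimal standard-library version of this test for a 2×2 contingency table is sketched below (our reconstruction; the original analysis may have used a statistics package). With one degree of freedom, the p-value equals erfc(√(χ²/2)):</p>

```python
import math

def chi2_2x2(table):
    """Chi-squared test of independence for a 2x2 contingency table.

    Rows: authorship change (no / yes); columns: punctuation-feature
    change (no / yes). Returns (chi2 statistic, p-value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    chi2 = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            exp = row[i] * col[j] / n          # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    p = math.erfc(math.sqrt(chi2 / 2))         # survival function, df = 1
    return chi2, p
```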
      <p>The results showed statistically significant associations (p &lt; 0.05) between authorship change and
selected stylistic features: for the easy dataset, significant dependencies were observed for quotation
marks, ellipses, question marks, and capitalized words. In the hard and medium datasets, significant results were
found for quotation marks, ellipses, question marks, exclamation marks, and capitalized words.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Baseline Selection</title>
      <p>To establish an internal baseline, we fine-tuned two pretrained language models—xlm-roberta-base
and mdeberta-v3-base—independently for each difficulty level of the dataset. For each model,
we experimented with three learning rates: 1e-5, 2e-5, and 3e-5, and for each learning rate, we
ran training with three different random seeds to ensure robustness. The data preparation involved
concatenating each pair of consecutive sentences using a separator token to form the model input.
All models were fine-tuned with a batch size of 16 for xlm-roberta-base and a batch size of 4 for
mdeberta-v3-base, for up to 10,000 steps. We applied early stopping with a patience of 2, based
on the F1 score on the validation set. Based on the results of our initial experiments, we selected
the following configurations as baselines for each difficulty level: for both the easy and medium
subsets, we used xlm-roberta-base with a learning rate of 2e-5; for the hard subset, we used
the same model with a learning rate of 3e-5. All subsequent experiments reported in this paper
were conducted using these configurations for the respective subsets.</p>
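<p>The sweep described above amounts to 18 runs per difficulty level; it can be sketched as follows (the seed values are illustrative, as the concrete seeds are not reported):</p>

```python
from itertools import product

# 2 models x 3 learning rates x 3 seeds = 18 baseline runs per difficulty level.
MODELS = ["xlm-roberta-base", "mdeberta-v3-base"]
LEARNING_RATES = [1e-5, 2e-5, 3e-5]
SEEDS = [0, 1, 2]  # illustrative; the actual seed values were not reported

def baseline_runs():
    """Enumerate every baseline fine-tuning configuration described above."""
    return [
        {
            "model": model,
            "lr": lr,
            "seed": seed,
            "batch_size": 16 if model == "xlm-roberta-base" else 4,
            "max_steps": 10_000,
            "early_stopping_patience": 2,
        }
        for model, lr, seed in product(MODELS, LEARNING_RATES, SEEDS)
    ]
```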
    </sec>
    <sec id="sec-5">
      <title>5. Supervised Contrastive Pretraining</title>
      <p>To sensitize the encoder to subtle stylistic cues, we explored pretraining using a supervised contrastive
learning objective. First, we extracted sentences from the training subsets (separately for each difficulty
level: Easy, Medium, and Hard) that contained at least one of the following stylistic markers: question
marks (?), exclamation marks (!), quotation marks ("), ellipses (...), and capitalized words (entire
tokens in uppercase). Table 2 summarizes the number of sentences per feature and difficulty level.</p>
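<p>The marker detection can be sketched with simple regular expressions (the exact patterns are our reconstruction, not the original extraction code):</p>

```python
import re

# One detector per stylistic marker; exact patterns are our reconstruction.
FEATURES = {
    "question": re.compile(r"\?"),
    "exclamation": re.compile(r"!"),
    "quote": re.compile(r"\""),
    "ellipsis": re.compile(r"\.\.\.|\u2026"),  # "..." or the single-char ellipsis
    "capslock": re.compile(r"\b[A-Z]{2,}\b"),  # entire tokens in uppercase
}

def stylistic_profile(sentence):
    """Return the set of stylistic markers present in a sentence."""
    return {name for name, pat in FEATURES.items() if pat.search(sentence)}
```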
      <p>For each feature, we constructed a binary contrastive dataset consisting of 2,000 sentence pairs: 1,000
labeled as 0 (similar), where both sentences either contain or lack the target feature, and 1,000 labeled
as 1 (dissimilar), where the feature appears in only one of the two sentences. These datasets were
constructed independently for each difficulty level and each stylistic feature, resulting in 15 datasets in
total (5 features × 3 levels). Each dataset was used to train a dedicated contrastive encoder based on
xlm-roberta-base, resulting in 15 encoders—each specializing in one stylistic feature and difficulty
level. The models were trained using the supervised contrastive loss. All encoders were trained for 10
epochs with a temperature scaling parameter of 0.2 and a batch size of 16.</p>
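<p>The pair construction for one feature and one difficulty level can be sketched as follows (the sampling strategy is our reconstruction; only the label semantics and the 1,000-per-class counts come from the text):</p>

```python
import random

def build_contrastive_pairs(sentences, has_feature, n_per_class=1000, seed=42):
    """Build a binary contrastive dataset for one stylistic feature.

    Label 0 (similar): both sentences contain, or both lack, the feature.
    Label 1 (dissimilar): exactly one of the two sentences contains it.
    """
    rng = random.Random(seed)
    with_f = [s for s in sentences if has_feature(s)]
    without_f = [s for s in sentences if not has_feature(s)]
    pairs = []
    for _ in range(n_per_class):
        # Similar pair: sample both sides from the same pool.
        pool = with_f if rng.random() < 0.5 else without_f
        pairs.append((rng.choice(pool), rng.choice(pool), 0))
        # Dissimilar pair: one side from each pool.
        pairs.append((rng.choice(with_f), rng.choice(without_f), 1))
    rng.shuffle(pairs)
    return pairs
```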
    </sec>
    <sec id="sec-6">
      <title>6. Final Submission</title>
      <p>Subsequently, we fine-tuned each of the contrastively pretrained encoders on the main downstream task
of author style change detection. For each difficulty level (easy, medium, hard), we adopted the optimal
hyperparameter configurations identified during the baseline model selection phase, including learning
rate, batch size, and early stopping criteria. To further explore the generalizability and robustness of the
stylistic representations learned during contrastive pretraining, we fine-tuned each encoder multiple
times while freezing different numbers of its initial transformer layers: from zero (i.e., full fine-tuning)
up to four frozen layers.</p>
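<p>Layer freezing can be sketched over Hugging Face-style parameter names (the naming scheme, and the choice to freeze the embedding matrix together with the lower layers, are our assumptions):</p>

```python
import re

def layers_to_freeze(param_names, n_frozen):
    """Select the embedding parameters plus the first n_frozen transformer
    layers of an xlm-roberta-base-style encoder; n_frozen = 0 means full
    fine-tuning (nothing is frozen)."""
    frozen = []
    for name in param_names:
        if n_frozen > 0 and name.startswith("embeddings."):
            frozen.append(name)
            continue
        m = re.match(r"encoder\.layer\.(\d+)\.", name)
        if m and int(m.group(1)) < n_frozen:
            frozen.append(name)
    return frozen
```

<p>In PyTorch, one would then set requires_grad = False on each returned parameter before fine-tuning.</p>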
      <p>[Figure: F1-macro scores across models and freeze levels (easy subset). The chart compares the baseline with encoders contrastively pretrained on each stylistic feature (quote, question, ellipsis, capslock) at freeze levels 0 through 4.]</p>
    </sec>
    <sec id="sec-7">
      <title>7. Results on Internal Test Dataset</title>
      <p>The charts illustrate the F1-macro scores achieved by both the baseline and the various fine-tuned models
built on contrastively pretrained encoders with different numbers of frozen transformer layers, evaluated
on an internal test set. The most substantial performance gain was observed in the hard subset,
where the baseline F1-macro was 71.14%, and the best-performing model—fine-tuned from an encoder
pretrained to discriminate between sentence pairs containing question marks, with no frozen layers—
achieved 76.74%, a relative improvement of about 7.9%. A moderate improvement was found for
the medium subset, where the baseline scored 78.14%, and the top result (79.79%) was obtained by
fine-tuning an encoder pretrained on quotation detection with the three lower layers frozen. In contrast,
the smallest improvement occurred in the easy subset, where the best model outperformed the baseline
(91.01%) by only 0.01 percentage points, using a quotation-sensitive encoder with no frozen layers.</p>
      <p>This pattern is intuitive given the nature of the data: in the easy subset, which likely features more
topic diversity and simpler sentence structures, semantic cues dominate over subtle stylistic signals.
Meanwhile, the hard subset likely benefits more from encoders sensitive to fine-grained stylistic
patterns. Nevertheless, despite these promising tendencies—particularly for the hard condition—the
method exhibits instability. Multiple instances were observed where the same pretrained encoder,
when fine-tuned with a different number of frozen layers, led to significant performance degradation,
highlighting the need for careful hyperparameter control.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Final Submission</title>
      <p>For the final submission, we selected three models based on their performance across the internal
validation sets. For the easy subset, we submitted a model fine-tuned on a contrastively pretrained
encoder sensitive to quotation marks, with no frozen layers. For the medium subset, we used a model
pretrained on question mark discrimination with two lower layers frozen. Notably, although a model
pretrained on quotation detection with three frozen layers achieved the best performance on our
internal test set for the medium subset, we opted for the question-mark model due to technical issues
encountered during final deployment. For the hard subset, we submitted a model fine-tuned on a
question-mark-sensitive encoder without any frozen layers. The evaluation results on the official test
set provided by the competition organizers are presented in Table 3.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Error Analysis and Model Probing</title>
      <p>To assess whether intermediate pretraining improves model sensitivity to specific stylistic cues, we
conducted an error analysis using our final submission models. For each subset (easy, medium, hard),
the internal test set was partitioned into sentence pairs based on the presence of a target stylistic
feature (e.g., question or quotation marks): both sentences containing the feature, neither containing
it, or only one. As shown in the heatmaps, for the hard subset we observed improvements across all
conditions—including pairs where neither sentence contained the relevant marker. This may indicate
that the pretraining phase enhanced the encoder’s ability to capture not only the targeted punctuation
(e.g., question marks), but also broader stylistic or semantic traits. In the easy subset, the most notable
gain (nearly 7 percentage points) occurred in pairs where both sentences contained quotation marks,
while in the medium subset a moderate improvement (about 2 points) was observed specifically in
cases where only one sentence included the target punctuation. These findings suggest that the benefits
of stylistic pretraining may generalize beyond the pretraining signal itself, but manifest differently
depending on task difficulty and input structure.</p>
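<p>The partitioning used for this error analysis can be sketched as follows (function and bucket names are ours):</p>

```python
def partition_by_feature(pairs, has_feature):
    """Split (sentence1, sentence2, label) triples into the three
    error-analysis conditions: both sentences contain the target
    stylistic feature, neither does, or exactly one does."""
    buckets = {"both": [], "neither": [], "only_one": []}
    for first, second, label in pairs:
        a, b = has_feature(first), has_feature(second)
        key = "both" if a and b else "neither" if not (a or b) else "only_one"
        buckets[key].append((first, second, label))
    return buckets
```

<p>Per-condition F1 is then computed separately on each bucket for the baseline and the intermediate model.</p>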
      <p>F1 Score Heatmap (Easy Subset)</p>
      <p>[Figure: F1 score heatmap for the easy subset (panel a), comparing the baseline and intermediate models across three conditions: both sentences contain '"', neither sentence contains '"', and only one sentence contains '"'.]</p>
      <p>Building on these findings, we carried out a complementary model probing experiment to further
investigate whether the proposed approach enhances the model’s sensitivity to specific stylistic features.
The task was formulated as binary classification: determining whether a sentence contains a specific
punctuation mark. For each stylistic marker, we trained separate logistic regression classifiers using
LogisticRegression(max_iter=1000, random_state=42). The results, expressed as F1-scores
for the positive class (F1-pos), are presented in Table 4.</p>
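<p>The probing setup can be sketched as follows. The paper fits scikit-learn's LogisticRegression(max_iter=1000, random_state=42) on frozen sentence embeddings; for a self-contained illustration we stand that in with a plain gradient-descent probe using the same decision rule, together with the F1-pos metric:</p>

```python
import math

def train_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on frozen sentence embeddings.

    (A stdlib stand-in for scikit-learn's LogisticRegression, trained
    here by stochastic gradient descent on the log loss.)
    """
    dim = len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def f1_pos(y_true, y_pred):
    """F1 for the positive class, the metric reported in Table 4."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

<p>A separate probe is trained per stylistic marker and per embedding source, so F1-pos differences isolate how much of the marker is linearly recoverable from each representation.</p>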
      <p>We compared three model configurations: (1) XLM-RoBERTa base, representing sentence
embeddings from the unmodified base model; (2) target fine-tuning, where embeddings were taken from a
model fine-tuned exclusively on the author style change task; and (3) pretrained encoder fine-tuned,
where the encoder was first subjected to intermediate contrastive pretraining on a stylistic signal and
subsequently fine-tuned on the main task (i.e., the final submission models). For both the easy and
hard subsets, the intermediate learning approach led to a notable increase in embedding-level stylistic
signal, outperforming standard fine-tuning not only for the pretraining-related feature, but also for other
stylistic tasks. Surprisingly, this effect was not replicated in the medium subset: here, target fine-tuning
embeddings generally yielded higher or equivalent stylistic separability compared to the intermediate
learning setup.</p>
    </sec>
    <sec id="sec-10">
      <title>10. Conclusion and Future Work</title>
      <p>Our experiments demonstrate that contrastive intermediate-task pretraining focused on stylistic
punctuation features can enhance encoder sensitivity to fine-grained authorial variation, particularly in
more challenging subsets of the style change detection task. While the improvements over the baseline
are promising, especially for the hard dataset, our results also reveal performance instability across
fine-tuning configurations—particularly with varying numbers of frozen layers—which suggests that
encoder robustness remains a concern. This variability indicates a need for more systematic
regularization or architectural calibration. As a potential direction for future work, we plan to explore ensemble
methods that combine multiple stylistically specialized encoders. Such ensembles could help mitigate
individual encoder fluctuations while leveraging complementary stylistic representations to further
improve detection accuracy.</p>
    </sec>
    <sec id="sec-11">
      <title>11. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT to check grammar, spelling, and style.
The tool was applied to selected paragraphs, and all corrections were manually reviewed and approved.</p>
      <p>[4] Y. Pruksachatkun, J. Phang, H. Liu, P. M. Htut, X. Zhang, R. Y. Pang, C. Vania, R. T. McCoy,
S. R. Bowman, Intermediate-task transfer learning with pretrained language models: When and why
does it work?, in: Proceedings of ACL, 2020.
[5] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan,
Supervised contrastive learning, in: Advances in Neural Information Processing Systems (NeurIPS), 2020.
[6] A. Conneau, D. Kiela, What you can cram into a single vector: Probing sentence embeddings for
linguistic properties, in: Proceedings of ACL, 2018.
[7] J. Hewitt, P. Liang, Designing and interpreting probes with control tasks, in: Proceedings of EMNLP, 2019.
[8] R. J. Tallarida, R. B. Murray, Chi-square test, in: Manual of Pharmacologic Calculations: With
Computer Programs, 1987, pp. 140–142.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Generative AI Authorship Verification, Multi-Author Writing Style Analysis, Multilingual Text Detoxification, and Generative Plagiarism Detection</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science</source>
          , Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the Multi-Author Writing Style Analysis Task at PAN 2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Phang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Févry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks</article-title>
          , arXiv preprint arXiv:1811.01088,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>