<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Team OpenFact at PAN 2024: Fine-Tuning BERT Models with Stylometric Enhancements</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ewelina Księżniak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Węcel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcin Sawiński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Poznań University of Economics and Business</institution>
          ,
          <addr-line>Al. Niepodległości 10, 61-875 Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This paper presents our solution for the Multi-Author Style Change Detection task at PAN 2024. The task involves detecting paragraph-level writing style changes in texts, with datasets classified into easy, medium, and hard difficulty levels. We incorporated stylometric tags directly into the text to enhance the sensitivity of BERT-family models to stylistic features. By adding these tags to the training dataset, our approach aimed to improve the model's detection of authorship changes and its sensitivity to stylometric features. The results showed F1 improvements when training on smaller datasets, indicating the method's potential for hard-to-obtain data types.</p>
      </abstract>
      <kwd-group>
        <kwd>Stylometric Analysis</kwd>
        <kwd>Style Change Detection</kwd>
        <kwd>BERT Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Multi-author style change detection has been a task organized by PAN since 2016 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Prior to the advent
of BERT models, style-change detection techniques predominantly relied on traditional stylometric
features, including lexical elements (n-grams), word frequencies, and syntactic characteristics like parts
of speech or syntactic trees. For instance, in 2018, the leading approach for the cross-domain authorship
attribution task involved text distortion and extraction of character n-grams to emphasize punctuation,
numbers, and diacritic characters [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as demonstrated by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The top performance in the 2018 style
detection task was achieved by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], who utilized features such as repetition, contracted wordforms,
frequent words, quotation marks, vocabulary richness, and readability to train an ensemble classifier.
However, starting from 2020, most participants have submitted solutions by fine-tuning pre-trained
models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For example, in the 2023 edition, the highest accuracy on easy and medium datasets was
achieved using BERT, RoBERTa, and ELECTRA combined with a binary classification layer [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        This paper describes the solution submitted for the Multi-Author Style Change Detection task, part
of the PAN 2024 workshop series. This task aims to identify paragraph-level writing style changes
in a given text between consecutive paragraphs. It includes three levels of difficulty: easy, medium,
and hard [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For each subtask, distinct datasets were provided: 1) Easy – paragraphs cover various
topics, allowing topic information to aid in detecting authorship changes; 2) Medium – there is limited
topical variety, requiring a greater emphasis on stylistic differences; 3) Hard – all paragraphs cover the
same topic, thus relying solely on stylistic cues to identify changes. The entire dataset was in English
and sourced from comments on Reddit. It included metadata indicating the points of author change
between paragraphs, as well as the total number of authors within each set [7].
      </p>
      <p>Given the stylometric nature of the task and the importance of stylistics in detecting authorship
changes, we decided to employ a method that directly adds stylometric tags to the texts in the training
dataset used for fine-tuning BERT-family models. Our approach aims to enhance the model’s sensitivity
to stylistic features, acknowledging that in authorship change detection, semantic content alone can be
insufficient. This paper presents our methodology and findings, offering insights into the effectiveness of
the proposed stylometric enhancements in training language models for authorship change detection.
Additionally, it presents background studies aimed at determining whether the proposed method
enhances the sensitivity of BERT-family language models to stylometric features.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Background studies</title>
        <p>To explain the rationale behind the proposed method, we conducted an additional experiment to
determine which stylometric features are important in the author style change detection task. We
calculated stylometric features that describe text complexity, text formality, text fogginess, and patterns
related to punctuation and grammar. Subsequently, we computed the absolute values of differences
for specific features between text pairs created from consecutive paragraphs. We hypothesized that
absolute differences between pairs of texts would be smaller when there is no change in authorship and
larger in the case of authorship change.</p>
        <p>To assess the statistical significance of the differences in data distribution for each specific feature, we
employed the Mann-Whitney U test. This is a non-parametric test used to determine whether there is a
difference between two groups (i.e., those with changing and non-changing authorship) by comparing
the rank sums of the two samples rather than the means. The null hypothesis (H0) states that the
distributions of the two groups are identical, while the alternative hypothesis (H1) suggests that the
distributions differ [8]. We chose this test because our data did not follow a normal distribution, which
is required for the t-test for independent samples. Furthermore, we analyzed differences across two
dimensions: real labels (0 and 1) and predicted labels (0 and 1).</p>
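        <p>As a minimal sketch of this test, assuming SciPy and synthetic per-feature absolute differences (the real values come from the consecutive-paragraph pairs described above):</p>

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic absolute differences of one stylometric feature between
# consecutive-paragraph pairs, split by label; the values are made up
# for illustration only.
rng = np.random.default_rng(0)
no_change = rng.exponential(scale=1.0, size=200)  # label 0: same author
change = rng.exponential(scale=1.6, size=200)     # label 1: author change

# H0: both groups are drawn from the same distribution (rank-based test).
stat, p_value = mannwhitneyu(no_change, change, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4g}")
```

A small p-value would lead to rejecting H0, i.e. the feature's differences are distributed differently across the two groups.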
        <p>The background experiment was conducted on the validation datasets for the specific subtasks,
using fine-tuned RoBERTa models, which served as our internal baselines, and was based on the following
features: the number of sentences in the text, the average number of words per sentence, the count of
punctuation marks, the count of personal words, the count of reported speech, the formality score, the
Flesch Reading Ease score, the SMOG index, the Flesch-Kincaid Grade level, the Coleman-Liau index,
the Automated Readability Index, the count of difficult words, and the frequencies of nouns, verbs,
adjectives, adverbs, and prepositions.</p>
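        <p>The readability indices in this list have published closed-form definitions. As a rough illustration, two of them can be computed with naive tokenization; the splitting rules and the helper <italic>_counts</italic> are our simplification, not the implementation used in the experiments:</p>

```python
import re

def _counts(text):
    # Naive tokenization: words by whitespace, sentence ends by ./!/?,
    # character count restricted to letters.
    words = text.split()
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters = sum(ch.isalpha() for ch in text)
    return letters, len(words), sentences

def automated_readability_index(text):
    # ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43
    c, w, s = _counts(text)
    return 4.71 * c / w + 0.5 * w / s - 21.43

def coleman_liau_index(text):
    # CLI = 0.0588*L - 0.296*S - 15.8, where L is letters per 100 words
    # and S is sentences per 100 words.
    c, w, s = _counts(text)
    return 0.0588 * (100 * c / w) - 0.296 * (100 * s / w) - 15.8

sample = "The cat sat on the mat. The dog barked at the cat."
print(round(automated_readability_index(sample), 2))
print(round(coleman_liau_index(sample), 2))
```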
        <p>Table 1 presents normalized mean absolute differences obtained for the easy, medium, and hard validation
datasets for features that showed statistically significant differences across the authorship change (label:
1) and no authorship change (label: 0) groups within actual and predicted labels. Most measures
consistently exhibited higher absolute differences when there was a change in authorship, evident in
both real and predicted label distributions. Surprisingly, for the medium and hard datasets, higher
diversity was observed for some features without author changes. For the medium dataset, this was
seen in sentence complexity and the frequency of nouns and verbs. For the hard dataset, it was noted
in the frequency of nouns and adverbs, as well as the Coleman-Liau readability index.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System Overview</title>
      <sec id="sec-3-1">
        <title>3.1. Methodology</title>
        <p>Based on the results presented in [9], there are indications that models from the BERT family capture certain
stylometric features. Building on these observations, we developed a method that enriches the text
by directly incorporating stylometric tags. This approach aims to determine whether this enhancement can
improve model classification and make BERT-family models more sensitive to stylometric characteristics.</p>
        <p>The experiment was carried out in four phases:
1. Fine-tuning of models and selection of baseline models.
2. Feature engineering.
3. Training models with data augmented by stylometric tags on the entire dataset and on subsamples.
4. Conducting experiments by combining multiple tags within a single text.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Baseline models selection</title>
          <p>For each dataset variant (easy, medium, hard), we fine-tuned the RoBERTa base and DeBERTa v3 base
models in the initial phase and selected baseline models based on the results obtained. We experimented
with hyperparameters, including learning rates of 1e-5, 2e-5, and 3e-5, and seed values of 42, 100, and
1111. The default AdamW optimizer and a batch size of 4 were used. Additionally, we tested an approach
implementing layer-wise decay in the RoBERTa base model, applying different learning rates to each
layer to capture either general language information or task-specific details.</p>
          <p>Each model was trained for 5 epochs using the original dataset provided by the task organizers. The
data preparation involved concatenating two consecutive paragraphs with a separator token. After
completing this phase, we chose the fine-tuned RoBERTa-base model as the baseline for the second part
of the experiment. The training was conducted on a server equipped with four NVIDIA GeForce RTX
2080 Ti GPU cards, each with 11 GB of memory.</p>
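        <p>The data preparation step can be sketched as follows; the separator string and the field names are assumptions for illustration, not the organizers' exact format:</p>

```python
# Consecutive paragraphs are concatenated with a separator token, and each
# pair is labelled 1 if the author changes between them.
SEP = " </s> "  # RoBERTa-style separator (assumed; the exact token may differ)

def make_training_pairs(paragraphs, changes):
    """changes[i] == 1 iff the author changes between paragraph i and i+1."""
    assert len(changes) == len(paragraphs) - 1
    return [
        {"text": paragraphs[i] + SEP + paragraphs[i + 1], "label": changes[i]}
        for i in range(len(paragraphs) - 1)
    ]

pairs = make_training_pairs(
    ["First paragraph.", "Second paragraph.", "Third paragraph."],
    [0, 1],
)
print(pairs[1]["label"])
```

Each resulting pair is then a single classification example for the fine-tuned model.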
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Feature Engineering</title>
          <p>In the second iteration, we enhanced the original datasets by integrating stylometric feature tags. The
features were chosen based on a manual review of prediction errors from the baseline models.</p>
          <p>We decided to augment the dataset by adding tags related to the following stylometric dimensions:
• text complexity
• text formality
• punctuation.</p>
          <p>To quantify text complexity, we developed two metrics: text length, determined by counting the
number of sentences, and sentence complexity, calculated as the average number of words per
sentence.</p>
          <p>Text formality was evaluated using the method proposed by [9]. This method introduced “seed
words” with the same semantic meaning but different levels of formality. The seed pairs included, among
others: my gosh - Jesus, breathing - respiratory, yeah - yes, ten years - decade, first of all - foremost, a
whole bunch - full, and my dad - father. The method involved calculating the mean difference between
the word embeddings of each pair to create a “stylometric embedding”. The formality level of a text
was then determined by measuring the cosine similarity between the input text embedding and the
“stylometric embedding”. We also created features to measure text formality by analyzing reported
speech occurrences and personal style. The degree of personal style was measured by the frequency
of words used to express personal opinions or experiences (e.g., I, me, my).</p>
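        <p>A toy illustration of the seed-pair idea: the "stylometric embedding" is the mean difference between embeddings of informal/formal seed words, and a text is scored by cosine similarity with that direction. The 4-dimensional vectors below are invented for illustration; a real system would use pretrained word embeddings:</p>

```python
import numpy as np

# Made-up toy embeddings (hypothetical values, for illustration only).
emb = {
    "yeah":   np.array([1.0, 0.2, 0.0, 0.1]),
    "yes":    np.array([0.2, 1.0, 0.1, 0.0]),
    "dad":    np.array([0.9, 0.1, 0.3, 0.2]),
    "father": np.array([0.1, 0.9, 0.2, 0.3]),
}
seed_pairs = [("yeah", "yes"), ("dad", "father")]

# Mean difference (informal - formal) -> "stylometric embedding".
style_vec = np.mean([emb[a] - emb[b] for a, b in seed_pairs], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def formality_score(text_vec):
    # Higher similarity with the informal direction -> more informal text.
    return cosine(text_vec, style_vec)

print(round(formality_score(emb["yeah"]), 3))   # positive: leans informal
print(round(formality_score(emb["father"]), 3)) # negative: leans formal
```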
          <p>We also analyzed punctuation patterns, focusing specifically on infrequently used punctuation
marks: the ampersand, ellipsis, question mark, quotation mark, and semicolon.</p>
          <p>To generate tags for the dataset based on text length, sentence complexity, and formality, we computed
descriptive statistics: mean, standard deviation, and quantiles across the entire training dataset. These
statistics were then used to establish thresholds for embedding stylometric tags into the original text.</p>
          <p>For text length, we prefixed the original text with The text is long. if it contained at least three
sentences and with The text is short. if it contained only one sentence. For sentence complexity, we
added the phrase The text contains long sentences. at the beginning if the average sentence length exceeded
21 words, and The text contains short sentences. if it was below 15 words. For formality, we used the
phrase The text is highly informal. if the formality measure exceeded 0.2, and The text is formal. if it
was below 0.05. Additionally, we added the tag This text contains reported speech if any reported speech
patterns were detected.</p>
          <p>To generate tags related to punctuation and personal style, we added a specific tag when a designated
word or punctuation mark occurred in the text. For personal style, we identified words such as “I”,
“me” and “my”. For punctuation, we looked for marks including the ampersand, ellipsis, question mark,
quotation mark, and semicolon.</p>
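        <p>The threshold rules above can be collected into a single tagging function. The thresholds (3 sentences, 21/15 words, 0.2/0.05 formality) come from the text; the function signature and the assumption that the input statistics are computed beforehand are our simplification:</p>

```python
def prefix_tags(text, n_sentences, avg_words, formality, has_reported_speech):
    """Prefix a text with the stylometric tag phrases described above."""
    tags = []
    if n_sentences >= 3:
        tags.append("The text is long.")
    elif n_sentences == 1:
        tags.append("The text is short.")
    if avg_words > 21:
        tags.append("The text contains long sentences.")
    elif avg_words < 15:
        tags.append("The text contains short sentences.")
    if formality > 0.2:
        tags.append("The text is highly informal.")
    elif formality < 0.05:
        tags.append("The text is formal.")
    if has_reported_speech:
        tags.append("This text contains reported speech.")
    return " ".join(tags + [text])

print(prefix_tags("One sentence only.", 1, 3, 0.1, False))
```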
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Training models with data augmented by stylometric tags on the entire dataset and subsamples</title>
          <p>In the subsequent phase, we trained models using datasets augmented with specific tags. To evaluate
the impact of these tags, we used the same hyperparameters as those employed for the baseline models.
To expedite the training process, initial experiments were conducted on a randomly reduced dataset
comprising 10,000 observations.</p>
          <p>Following these preliminary experiments, we proceeded to train on the entire dataset, exploring
several variants: datasets modified by the addition of a single tag type (the number of sentences, average words
per sentence, punctuation, personal style, reported speech, or formality level), each tested separately.
Additionally, models were trained on data incorporating various combinations of these tags.</p>
          <p>Initially, we combined all engineered tags; however, this approach introduced excessive noise into
the data. Given the RoBERTa base model’s token limit of 512, we then tested combinations of tags
related to personal words and punctuation, which were relatively short. As a result, three additional
models were trained for each task level (easy, medium, and hard) in this phase.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Text Augmentation</title>
        <p>Using TIRA [10], we opted to make only the final submission for models that demonstrated the best
performance during the internal testing phase. The software used for the final submission processes a
pair of texts as input and adds different tags depending on the dataset level:
• Hard dataset: The input text pairs are modified by adding a punctuation tag.
• Medium dataset: The input text pairs are modified by adding a tag related to sentence complexity
(average words per sentence).
• Easy dataset: The input text pairs are modified by adding both the punctuation tag and the tag
related to personal words.</p>
        <p>Here are examples of original and tagged text using the proposed approach for easy, medium, and
hard datasets:
Modification by adding punctuation and personal style tag</p>
        <p>Original text:</p>
        <p>Why did I think hemp production started earlier? I wonder if it was just government controlled back
then and the 2014 farm bill opened it up more for private business...</p>
        <p>Tagged text:</p>
        <p>Why did I (personal style) think hemp production started earlier? (question mark) I (personal style)
wonder if it was just government controlled back then and the 2014 farm bill opened it up more for private
business. . . (ellipse mark)
Modification by adding only a sentence complexity tag</p>
        <p>Original text:</p>
        <p>If the Russian soldiers (or their wives back in Russia) hear this, it could keep the already low morale of
the Russian solders low. . . .</p>
        <p>Tagged text:</p>
        <p>The text contains long sentences. If the Russian soldiers (or their wives back in Russia) hear this, it could
keep the already low morale of the Russian solders low. . . .</p>
        <p>Modification by adding only a punctuation tag</p>
        <p>Original text:</p>
        <p>The tu quoque defense (Latin for ýou too)´ asserts that the authority trying a defendant has committed
the same crimes of which they are accused. It is related to the legal principle of clean hands, reprisal, and
"an eye for an eye". The tu quoque defense does not exist in international criminal law and has never been
accepted by an international court.</p>
        <p>Tagged text:</p>
        <p>The tu quoque defense (Latin for ýou too)´ asserts that the authority trying a defendant has committed
the same crimes of which they are accused. It is related to the legal principle of clean hands, reprisal, and
(quotation mark) "an eye for an eye" (quotation mark). The tu quoque defense does not exist in international
criminal law and has never been accepted by an international court.</p>
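        <p>A rough re-implementation of the inline tagging shown in these examples might look as follows; the tag wording follows the examples above, but the matching rules are simplified and hypothetical:</p>

```python
import re

def tag_inline(text):
    """Insert personal-style and punctuation tags into a text."""
    # Personal-style words: I, me, my (whole words only).
    text = re.sub(r"\b(I|me|my)\b", r"\1 (personal style)", text)
    # Tag the ellipsis first so its dots are not touched again below.
    text = text.replace("...", "... (ellipse mark)")
    for mark, tag in {"?": "(question mark)",
                      ";": "(semicolon)",
                      "&": "(ampersand)"}.items():
        text = text.replace(mark, f"{mark} {tag}")
    return text

print(tag_inline("Why did I think so?"))
```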
        <p>After preprocessing the text pairs according to a specific schema, the system makes predictions
using models trained on tagged versions of the dataset. Each subtask utilizes a separate model, and the
training methodology is detailed in Section 3.1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Internal testing phase</title>
        <p>During our internal testing phase, we trained the models on subsamples and the entire dataset (separately
for easy, medium, and hard tasks) on original and tagged data. Table 2 presents the macro F1 scores
for hard, medium, and easy datasets obtained by training on a randomly selected subsample (10,000
observations). The results indicate that incorporating stylometric tags, in most cases, led to an
improvement in the macro F1 score, with the best results achieved by adding the reported speech tag and
the text length tag for the hard and easy datasets, respectively. Augmenting the dataset with stylometric tags
improved the macro F1 score by 2% and 1% for the hard and easy datasets, respectively. Adding tags for
the medium dataset does not affect the results, except for the sentence complexity tag, which decreases
the macro F1 score by approximately 1%.</p>
        <p>Table 3 presents the classification results for the hard, medium, and easy validation datasets trained
on the entire tagged dataset, alongside the baseline model outcomes. Notably, the impact of adding
tags was significantly lower than the results achieved by training on dataset subsamples. The best
results for the hard dataset were obtained by adding the punctuation tag, achieving a macro F1 score
of 0.813 on the validation dataset, while adding other tags surprisingly worsened the baseline results. For
the medium dataset, the most significant improvement over the baseline was observed, with the macro
F1 score rising from 0.819 to 0.836. Each tag or combination of tags improved over the baseline
results. For the easy dataset, the best results were achieved by adding personal style, formality level, and
punctuation tags, as well as a combination of punctuation and personal tags. However, the improvement
over the baseline was only 0.001.</p>
        <p>Building on previous studies, we aimed to determine whether incorporating stylometric tags directly
into the text enhances model sensitivity to stylometric features. Our objective was to establish whether
the observed improvements from adding stylometric features were genuinely due to increased model
sensitivity to specific stylometric aspects (e.g., better understanding of punctuation marks and their
significance in detecting author style changes) or if other factors, such as randomness, played a role.</p>
        <p>To test this, we validated the distribution of mean absolute differences between the real labels for the no
authorship change group and the correct predictions for the no authorship change group in both baseline
models and those trained on the tagged dataset. The assumption was that if adding tags enhances the
model’s sensitivity to stylometric features, the difference between the trend observed for actual labels
and model predictions would be smaller for the model trained with tags.</p>
        <p>Table 4 presents the results of the mean absolute differences across two dimensions: real labels (0) and
correct predictions (0) from baseline models and models trained on tagged data. Surprisingly, models
trained on tagged datasets showed higher differences between predictions and real labels compared to
baseline models. This suggests that the observed improvement in macro F1 scores may not be due to
increased sensitivity to stylometric features, but rather other factors.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Final submission</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>Main Findings:
This study introduced a method for directly integrating text with stylometric information.
• The method yielded the most significant F1 improvements when training on smaller datasets,
suggesting its potential use for data types that are difficult to obtain (e.g., authorship for
insurance claims). While there were tags that improved F1 when training on the entire dataset, the
improvements over the baseline were minimal, especially for the hard and easy datasets.
• An attempt was made to determine if adding tags with stylometric information genuinely increased
the model’s sensitivity to specific stylometric features. However, the analysis did not confirm
this hypothesis, indicating the need for further exploration in this area.</p>
      <p>Future Work:
• Given the observed improvements on the medium dataset and the training on subsamples, we
see potential in the proposed method. Future research should focus on a detailed analysis of
which stylometric features are significant. We hypothesize that adding tags may be particularly
beneficial for stylometric features that BERT-family models cannot “learn” on their own.
• Additionally, BERT-family models typically have a limited number of tokens they can process.
Therefore, future work should focus on constructing tags in a concise or implicit manner to
accommodate this limitation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research is supported by the project “OpenFact – artificial intelligence tools for verification of
veracity of information sources and fake news detection” (INFOSTRATEG-I/0035/2021-00), granted
within the INFOSTRATEG I program of the National Center for Research and Development, under the
topic: Verifying information sources and detecting fake news.</p>
      <p>[7] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Multi-Author Writing Style Analysis
Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working
Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024, p. 2.</p>
      <p>[8] P. E. McKnight, J. Najab, Mann-Whitney U test, The Corsini Encyclopedia of Psychology (2010)
1–1.</p>
      <p>[9] Q. Lyu, M. Apidianaki, C. Callison-Burch, Representation of lexical stylistic features in language
models’ embedding space, arXiv preprint arXiv:2305.18657 (2023).</p>
      <p>[10] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot,
F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances
in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes
in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. doi:10.1007/978-3-031-28241-6_20.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korenčić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taulé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle,
          <article-title>Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2024</year>
          , p.
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschuggnall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Specht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Overview of the author identification task at pan-2018: cross-domain authorship attribution and style change detection</article-title>
          ,
          <source>in: Working Notes Papers of the CLEF</source>
          <year>2018</year>
          <article-title>Evaluation Labs</article-title>
          . Avignon, France,
          <source>September 10-14</source>
          ,
          <year>2018</year>
          /Cappellato, Linda [edit.]; et al.,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Custódio</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Paraboni</surname>
          </string-name>
          ,
          <article-title>Each-usp ensemble cross-domain authorship attribution</article-title>
          ,
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zlatkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kopev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mitov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Atanasov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardalov</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>An ensemble-rich multi-aspect approach for robust style change detection</article-title>
          , in: CLEF 2018 Evaluation Labs and Workshop - Working Notes Papers, CEUR-WS.org,
          <year>2018</year>
          , p.
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the multi-author writing style analysis task at pan 2023</article-title>
          , Working Notes of CLEF (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hashemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Enhancing writing style change detection using transformer-based models and data augmentation</article-title>
          , Working Notes of CLEF (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>