Ensemble-Based Clustering for Writing Style Change
Detection in Multi-Authored Textual Documents
Notebook for PAN at CLEF 2022

Shams Alshamasi1 , Mohamed Menai2
1
  Imam Mohammed Ibn Saud Islamic University, College of Computer and Information Science, Computer Science
Department, Riyadh, Saudi Arabia.
2
  King Saud University, College of Computer and Information Science, Computer Science Department, Riyadh, Saudi
Arabia.


                                         Abstract
                                         Style change detection aims at detecting writing style breaches: the positions at which the writing
                                         style changes and authors switch within a multi-authored document. This task has a significant role
                                         in forensic linguistics, cybercrime investigation, and intrinsic plagiarism detection. Detecting authors’
                                         switch positions require decomposing the text into its authorial components. One of the feasible solutions
                                         to achieve this goal is to cluster the textual document into stylistically homogeneous clusters where
                                         each cluster includes all text fragments that are similar in writing style and hence are written by
                                         the same author. In this paper, we propose a within-document authorship clustering method based
                                         on ensemble learning to tackle style change detection in multi-authored documents. The proposed
                                         authorship clustering is an unsupervised learning method that does not require training or parameter
                                         tuning. The only parameter needed is the number of authors which is estimated by using an ensemble
                                         paragraph clustering to accurately capture author distribution at the paragraph level and precisely
                                         predict the number of authors. The experimental results obtained on PAN 2022 test dataset show that
                                         the proposed method achieved an F1 score of 0.52 to detect style changes between paragraphs and an F1
                                         score of 0.49 to detect authors’ switches between sentences. To attribute authors within a document, the
                                         method achieved an F1 score of 0.22, a Diarization error rate of 0.57, and a Jaccard error rate of 0.35.

                                         Keywords
                                         Style Change Detection, Writing Style Analysis, Multi-author Analysis, Authorship Attribution, Author-
                                         ship Clustering


1. Introduction
Style change detection has become an attractive research area related to authorship analysis. It
aims to detect text positions at which the writing style changes and authors switch [1]. The
rapid increase in cybercrimes and digital text forensics has led to extreme demand for authorship
analysis, particularly, authorship attribution that aims at identifying the original author of
a given text [2]. Authorship attribution relies on the fact that authors are distinguished by
their unique writing style which is characterized by some stylistic features [3] including text
readability that measures the text simplicity, clearness, and assesses the reading ease of an
author [4], vocabulary richness that measures the diversity of the vocabulary within a given
CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
$ sfalshmasi@imamu.edu.sa (S. Alshamasi); menai@ksu.edu.sa (M. Menai)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
text [5], and text length that represents the average length of sentences, paragraphs, and words.
For example, some authors usually write short, simple, clear, and easy-to-read sentences with
simple, short, and common words while others write long sentences with complex and foreign
terms.
    Traditional authorship attribution focuses on single-authored documents that are labeled
with their original author [6]. In this sense, a model could be trained to predict an author
for a given anonymous document. Alternatively, modern authorship attribution focuses on
multi-authored documents where multiple authors contribute to a single document. Analyzing
multi-authored documents for authorship attribution is challenging due to the absence of
prior knowledge about the number of collaborative authors, the number of style changes, the
authors’ style, and the variation of authors’ distribution within a document. Thus, analyzing
writing style within multi-authored documents requires decomposing the document into its
authorial components as a preliminary step for authorship attribution. This could be achieved
by identifying the text positions at which writing style changes. In this regard, style change
detection is required for improving authorship analysis in a multi-authored document to achieve
robust and precise authorship attribution. The advantages of style change detection in improving
authorship attribution are extended to cover other potential areas including forensic linguistics
for detecting suspicious documents, cybercrime investigation, law enforcement, social media
analysis for detecting identity stealing, intrinsic plagiarism detection, and literary research
analysis for literary plagiarism detection.
    Style change detection has been one of the attractive shared tasks involved in PAN: the series of
shared tasks on writing style analysis and digital text forensics, which holds at CLEF conferences
1 . The evaluation results of the annual style change detection competitions from PAN 2017 to

PAN 2021 [6, 7, 8, 9, 10] show that style change detection is extremely challenging and has not
been adequately resolved yet. The state-of-the-art style change detection systems have some
limitations and demonstrate relatively weak performance, particularly due to the large number
of extracted features which involve high space complexity and long-running time [11, 12]. Thus,
selecting the appropriate stylistic features to discriminate authors’ writing style, and choosing
the adequate model to precisely identify authorial boundaries are the main challenges to be
tackled. Despite these challenges, the awareness of the significant role of style change detection
in improving authorship attribution within a multi-authored document motivates for placing
more emphasis on developing the appropriate automated style analysis model that is capable of
discriminating authors’ writing style and identifying authorial boundaries within a document
in a more accurate, robust, and cost-effective way.
    As evidenced in [13, 14, 15], clustering analysis is beneficial for tackling the style change
detection problem in a textual document in which the document is segmented into stylistically
similar groups to detect writing style inconsistency between different text fragments. In this
paper we propose ensemble-based authorship clustering method for detecting style changes
between paragraphs and sentences. The proposed method is based on ensemble clustering
that composes of different K-means clustering models to identify the approximate number of
authors who contribute to a document. The predicted number of authors is then used to cluster
paragraphs or sentences within a document into disjoint clusters where each cluster includes

    1
        https://pan.webis.de/
all text fragments written by the same author. Paragraphs or sentences that are belonging to
different clusters are predicted to be written by different authors which indicate style changes
between them.
   The rest of this paper is organized as follows: Section 2 describes the tasks required for PAN
2022 style change detection competition. Section 3 reviews the state-of-the-art style change
detection approaches. Section 4 introduces the proposed authorship clustering method. It
highlights the proposed stylistic features at both sentence and paragraph levels and describes
the proposed clustering method including the proposed ensemble paragraph clustering for
identifying the number of authors within a document and the applied clustering methods for
solving each required task. The experimental setup is presented in section 5. Section 6 covers
the performance evaluation. It outlines the evaluation metrics, highlights the experimental
results, and discusses the obtained results. The paper is concluded in section 7.


2. Task Description
Style change detection competitions are established officially by PAN in 2017. Since 2017, PAN
has provided researchers and competitors with complete datasets for the annual style change
detection competitions. These competitions involved different shared tasks every year.
  This year, PAN 2022 [16] style change detection competition aims to detect writing style
changes and authors switch between consecutive paragraphs and sentences as well as to attribute
paragraphs to their original authors [17]. This competition involves three main tasks:

Task-1: detect authors switch between consecutive paragraphs given a document written by
     two authors that contains only single style change.

Task-2: given a multi-authored document written by two or more authors the task is to assign
     a unique author to each paragraph (Authorship attribution at the paragraph level.)

Task-3: given a text document written by two or more authors the task is to identify the text
     positions at which writing style changes and authors switch between sentences.

  Figure 1 illustrates an example of some possible scenarios and the expected output.
  As shown in Figure 1, the output of Task-1 and Task-3 is represented as a list of binary
values (0 or 1) where 0 indicates no style changes between consecutive paragraphs (Task-1) or
consecutive sentences (Task-3), and 1 indicates a style breaches and author switch between
consecutive paragraphs (Task-1) or sentences (Task-3). The output of Task-2 is a list of integer
numbers that represent the authors attribution result where each number represents a unique
author who contributes to a document.
  Document A shows an example of authors switch existing between the first and the second
paragraph within the document. Document B shows an example of a multi-authored document
written by three authors where the first author writes the first paragraph, the second author is
assigned to the second and third paragraphs, and the third author is attributed to the fourth
paragraph. Document C illustrates an example of a document written by three authors where
authors switch between sentences.
Figure 1: Style Change Detection - Possible Scenarios.
Adapted from https://pan.webis.de/clef22/pan22-web/style-change-detection.html#task


3. Related Works
This section reviews state-of-the-art style change detection approaches that have participated in
PAN style change detection competitions from PAN CLEF 2017 to PAN CLEF 2021 [7, 6, 8, 9, 10].
First style change detection approaches were proposed in 2017 including the similarity-based
approach which aims to measure the similarity between text fragments and the statistical
approach. Khan [18] proposed a threshold-based text segmentation method for segmenting
the text into stylistically homogeneous parts by measuring the similarity between adjacent
text windows and merges windows that are similar in writing style using a threshold value. A
sentence outlier detection method was proposed by Safin and Kuznetsova [19] to measure the
similarity between neural sentence embeddings. Wilcoxon Signed-Rank statistical test [20] was
proposed by Karas et al. [21] to verify whether two consecutive paragraphs have a significant
stylistic differences.
   Machine learning approaches have been proposed since 2018 to solve different style change
detection tasks. Supervised machine learning methods (binary classification methods) were
proposed to discriminate single-authored documents from multi-authored ones. A stacking
ensemble classifier based on some lexical and syntactic features was proposed by Zlatkova et
al. [22] for classifying documents into single-authored or multi-authored. Safin and Ogaltsov
[23] proposed another ensemble classifier to discriminate single-authored documents from
multi-authored ones. Strom [24] also proposed a stacking ensemble classifier based on BERT
[25] embeddings and features proposed by Zlatkova et al. [22] to verify whether a document is
single or multi-authored as well as to detect style changes between paragraphs. The proposed
approach uses a recursive method based on the prediction generated by the ensemble to attribute
an author to each paragraph. An authorship verification method was proposed by Singh et al.
[26] to detect style changes between paragraphs and attribute each paragraph to its original
author. A simple and fast divide-and-conquer method was proposed by Khan [27] to measure
the similarity between text groups using some lexical and syntactic features to predict whether
a document is written by a single or multiple authors.
   Unsupervised machine learning approaches (clustering methods) were proposed to identify
the number of authors who contribute to writing a document as well as to detect style changes
between paragraphs. Window merge clustering and threshold-based clustering using the top
50 frequent terms were proposed by Nath [28] for segmenting the document into stylistically
homogenous groups. Castro et al. [29] proposed a non-overlapping B0-maximal clustering
algorithm based on some lexical and syntactic features to cluster paragraphs within a document.
The method uses a heuristic based on paragraph order to minimize the overlap between the
generated paragraph clusters. Zuo et al. [30] proposed a hybrid approach that combines
supervised and unsupervised methods to identify the number of authors within a multi-authored
document. The hybrid approach first predicts whether a document is written by one or more
authors using a feed-forward neural network trained on the TF-IDF document representation.
An ensemble clustering of Kmeans and a hierarchical clustering algorithm based on lexical
and syntactic features is then used to identify the number of authors at the paragraph level.
Another hybrid approach that combines deep features and classification method was proposed
by Iyer and Vosoughi [31] to detect style changes between paragraphs using the BERT sentence
embeddings in conjunction with a random forest classifier.
   Deep learning approaches were proposed to verify whether a document is written by one
or more authors. A Character-based Convolutional Neural Network (CNN) was proposed by
Schaetti [11] to learn documents’ stylistic characteristics for predicting whether a document is
written by a single author or multiple authors. A parallel multi-level Recurrent Neural Network
(RNN) was proposed by Hosseinia and Mukherjee [12] to verify whether a document is single
or multi-authored. The proposed RNN learns the underlying language structure of a document
using sentence parse trees. Siamese Neural Network was proposed by Nath [32] to estimate the
similarity between all paragraphs within a given document for predicting style breaches at the
paragraph level. Multi-layer perceptron and a bidirectional Long-Short Term Memory (LSTM)
were proposed by Deibel and Lofflad [33] using word embeddings generated by the FastText
model and some lexical and syntactic features to verify whether one or more authors write
a document as well as to detect style changes between paragraphs. The prediction made by
the LSTM is also used to attribute authors to each paragraph. A similarity-based classification
method was proposed by Zhang et al. [34] to predict style change within a textual document.
The proposed method uses a fully connected neural network combined with BERT embeddings
to predict the similarity between consecutive paragraphs. The predicted similarity is then used
to verify whether a document is written by one or more authors, detect style changes between
paragraphs as well as assign authors to each paragraph.
   Some of the proposed deep learning approaches have some limitations due to the complexity
of these models that involve long running time and high space complexity caused by a large
number of extracted features as evidenced in [12] which proposed an RNN trained using a
large number of extracted parse trees. Moreover, some proposed deep learning models were
not trained or tuned deeply due to the lack of training data which caused overfitting that
significantly affects the performance and reduces the accuracy, as evidenced in [11].
   The review of the state-of-the-art style change detection approaches shows that ensembles
of classic machine learning algorithms usually provide good results for detecting style breaches.
The stacking ensemble classifier proposed by Zlatkova et al. [22] was the best performing
approach to distinguish single-authored documents from multi-authored ones. The ensemble
clustering proposed by Zuo et al.[13] outperforms the existing works in identifying the number
of authors within a document. clustering methods have proven their strengths in learning the
hidden stylistic patterns for distinguishing multiple authors, particularly with the absence of
prior knowledge about the authorship, as evidenced in [14, 13, 29].


4. Proposed Approach
Identifying multiple authors within a multi-authored document requires decomposing it into
its authorial components. To achieve this goal, we propose a within-document authorship
clustering approach based on ensemble learning to tackle style change detection within a multi-
authored document. The proposed approach aims to first identify the number of authors within
a document using an ensemble paragraph clustering and then decompose the textual document
into its authorial components by clustering the text into stylistically homogeneous groups
where each group includes all text fragments (sentences/paragraphs) that are similar in writing
style and written by the same author. This method is based on measuring the similarity between
consecutive text fragments (sentences or paragraphs) to verify whether they are written by
the same author or not. From this point of perspective, the method is also contributed to the
task of authorship verification [35]. The proposed method is an unsupervised learning method
that does not require training or parameter tuning. The only needed parameter is the number
of the author which is predicted by the ensemble paragraph clustering to capture the author
distribution at the paragraph level. These are the main strengths of the proposed method to save
the time required for training a model, as well as the efforts needed for tuning parameters. This
section highlights the selected stylistic features and describes the clustering methods applied to
tackle each of the required style change detection tasks.

4.1. Stylistic Features
The task of style change detection this year aims at detecting authors’ switches between
paragraphs and sentences. Thus, we proposed sets of features at both paragraph and sentence
levels. Table 1 outlines the selected features at the sentence level.
   We selected lexical and syntactic features due to their relevance to authors’ style as they
represent authors writing preferences and stylistic choices in comparison with the other possible
features such as context-based features and semantic features which are suitable for modeling
the topics or representing text context and meaning. Moreover, the extraction of lexical and
syntactic features is easy, simple, and usually requires a short running time.
   Sentences are usually short and hence the number of features that could be used to characterize
authors writing style at the sentence level is limited. We referred to the fact that some authors
usually write long and complex sentences while other authors prefer to use short and simple
sentences. Moreover, some authors use complex vocabulary (e.g., long words, 2-syllable words,
3-syllable words, etc.), on the other hand, some authors use simple and short words that compose
one syllable. As shown in Table 1 , we selected features that describe the length, simplicity,
and complexity of sentences including sentence length, average word length per sentence, and
Table 1
Sentence-Level Features
                                #         Stylistic Features
                                1   Sentence Length By Characters
                                2   Sentence Length By Words
                                3   Average Word Length
                                4   Average Word Syllable
                                5   Stopwords Count
                                6   Function words Count
                                7   Punctuation Marks Ratio


average word syllable per sentence. Authors differ in terms of the number of stopwords they
usually used including the prepositions, pronouns, and determiners. Also, authors differ in
the way they use the function words that represent the grammatical and structural relation
between content words but don’t have an intrinsic meaning on their own 2 such as auxiliary,
modals, qualifiers, etc. We proposed stopword count and function words count that represent
the total number of stopwords/function words within a sentence. This is due to the short length
of the sentence where most stopwords/function words are not existing and hence the existence
of individual stopword/function words does not matter but their count does. We utilized the
list of stopwords provided by the NLTK library [36] and the list of function words proposed
by Zlatkova et al. [22] Punctuation marks could be a distinctive characteristic for identifying
authors since some authors usually used a large number of punctuation marks in comparison
with others. We proposed the punctuation ratio that represents the total number of punctuation
marks to the total number of words per sentence. Table 2 highlights the selected features at the
paragraph-level.
   Paragraphs are longer than sentences which makes some features more discriminatory when
they are extracted at the paragraph level rather than the sentence level. As shown in Table 2,
we proposed the readability that measures the simplicity and clearness as well as assesses the
reading ease of a given text since authors differ in simplicity and clarity of their writing. We
extracted different readability scores including Flesch Reading Ease Score (FRES) [37], Flesch
Kincaid Readability Index (FKRI) [38], Automated Readability Index (ARI) [39], Linsear Write
Formula (LWF) [40], and difficult words. These readability scores are extracted using textstat
python package 3 . Vocabulary richness that measures vocabulary variation and diversity is
also proposed to discriminate authors since some authors have rich language and use unique
words while others usually have a limited set of common vocabulary. We proposed n-grams
including word bigrams and word trigrams since they capture the association and co-occurrence
between terms which allows for capturing the frequent phrases preferred by some authors.

    2
      https://www.bitgab.com/english-grammar/function-words
    3
      Textstat is a python package that provides functions to measure text readability and complexity
(https://github.com/shivam5992/textstat)
Table 2
Paragraph-Level Features
                    #                      Stylistic Features
                    1      Paragraph Length By Characters
                    2      Paragraph Length By Sentences
                    3      Average Sentence Length
                    4      Average Word Length
                    5      Vocabulary Richness
                    6      Readability
                    7      Stopwords TF-IDF
                    8      Top 50 Frequent Terms
                    9      Words N-grams (Bigrams and Trigrams)
                    10     Top 150 Character N-grams (Bigrams and Trigrams)
                    11     POS Frequency
                    12     POS N-grams (Bigrams and Trigrams)


Character n-grams are proposed to capture some low-level information. By varying n, character
n-grams allow for capturing some punctuation marks, special characters, frequent stopwords,
and some short function words. We extracted the top 150 frequent character bigrams and
trigrams to reduce the dimensionality of the extracted characters. Finally, since authors differ
by the number of different Part-Of-Speech (POS) tag categories (e.g., the number of used nouns,
verbs, prepositions, modals, etc.), we propose the pos frequency to measure the frequency
of different pos tags categories that appeared in a given text. We utilized a POS dictionary
proposed by Zlatkova et al. [22]. POS n-grams represent the syntax of a given text by capturing
the location of different POS categories (e.g., whether the adjectival phrase occurs before or
after the subject). We proposed POS n-grams due to their ability to capture the grammatical
aspects since authors are characterized by their unique way of using the language while writing.
For example, some authors write the adverb at the beginning of the sentence while others prefer
to write the adverb in the middle of a sentence before the verb.

4.2. Clustering Method
To detect writing style inconsistency between paragraphs and sentences we propose an au-
thorship clustering method. We implemented the authorship clustering method to measure the
style similarity between paragraphs and sentences and cluster the paragraphs or sentences into
well-separated clusters in which each cluster includes all paragraphs/sentences written by the
same author. Each generated cluster represents a unique author, and hence, the number of the
generated clusters represents the number of authors who contribute to the document. Based
on the problem description, authors can switch between paragraphs (Task-1 and Task-2) but
not within a paragraph, and hence each paragraph should be written by exactly one unique
author. For Task-3, authors switch between sentences, and hence each sentence should be
assigned to exactly one author. Thus, the generated clusters should be well-separated and
non-overlapping. For this reason, we selected a partition-based clustering method in which
all paragraphs/sentences are clustered into mutually disjoining groups without any overlap
between clusters.
   Various partition-based clustering algorithms could be used for text clustering. We realized
the popularity of the K-means clustering algorithm for authorship clustering. K-means is simple,
easy to implement, and usually converges fast. Due to its strengths, we selected the K-means
clustering algorithm to tackle the style change detection problem in textual documents. We used
a simple K-means algorithm 4 provided by the Scikit-learn library [41] with random centroids
initialization and Euclidean distance to measure the similarity between text fragments and
cluster centroids. K-means algorithm requires the input parameter k that represents the number
of disjoint clusters/partitions to be pre-determined. This was the main challenging step in the
proposed approach since the number of clusters that represent the number of collaborative
authors is unknown. Hence, the goal first is to automatically estimate the number of authors
who contribute to the document. We propose an ensemble paragraph clustering method to find
the approximate k value corresponding to the number of authors.

4.3. Ensemble Paragraph Clustering
Detecting authors’ switch between sentences is challenging since sentences are usually very
short to discriminate authors. Paragraphs are the smallest meaningful text unit that could
be used for characterizing authors writing style and capturing authors distribution within
a document. From this point of perspective, we propose to identify the number of authors
who contribute to writing a document at the paragraph level. The proposed method aims at
generating ensemble paragraph clustering models by clustering document paragraphs using
different values of possible k. Each generated paragraph clustering model is then evaluated
using an optimization score. The best clustering model that optimizes the required score is
then selected and the optimal k value that corresponds to the selected model is then used to
cluster text within a document. To ensure that the generated clusters are well-separated, the
best clustering model to be selected is the model that minimizes the intra-distances between
cluster instances and maximizes the inter-distance between different clusters.
   Within-Cluster Sum-of-Square Error (WCSSE) that measures the intra-distance is proposed
in this paper as an optimization score to assess the generated paragraph clustering models. This
is similar to an elbow method that selects the optimal k which optimizes WCSSE by considering
the value of WCSSE that shape an elbow. Since the elbow method requires plotting a graph
to show the relationship between each possible value of k and the corresponding estimated
WCSSE which is hard to be plotted per document, thus we define an optimization score that
guarantees to find the optimal k value corresponding to the WCSSE near to the point-shaped
an elbow. We define the best WCSSE score to be the minimum value greater than or equal to
the average WCSSE estimated from the ensemble paragraph clustering. Figure 2 illustrates the

   4
       https://github.com/scikit-learn/scikit-learn/blob/16625450b/sklearn/cluster/𝑘 𝑚𝑒𝑎𝑛𝑠.𝑝𝑦𝐿1126
proposed ensemble paragraph clustering to find the optimal K corresponding to the number of
authors.


Figure 2: Ensemble Paragraph Clustering


   As shown in Figure 2 , the input document is first segmented into paragraphs. Paragraphs
features (Table 2) are then extracted from each paragraph resulting in representing each para-
graph as a feature vector. Paragraphs feature vectors are then fed into an ensemble paragraph
clustering based on a k-means algorithm at which k is ranged from two to five. The minimum
value of k is set to two since the minimum number of authors who collaborate in writing a
single document is two while the maximum value is set to five based on the description of the
input documents which states that documents are written by up to five authors. The generated
paragraph clustering models are then evaluated using the WCSSE (inertia). The average WCSSE
is then estimated. The optimal K value is the value corresponding to the minimum WCSSE
greater than or equal to the average WCSSE. The selected K value at the paragraph level is then
used to cluster paragraphs or sentences. Below is a description of how we adapt the proposed
ensemble-based authorship clustering method for solving each of the required tasks.

4.4. Clustering Method: Task-1
Task-1 aims to detect style changes between paragraphs in a document written by two authors
given that a document contains only a single style change. The number of authors in this task
is given, and hence the proposed ensemble paragraph clustering to find the optimal number of
authors is not used. To solve this task, the input document is segmented into paragraphs and
the paragraphs features (Table 2) are extracted from each paragraph. The paragraph feature
vectors are then fed into a K-means clustering algorithm where k is set to two. The similarity
between paragraphs’ writing style characteristics is measured and the paragraphs are grouped
into clusters where each cluster includes paragraphs that are predicted to be written by the
same author. The resulted paragraph clusters are then used to detect style changes and authors
switch between consecutive paragraphs such that if two consecutive paragraphs are belonging
to different clusters this indicates a style change between them.
   This method would predict multiple style changes between document paragraphs. To be
compatible with the problem specification and detect the only single change, we consider all
paragraph pairs that are predicted to have a style change between them. We estimate the cosine
similarity between these paragraph pairs and merge all paragraph pairs that are very similar in
writing style (have a high cosine similarity). To avoid setting a threshold for merging paragraph
pairs, we consider only the style change between paragraph-pair that have the minimum cosine
similarity between them since this indicates a high style inconsistency between this paragraph
pair while other paragraph pairs are merged in one cluster. This ensures that only a single
style change position is detected by the method such that this position refers to the text border
between paragraph-pair with the minimum cosine similarity.

4.5. Clustering Method: Task-2
Task-2 aims to attribute each paragraph to its original author given that a document is multi-
authored. To attribute authors within a document, the approximate number of actual authors
needs to be defined. The proposed ensemble paragraph clustering is used to find the optimal
number of authors (Section 4.3). To solve this task, the input document is first segmented
into paragraphs and the paragraph features (Table 2) are extracted. The generated paragraph
vectors are fed into the ensemble paragraph clustering where k is ranged from two to five. The
resulted paragraph clustering models are then evaluated using the WCSSE (inertia) to select the
best paragraph clustering model that optimizes the inertia. The selected paragraph clustering
model is then used to attribute authors to each paragraph. The resulted paragraph clusters
are exploited to perform the authorship attribution within a document such that each resulted
paragraph cluster represents a unique author and hence should contain all paragraphs written
by that author. The Paragraph cluster labels are exploited to represent the unique authors.

4.6. Clustering Method: Task-3
Task-3 aims at detecting authors’ switches between sentences. Before detecting authors switch
between sentences, the number of actual authors who contribute to writing the document
needs to be estimated which is the main challenge of this task. Sentences are very short to
characterize the authors writing style. On the other hand, paragraphs are longer than sentences
and hence are more representative for authors distribution within a document. From this point
of perspective, we propose to use the ensemble paragraph clustering (Section 4.3) to precisely
select the optimal number of authors who contribute to writing the document. The selected K
value at the paragraph level is then used with the sentence-level features (Table 1) to cluster
document sentences.
  The input document is first segmented into paragraphs and the paragraph features are then
extracted from each paragraph. The generated paragraph vectors are then fed into the proposed
ensemble paragraph clustering to find the optimal k value corresponding to the number of
authors. The selected k value at the paragraph level is then used to cluster document sentences.
The document is segmented into sentences and the sentence features are extracted. Sentence
vectors are then fed into a K-means clustering algorithm with the selected K value to cluster
sentences into similar groups. The generated sentence clusters are then used to detect authors’
switches between sentences such that consecutive sentences belonging to different clusters
have a different writing style and hence are predicted to be written by different authors which
indicates style changes and authors’ switches between them.


5. Experimental Setup
This section describes the conducted experiments including the used dataset, the applied text
pre-processing, feature extraction, and the method applied for parameter selection.

5.1. Dataset
Style change detection 2022 dataset composes of text documents that are constructed using
user posts collected from various StackExchange sites and covering different topics [17]. All
documents are provided in English and written by up to five distinct authors. Three datasets are
provided for each required task including dataset1, dataset2, and dataset3 for solving task1,task2,
and task3, respectively. Each dataset is divided into training, validation, and testing sets. The
training set contains 70% of the whole dataset while each of the validation and testing sets
contains 15% of the whole dataset.

5.2. Experimental Setting
Text documents are not preprocessed by stopwords removal, stemming, or lemmatization since
the proposed method is based on stylometric features for discriminating author writing style
which characterizes by the use of stopwords, function words, POS tags frequency, etc. The only
text preprocessing used to prepare the documents for clustering is the text segmentation into
paragraphs or sentences and the text tokenization to represent text as a set of tokens for features
extraction. Newline is used as a delimiter as specified in the problem description to segment
the document into paragraphs for Task-1 and Task-2, or into sentences for Task-3. Segmented
paragraphs or sentences are tokenized using the NLTK tokenizer [36]. Features are extracted
from paragraphs and sentences during the running time using the NLTK and Scikit-learn library
[41]. The segmented paragraphs and sentences are clustered using the K-means algorithm
provided by the Scikit-learn library 5 .
   The value of the K input parameter is selected during the running time by using the proposed
ensemble paragraph clustering method (Section 4.3) for both Task-2 and Task-3. During the
   5
       https://github.com/scikit-learn/scikit-learn/blob/16625450b/sklearn/cluster/𝑘 𝑚𝑒𝑎𝑛𝑠.𝑝𝑦𝐿1126
running time, text fragments are clustered and style breaches are detected between consecutive
paragraphs (Task-1 and Task-2) or consecutive sentences (Task-3). The detected breaches are
then written into a solution file corresponding to each input document. The generated solution
files are used to evaluate the performance by comparing the truth or actual breaches provided
in the dataset with the predicted breaches written to the solution files. Since the method does
not require training, both the provided training and validation datasets are used to evaluate the
performance.


6. Performance Evaluation
This section outlines the evaluation metrics and presents the evaluation results.

6.1. Evaluation Metrics
Each task is evaluated independently using a macro-averaged F-score across all documents.
Task-2 is evaluated using the macro-averaged F-score combined with two extra metrics including
Diarization Error Rate (DER) and Jaccard Error Rate (JER) [17] that measure the text fractions
attributed incorrectly 6 .

6.2. Results
Foremost, the proposed clustering methods were evaluated using the provided training and
validation datasets. The methods were then submitted to the TIRA platform [42] and evaluated
using a testing dataset. Table 3 presents the obtained evaluation results from training and
validation datasets. Table 4 shows the evaluation results of the proposed model on testing
dataset and compares its performance against PAN 2022 style change detection random baseline.

Table 3
Performance Evaluation On Training and Validation Datasets
                                          Task-1                Task-2             Task-3
                        Dataset
                                         F1-score     F1-score     DER     JER    F1-score
                  Training Dataset          0.53         0.21       0.57   0.35     0.50
                  Validation Dataset        0.54         0.22       0.57   0.36     0.50


   6
       https://pan.webis.de/clef22/pan22-web/style-change-detection.html
Table 4
Performance Evaluation On Testing Dataset
                                               Task-1                 Task-2            Task-3
                         Model
                                              F1-score     F1-score     DER     JER    F1-score
             Proposed Model                      0.52          0.22      0.57   0.35     0.49
             PAN 2022 Random Baseline            0.32          0.26      0.54   0.40     0.48


6.3. Discussion and Analysis
As shown in Table 3 and Table 4, the performance of the proposed methods is stable and they
achieve approximately very close evaluation scores on the three datasets.
   The evaluation results on PAN 2022 test dataset (Table 4) show that the most challenging
task is the authorship attribution within a multi-authored document (Task-2). The proposed
method for tackling authorship attribution at the paragraph level (Task-2) achieves a low f1-
score of 0.22 but this obtained f1-score is near to the performance of the PAN 2022 random
baseline which has achieved f1-score of 0.26. The low performance of Task-2 is due to the
challenges involved in this task since the number of authors is not given and the style of
each author is unknown. Moreover, the paragraph length is very short which makes the style
discrimination between paragraphs very hard, and hence the performance is affected. However,
the evaluation results of the style change detection task this year [17] show that our proposed
method for tackling Task-2 achieves the best Diarization Error Rate (DER) and Jaccard Error
Rate (JER) in comparison with all the submitted solutions 7 . This shows the strength of the
proposed paragraph-level features for attributing paragraphs to their original author as well
as the strength of the proposed ensemble-paragraph clustering to estimate the approximate
number of authors. The proposed clustering method for Task-1 achieves an F1-score of 0.52 and
outperforms the PAN 2022 random baseline for this task. In Task-1, the number of authors is
given which is much easier than in Task-2, but still, we are restricted to detecting only one single
style change between paragraphs. However, our clustering method for Task-1 would detect
multiple style changes between consecutive paragraphs where we tackled this by estimating the
cosine similarity between all paragraph pairs that are predicted to have style changes between
them to keep only one change between paragraphs that have minimum cosine similarity. The
challenges in this task are the short paragraph length and the small number of authors (only
two authors) which cause the cosine similarity between paragraphs to be very close and hence
the discrimination between paragraph pairs to detect only one change becomes much harder.
Although Task-3 is hard to tackle due to the short length of sentences and the small number of
features used for characterizing the authors writing style at the sentence level, it achieves an
acceptable f1-score of 0.49 which is near to the PAN random baseline performance.


   7
       https://www.tira.io/task/pan22-style-change-detection
7. Conclusion
In this paper, we proposed ensemble-based authorship clustering method for tackling style
change detection in multi-authored textual documents. The proposed method is implemented
in three different ways to tackle each of the required tasks of PAN 2022. We proposed a set of
stylistic features to characterize authors’ writing style at both paragraph and sentence levels.
Moreover, we proposed an ensemble paragraph clustering to approximate the number of authors
who contribute to writing a single document. The proposed methods outperform the PAN
2022 random baseline for Task-1 and Task-3. The achieved f1-score for Task-2 is low due to
the paragraphs’ short length which makes author discrimination at the paragraph level much
harder. However, the proposed method for tackling Task-2 achieves the best DER and JER in
comparison with all the submitted works.
   The findings from this research show that the main challenges in writing style change
detection are the selection of the appropriate set of features to discriminate authors writing
style which require deep knowledge of stylometry and linguistics, and the choice of an effective
method to precisely identify the number of authors with the absence of prior knowledge about
the number of authors and their distribution within a document.
   In the future, we plan to investigate deep learning combined with the clustering for tackling
style change detection by exploiting deep features generated from a model embedding layer
and performing the clustering on the generated deep features.


Acknowledgments
This work motivates academic and research interests in authorship analysis and stylometry. All
thanks to Professor Mohamed Menai for his endless support and encouragement to accomplish
this research.


References
 [1] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl,
     R. Ortega-Bueno, P. Pęzik, M. Potthast, et al., Overview of pan 2022: Authorship ver-
     ification, profiling irony and stereotype spreaders, style change detection, and trigger
     detection, in: European Conference on Information Retrieval, Springer, 2022, pp. 331–338.
 [2] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the
     American Society for information Science and Technology 60 (2009) 538–556.
 [3] K. Lagutina, N. Lagutina, E. Boychuk, I. Vorontsova, E. Shliakhtina, O. Belyaeva, I. Para-
     monov, P. Demidov, A survey on stylometric text features, in: 2019 25th Conference of
     Open Innovations Association (FRUCT), IEEE, 2019, pp. 184–195.
 [4] S. Karmakar, Y. Zhu, Visualizing multiple text readability indexes, in: 2010 International
     Conference on Education and Management Technology, IEEE, 2010, pp. 133–137.
 [5] S. Ashraf, H. R. Iqbal, R. M. A. Nawab, Cross-genre author profile prediction using
     stylometry-based approach., in: CLEF (Working Notes), Citeseer, 2016, pp. 992–999.
 [6] M. Kestemont, M. Tschuggnall, E. Stamatatos, W. Daelemans, G. Specht, B. Stein, M. Pot-
     thast, Overview of the author identification task at pan-2018: cross-domain authorship
     attribution and style change detection, in: Working Notes Papers of the CLEF 2018 Evalu-
     ation Labs. Avignon, France, September 10-14, 2018/Cappellato, Linda [edit.]; et al., 2018,
     pp. 1–25.
 [7] M. Tschuggnall, E. Stamatatos, B. Verhoeven, W. Daelemans, G. Specht, B. Stein, M. Potthast,
     Overview of the author identification task at pan-2017: style breach detection and author
     clustering, in: Working Notes Papers of the CLEF 2017 Evaluation Labs/Cappellato, Linda
     [edit.]; et al., 2017, pp. 1–22.
 [8] W. Daelemans, M. Kestemont, E. Manjavacas, M. Potthast, F. Rangel, P. Rosso, G. Specht,
     E. Stamatatos, B. Stein, M. Tschuggnall, et al., Overview of pan 2019: bots and gender
     profiling, celebrity profiling, cross-domain authorship attribution and style change detec-
     tion, in: International conference of the cross-language evaluation forum for european
     languages, Springer, 2019, pp. 402–416.
 [9] J. Bevendorff, B. Ghanem, A. Giachanou, M. Kestemont, E. Manjavacas, I. Markov, M. May-
     erl, M. Potthast, F. Rangel, P. Rosso, et al., Overview of pan 2020: Authorship verification,
     celebrity profiling, profiling fake news spreaders on twitter, and style change detection,
     in: International Conference of the Cross-Language Evaluation Forum for European Lan-
     guages, Springer, 2020, pp. 372–383.
[10] J. Bevendorff, B. Chulvi, G. L. D. L. Peña Sarracén, M. Kestemont, E. Manjavacas, I. Markov,
     M. Mayerl, M. Potthast, F. Rangel, P. Rosso, et al., Overview of pan 2021: Authorship
     verification, profiling hate speech spreaders on twitter, and style change detection, in: In-
     ternational Conference of the Cross-Language Evaluation Forum for European Languages,
     Springer, 2021, pp. 419–431.
[11] N. Schaetti, Unine at clef 2018: Character-based convolutional neural network for style
     change detection, Training 2980 (2018) 1490.
[12] M. Hosseinia, A. Mukherjee, A parallel hierarchical attention network for style change
     detection (2018).
[13] C. Zuo, Y. Zhao, R. Banerjee, Style change detection with feed-forward neural networks.,
     in: CLEF (Working Notes), 2019.
[14] S. Nath, Style change detection by threshold based and window merge clustering methods.,
     in: CLEF (Working Notes), 2019.
[15] D. Castro-Castro, C. A. Rodríguez-Lozada, R. Muñoz, Mixed style feature representation
     and b-maximal clustering for style change detection., in: CLEF (Working Notes), 2020.
[16] J. Bevendorff, B. Chulvi, E. Fersini, A. Heini, M. Kestemont, K. Kredens, M. Mayerl,
     R. Ortega-Bueno, P. Pezik, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wieg-
     mann, M. Wolska, E. Zangerle, Overview of PAN 2022: Authorship Verification, Profiling
     Irony and Stereotype Spreaders, and Style Change Detection, in: M. D. E. F. S. C. M. G. P. A.
     H. M. P. G. F. N. F. Alberto Barron-Cedeno, Giovanni Da San Martino (Ed.), Experimental
     IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth
     International Conference of the CLEF Association (CLEF 2022), volume 13390 of Lecture
     Notes in Computer Science, Springer, 2022.
[17] E. Zangerle, M. Mayerl, M. Potthast, B. Stein, Overview of the Style Change Detection
     Task at PAN 2022, in: CLEF 2022 Labs and Workshops, Notebook Papers, CEUR Workshop
     Proceedings, 2022.
[18] J. A. Khan, Style breach detection: An unsupervised detection model., in: CLEF (Working
     Notes), 2017.
[19] K. Safin, R. Kuznetsova, Style breach detection with neural sentence embeddings., in:
     CLEF (Working Notes), 2017.
[20] R. F. Woolson, Wilcoxon signed-rank test, Wiley encyclopedia of clinical trials (2007) 1–3.
[21] D. Karas, M. Spiewak, P. Sobecki, Opi-jsa at clef 2017: Author clustering and style breach
     detection., in: CLEF (Working Notes), 2017.
[22] D. Zlatkova, D. Kopev, K. Mitov, A. Atanasov, M. Hardalov, I. Koychev, P. Nakov, An
     ensemble-rich multi-aspect approach for robust style change detection, CLEF (Working
     Notes) (2018).
[23] K. Safin, A. Ogaltsov, Detecting a change of style using text statistics, CLEF (Working
     Notes) (2018).
[24] E. Strøm, Multi-label style change detection by solving a binary classification problem, in:
     CLEF (Working Notes), 2021.
[25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[26] R. Singh, J. Weerasinghe, R. Greenstadt, Writing style change detection on multi-author
     documents, in: CLEF, 2021.
[27] J. A. Khan, A model for style change detection at a glance (2018).
[28] S. Nath, Style change detection by threshold based and window merge clustering methods.,
     in: CLEF (Working Notes), 2019.
[29] D. Castro-Castro, C. A. Rodríguez-Lozada, R. Muñoz, Mixed style feature representation
     and b-maximal clustering for style change detection., in: CLEF (Working Notes), 2020.
[30] C. Zuo, Y. Zhao, R. Banerjee, Style change detection with feed-forward neural networks.,
     in: CLEF (Working Notes), 2019.
[31] A. Iyer, S. Vosoughi, Style change detection using bert., in: CLEF (Working Notes), 2020.
[32] S. Nath, Style change detection using siamese neural networks, in: CLEF (Working Notes),
     2021.
[33] R. Deibel, D. Löfflad, Style change detection on real-world data using lstm-powered
     attribution algorithm, in: CLEF, 2021.
[34] Z. Zhang, Z. Han, L. Kong, X. Miao, Z. Peng, J. Zeng, H. Cao, J. Zhang, Z. Xiao, X. Peng,
     Style change detection based on writing style similarity, Training 11 (1970) 17–051.
[35] Efstathios Stamatatos and Mike Kestemont and Krzysztof Kredens and Piotr Pezik and
     Annina Heini and Janek Bevendorff and Martin Potthast and Benno Stein, Overview of the
     Authorship Verification Task at PAN 2022, in: CLEF 2022 Labs and Workshops, Notebook
     Papers, CEUR Workshop Proceedings, 2022.
[36] E. Loper, S. Bird, Nltk: The natural language toolkit, arXiv preprint cs/0205028 (2002).
[37] J. Hartley, Is time up for the flesch measure of reading ease?, Scientometrics 107 (2016)
     1523–1526.
[38] S. Karmakar, Y. Zhu, Visualizing multiple text readability indexes, in: 2010 International
     Conference on Education and Management Technology, IEEE, 2010, pp. 133–137.
[39] R. Senter, E. A. Smith, Automated readability index, Technical Report, Cincinnati Univ OH,
     1967.
[40] J. C. Brewer, Measuring text readability using reading level, in: Advanced methodologies
     and technologies in modern education delivery, IGI Global, 2019, pp. 93–103.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
     P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, the
     Journal of machine Learning research 12 (2011) 2825–2830.
[42] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture,
     in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The
     Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/
     978-3-030-22948-1\_5.