=Paper=
{{Paper
|id=Vol-2696/paper_232
|storemode=property
|title=Style Change Detection Using BERT
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_232.pdf
|volume=Vol-2696
|authors=Aarish Iyer,Soroush Vosoughi
|dblpUrl=https://dblp.org/rec/conf/clef/IyerV20
}}
==Style Change Detection Using BERT==
Style Change Detection Using BERT
Notebook for PAN at CLEF 2020

Aarish Iyer and Soroush Vosoughi
Department of Computer Science, Dartmouth College, Hanover, NH 03755
aarish.ravikumar.iyer.gr@dartmouth.edu, soroush.vosoughi@dartmouth.edu

Abstract. The Style Change Detection task is very important in the area of authorship profiling, having one of its main applications in plagiarism detection. Specifically, the goal of the task is to detect where (if any) stylistic changes happen in a document, which can be used to estimate the number of authors of a given document. In this paper, we present a method for Style Change Detection. We use Google AI's open-source BERT pretrained bidirectional models to tokenize and generate embeddings for the sentences in each document in our dataset and use those to train a Random Forest classifier. We achieved an F1 score of 0.86 for detecting style changes and an F1 score of 0.64 for detecting multi-author documents on the test set, placing us at the top of the competition for both tasks. The code for this project has been made open source so that it can be used for further research: https://github.com/aarish407/Style-Change-Detection-Using-BERT

Keywords: Style Change Detection, BERT, Transformer-based Models

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Detecting the number of authors involved in writing a document by analyzing the writing style is an important task that has been a focus of research for centuries. This area of research has traditionally been called stylometry and is defined by the Oxford dictionary as "the statistical analysis of variations in literary style between one writer or genre and another". It is a centuries-old practice, dating back to the early Renaissance. Its applications include plagiarism detection and forensics (e.g., Vosoughi et al. [12] use computational stylometry techniques to link social media accounts operated by the same user). The main principles of stylometry were compiled and laid out by the philosopher Wincenty Lutosławski in 1890 in his work "Principes de stylométrie" [6].

Unsurprisingly, style understanding has become one of the core areas of research in natural language understanding, leading to the proposal of various computational methods for understanding and detecting style in written text (e.g., see [7] for a review of the field of authorship attribution). Accordingly, this has become one of the staple tasks at PAN. The work presented in this paper was developed as a solution to the Style Change Detection task of the PAN @ CLEF 2020 competition [2,9]. The task is described as follows: Given a document, determine whether it has been written by more than one author (task 1). Furthermore, for a multi-author document, identify the positions in the document where the style changes occur (task 2). It is assumed that each paragraph is written by only one author, so a style change can only occur between two paragraphs. All documents are in English, and each document is written by one to three authors and can contain from zero to ten style changes. This is more complicated than recent editions of the Style Change Detection task, which asked either for binary detection of single-/multi-authored documents [10,11] or for the actual number of authors in a document [13].
In this paper, we present a solution for this task that uses a Random Forest classifier in conjunction with embeddings generated by BERT, an open-source large-scale pretrained language model developed by Google AI. The remainder of the paper is organized as follows. First, we introduce the dataset; next, we describe our approach, including all data cleaning and pre-processing steps. We then describe our experiments and results. Finally, we wrap up by discussing future work and summarizing our findings.

2 Dataset

Two types of datasets [3] were provided for this task: a narrow dataset and a wide dataset. The narrow dataset comprised documents from similar domains, while the wide dataset had no restriction on its contents. For each document in each dataset, there was a corresponding truth file containing a label for task 1 (whether the document was written by more than one author) and a changes array for task 2 (a list of 1s and 0s indicating whether a style change occurred between consecutive paragraphs). The F1 metric was used to calculate the score for task 1, and the micro-averaged F1 metric was used to calculate the score for task 2. Both metrics will be referred to as accuracy hereafter. The results on the two datasets were evaluated independently and then averaged to produce a final score for each task. The final score was calculated on the test dataset. Both the narrow and wide datasets were mined from the Stack family of websites. Here is some information about the datasets:

1. Table 1 shows the number of documents in the narrow and wide train and validation sets. For each document, there was a corresponding truth file.
2. Table 2 shows statistics of the number of sentences and paragraphs per document for the narrow and wide train datasets.
3. The truth files contained the following data: number of authors, order of authors, source site, and the labels for tasks 1 and 2. For our solution, we did not make use of the first three keys.
4. Figures 1a and 1b show the distribution of the number of style changes for the narrow and wide train datasets.
5. All datasets were balanced for task 1, i.e., detecting if a document is written by more than one author.

[Figure 1: Distribution of the number of style changes in the (a) wide and (b) narrow datasets.]

Table 1: Number of documents in each dataset.
              Narrow   Wide
Train          3,442   8,138
Validation     1,722   4,078

Table 2: Statistics of the number of sentences and paragraphs per document for the two train datasets.
          Sentences          Paragraphs
          Narrow    Wide     Narrow   Wide
Min           18      17         3      2
Max          276     375        82     74
Mean      110.19  106.98     25.28  18.03
Median       108     103        24     16

3 Approach

The general approach to both tasks was to generate embeddings of the words in each document at the sentence level and then use these embeddings for classification. This is highlighted in Fig. 2.

3.1 Paragraph Split

The first step was to split each document into paragraphs, since paragraphs are guaranteed to be atomic (i.e., only a single author has written any given paragraph). This is important as the second task involves identifying style changes between consecutive paragraphs.

[Figure 2: Our approach for generating feature vectors for the two tasks using pretrained BERT.]

3.2 Sentence Split

At first thought, the idea of splitting the paragraphs into sentences seems fairly straightforward: split on characters such as '.', '?' and '!'. However, such a naive split generates many fragments that were not sentences in the original text. For example, the prefixes 'Dr.' or 'Mr.' would end up as their own sentences. Thus, it was important to ensure that the sentences were split in a manner that is robust to variations in the usage of the aforementioned punctuation marks. A regular expression approach was used, in which each occurrence of '.' that is not meant as a sentence delimiter is identified and replaced with a special token (a sketch of such a splitter is given after the list). The following structures were identified and replaced accordingly:

– Prefixes (Mr., Mrs., Dr., Ms., Prof., Capt., Cpt., Lt., Mt.)
– Website domains (.com, .net, .org, .io, .gov, .me, .edu)
– Acronyms (U.S.A., etc.)
– Suffixes (Inc., Ltd., Jr., Sr., Co.)
– Abbreviations (e.g., i.e., ...)
– Any digits separated by a period

The above approach does not account for the different usage of '.', ',', '?' and '!' inside code, which is likely to come up in a dataset mined from the Stack family of websites. Handling this case could be added in the future to further improve the solution.
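The paper does not list the exact expressions used, so the following is a minimal Python sketch of the idea under our own assumptions: the patterns and the <prd> placeholder token are illustrative, not the authors' implementation. Non-terminal periods are masked, the text is split on sentence-ending punctuation, and the periods are then restored.

```python
import re

# Hypothetical placeholder token; any string that cannot occur in the text works.
PRD = "<prd>"

# Illustrative (not exhaustive) patterns for periods that do not end a sentence.
PROTECT_PATTERNS = [
    r"\b(Mr|Mrs|Dr|Ms|Prof|Capt|Cpt|Lt|Mt|Inc|Ltd|Jr|Sr|Co)\.",  # prefixes and suffixes
    r"\.(com|net|org|io|gov|me|edu)\b",                           # website domains
    r"\b(e\.g|i\.e|etc)\.",                                       # abbreviations
    r"(?<=\d)\.(?=\d)",                                           # digits separated by a period
    r"\b(?:[A-Za-z]\.){2,}",                                      # acronyms such as U.S.A.
]

def split_into_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences, protecting non-terminal periods."""
    text = paragraph
    # Replace every '.' inside a protected structure with the placeholder.
    for pattern in PROTECT_PATTERNS:
        text = re.sub(pattern, lambda m: m.group(0).replace(".", PRD), text)
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Restore the protected periods and drop empty fragments.
    return [s.replace(PRD, ".").strip() for s in sentences if s.strip()]

if __name__ == "__main__":
    print(split_into_sentences("Dr. Smith met Mr. Jones at example.com. They paid $3.50 each!"))
```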
3.3 Embeddings

Before generating the embeddings, each sentence first had to be tokenized, which was done using Google AI's BERT [1] tokenizer (the type of tokenizer depends on the BERT model used, which is described below). Note that BERT can only process sentences of length <= 512 tokens. In order to generate embeddings for the tokenized sentences, Google AI's BERT [1] pretrained deep bidirectional models were used. BERT offers various models, and for this task, the BERT Base Cased model was used (layers=12, hidden size=768, self-attention heads=12, total parameters=110M). The authors of BERT recommend that the BERT Base Uncased model be used for most situations, unless it is certain that having a case-sensitive model would aid the task. We observed a 0.94% increase in accuracy on the first task for the wide dataset when using the Cased model instead of the Uncased model, so the Cased model was used for the other tasks as well. The BERT Large model (layers=24, hidden size=1024, self-attention heads=16, total parameters=340M) was not explored for this work due to its computationally intensive nature. Furthermore, the BERT Large model in most cases only reported a 1-2% increase in accuracy over the BERT Base model on other NLP benchmarks [1]. Although BERT is designed to capture context rather than style, we found that the information captured by these embeddings works well for the style change detection task as well.

3.4 Processing of the Dataset

Although BERT was used to generate the embeddings, they had to be combined in a specific way to fit both tasks. The following method was used (a sketch of this per-sentence pipeline is given after the list):

1. Each individual sentence was processed by the BERT tokenizer and truncated to 512 tokens if needed.
2. The tokenized sentence was then processed by BERT, which generated embeddings for each layer. This produced a tensor of dimensions 12 × l × 768, where l is the length of the sentence.
3. The authors of BERT found that the best results were obtained when the embeddings of the last 4 layers were combined, either by summing them, producing a tensor of dimensions l × 768, or by concatenating them, producing a tensor of dimensions l × 3072. We chose to sum the embeddings of the last 4 layers in order to prevent the dimensions of the tensor from becoming too big. Thus the final dimensions of the tensor at this step are l × 768.
4. The first dimension of the current tensor is the length of the sentence and can thus change from sentence to sentence. In order to remove this variability, the embeddings of the sentence are summed over the first dimension, producing a final vector of length 768.

Summing the embeddings over the first dimension, as opposed to averaging them, can lead to large differences in embedding values between long and short sentences. However, the length of a sentence is an important factor in detecting style change, and thus it is important to capture that information.
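The authoritative implementation is the released code [4]; the following is a minimal sketch of the per-sentence pipeline described above, written against the Hugging Face transformers library (an assumption on our part; the original work used Google AI's BERT release). It tokenizes a sentence with the cased base model, truncates to 512 tokens, sums the hidden states of the last 4 layers, and then sums over the token dimension to obtain a 768-dimensional sentence vector.

```python
import torch
from transformers import BertModel, BertTokenizer

# BERT Base Cased, as used in the paper.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

def sentence_vector(sentence: str) -> torch.Tensor:
    """Return a 768-dimensional vector for one sentence."""
    # Step 1: tokenize and truncate to BERT's 512-token limit.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
    # Step 2: run BERT and collect the hidden states of every layer.
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states is a tuple (embedding layer + 12 encoder layers), each of shape (1, l, 768).
    hidden_states = outputs.hidden_states
    # Step 3: sum the last 4 layers -> tensor of shape (l, 768).
    summed_layers = torch.stack(hidden_states[-4:]).sum(dim=0).squeeze(0)
    # Step 4: sum over the token dimension -> vector of length 768.
    return summed_layers.sum(dim=0)

if __name__ == "__main__":
    vec = sentence_vector("Style change detection is an authorship analysis task.")
    print(vec.shape)  # torch.Size([768])
```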
At this stage, our approach to combining the embeddings diverges for the two tasks.

Detecting style change at the document level (Task 1). To produce a final tensor for the whole document, all the sentence vectors of the document were averaged. At the document level, the following approaches were tested:

1. Generate the sentence vectors by summing the embeddings over the length dimension + sum all the sentence vectors of the document to produce a document-level tensor.
2. Generate the sentence vectors by summing the embeddings over the length dimension + average all the sentence vectors of the document to produce a document-level tensor.
3. Generate the sentence vectors by averaging the embeddings over the length dimension + sum all the sentence vectors of the document to produce a document-level tensor.
4. Generate the sentence vectors by averaging the embeddings over the length dimension + average all the sentence vectors of the document to produce a document-level tensor.

The second approach produced the best results for the style change detection task at the document level. We have already described why summing the embeddings over the length dimension works better than averaging them. When producing the document-level tensor, however, averaging all the sentence vectors seems to work better. This can be attributed to the fact that the length of the document does not really factor into determining whether or not a style change occurred in the document, as all style changes occur between paragraphs. It makes no difference whether the document is relatively short or long, as long as it has at least two paragraphs, so there is no need to capture this information. It must be noted that the difference in accuracy between all four approaches was within 2% on the wide validation dataset.

Detecting style change at the paragraph level (Task 2). Since style change had to be determined between paragraphs, the paragraph-level data points were calculated by averaging the embeddings of two consecutive paragraphs. Thus, each data point was generated by adding the embeddings of all sentences in both paragraphs and then dividing by the sum of the two paragraph lengths (in sentences). The labels were the entries in the changes array of the truth file. It is important to note that the labels of the paragraph-level data points are imbalanced: a document with no style change will have all paragraph-level labels as 0, while a document with a style change may still have some consecutive paragraphs written by the same author, so the labels for those data points would also be 0.

After this step, we essentially have two datasets, one with data points at the document level and the other with data points at the paragraph level.
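As a minimal illustration of the two feature constructions just described: this sketch assumes the hypothetical sentence_vector helper from the previous snippet and represents each document as a list of paragraphs, each a list of sentence strings; the function names are ours, not the paper's.

```python
import numpy as np

def document_vector(paragraphs: list[list[str]]) -> np.ndarray:
    """Task 1 feature: average of all sentence vectors in the document."""
    sent_vecs = [sentence_vector(s).numpy() for para in paragraphs for s in para]
    return np.mean(sent_vecs, axis=0)

def paragraph_pair_vectors(paragraphs: list[list[str]]) -> list[np.ndarray]:
    """Task 2 features: one vector per pair of consecutive paragraphs, i.e. the
    sum of all sentence vectors in both paragraphs divided by the total number
    of sentences in the pair."""
    para_sums = [np.sum([sentence_vector(s).numpy() for s in para], axis=0) for para in paragraphs]
    para_lens = [len(para) for para in paragraphs]
    pairs = []
    for i in range(len(paragraphs) - 1):
        pooled = (para_sums[i] + para_sums[i + 1]) / (para_lens[i] + para_lens[i + 1])
        pairs.append(pooled)
    return pairs
```

The list returned by paragraph_pair_vectors has one entry per consecutive-paragraph pair, matching the length of the changes array in the truth files.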
3.5 Classifier

Using Python's off-the-shelf machine learning library scikit-learn [8], various supervised models were tested for binary classification, such as Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, and Naive Bayes (Multinomial and Gaussian). The Random Forest classifier produced the best results by far for both tasks and on both datasets. Furthermore, once the Random Forest classifier was decided upon, a grid search over its hyperparameters was performed (for both tasks and both datasets), which increased the accuracy by almost 3%. However, the number of estimators for the grid-searched classifier was significantly larger than the default number of estimators, which in turn increased the time the classifier took to generate predictions on the validation set. Finally, we had 4 classifiers:

1. Document-level classifier for the wide dataset.
2. Document-level classifier for the narrow dataset.
3. Paragraph-level classifier for the wide dataset.
4. Paragraph-level classifier for the narrow dataset.

The final set of hyperparameters for each classifier is given in Table 3.

Table 3: Hyperparameters for all four classifiers.
                        Document Wide   Document Narrow   Paragraph Wide   Paragraph Narrow
Criterion               entropy         gini              gini             gini
Min Samples Per Leaf    1               1                 1                1
Min Samples Per Split   2               2                 2                2
Estimators              400             1800              400              250
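A minimal scikit-learn sketch of this model selection step is shown below. The candidate models, grid values, and dummy data are illustrative assumptions; only the Random Forest settings reported in Table 3 come from the paper. Multinomial Naive Bayes is omitted because it requires non-negative features, which the BERT vectors are not without rescaling.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Random stand-ins for the real 768-dimensional feature vectors and binary labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, size=200)
X_val, y_val = rng.normal(size=(50, 768)), rng.integers(0, 2, size=50)

# Compare off-the-shelf classifiers on the validation split.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "svm": SVC(),
    "gaussian nb": GaussianNB(),
}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_val, y_val))

# Grid search over Random Forest hyperparameters (the grid values are illustrative).
grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [250, 400, 1000, 1800],
}
search = GridSearchCV(RandomForestClassifier(), grid, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)  # Table 3 reports entropy/1/2/400 for the document-level wide classifier
```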
4 Results

Here we show the performance of our model on the validation and test sets. The validation set was made available during the development of the model, while the test results show the performance of our model in the competition. Table 4 shows the performance of our model on the validation set, and Table 5 shows its performance on the test set. Note that for the test set, we only have cumulative information over the two datasets for the two tasks.

Table 4: F1 scores calculated on the validation set for Document-level (task 1) and Paragraph-level (task 2) predictions.
                  Narrow    Wide
Document-level    0.7661    0.7575
Paragraph-level   0.8805    0.8306

Table 5: Average F1 scores calculated on the test set for Document-level (task 1) and Paragraph-level (task 2) predictions.
                  Average for both datasets
Document-level    0.6401
Paragraph-level   0.8566

As can be observed, there is a discrepancy between the results reported on the test set and the validation set. This is because of the difference between the environments in which the two evaluations were carried out. During the development of this project, the BERT model was run on a GPU, which greatly increased the speed of computation. However, since the virtual machine offered by TIRA did not provide a GPU, all computations were significantly slower. We therefore decided to clip the computations after a certain time in order to prevent the session from crashing, which would have prevented us from submitting our solution.

5 Other Approaches & Future Work

During the course of this project, a number of different approaches were tried. For those approaches, a separate dataset was generated, in which each data point was a pair of sentences drawn from the same or consecutive paragraphs of a document. If the two sentences came from the same paragraph, the corresponding label was 0, while two sentences from different paragraphs received a label of 1 if there was a style change between the two paragraphs. This approach is also susceptible to producing an imbalanced dataset, and hence the dataset was balanced before moving on with the classification task. The dataset produced had nearly 3 million data points using just the wide dataset. A couple of the approaches are described below.

Fine-Tuning BERT. In this method, the goal was to fine-tune BERT using the training data so that it could produce results on par with or better than the submitted solution. However, it was empirically observed that accuracy plateaued after a point, and this direction was thus not explored further.

Convolutional Neural Network. This method is inspired by prior work on sentence classification using convolutional neural networks [5]. In this method, each data point had dimensions (l1 + l2) × 768, where l1 and l2 are the lengths of the two sentences. Note that the data points were allowed to have variable lengths (as long as their individual lengths were <= 512). The tensors were then passed through a set of parallel convolutional filters with kernel sizes of (2, 768), (3, 768), ..., (5, 768). These were meant to capture n-gram stylistic features (i.e., bigrams, trigrams, etc.). The outputs of all convolutional filters were globally pooled and then combined to form a vector of length n, where n is the number of convolutional filters. At the end, a fully connected layer is used to generate the final label. Due to a lack of time, this approach could not be explored fully. However, the authors of this paper believe that there is merit to this approach and intend to study it further in the future.
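A minimal PyTorch sketch of this architecture, under our own assumptions about details the paper leaves open (one output channel per filter size so the pooled vector has length n, global max pooling, and a sigmoid output for the binary label):

```python
import torch
import torch.nn as nn

class StyleChangeCNN(nn.Module):
    """Kim-style CNN over a (l1 + l2) x 768 sentence-pair embedding matrix."""

    def __init__(self, embed_dim: int = 768, kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        # One parallel convolution per n-gram size; each filter spans the full embedding width.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, 1, kernel_size=(k, embed_dim)) for k in kernel_sizes]
        )
        # Combine the pooled filter outputs (one value per filter) into a single logit.
        self.fc = nn.Linear(len(kernel_sizes), 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, 768) -> add a channel dimension for Conv2d.
        x = x.unsqueeze(1)                                                       # (batch, 1, length, 768)
        pooled = [conv(x).squeeze(3).max(dim=2).values for conv in self.convs]   # each (batch, 1)
        features = torch.cat(pooled, dim=1)                                      # (batch, n_filters)
        return torch.sigmoid(self.fc(features)).squeeze(1)                       # style-change probability

if __name__ == "__main__":
    model = StyleChangeCNN()
    pair = torch.randn(4, 37, 768)   # a batch of 4 sentence pairs, 37 tokens total each
    print(model(pair).shape)         # torch.Size([4])
```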
6 Conclusion

In this paper, we have shown how BERT can be used for the Style Change Detection task. Although BERT is designed to capture context, this project shows that the information captured by its embeddings can be used for other NLP tasks as well. We intend to work on this project further by expanding on the other methods mentioned in Section 5. The code for the project can be found at [4].

References

1. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
2. Zangerle, E., Mayerl, M., Specht, G., Potthast, M., Stein, B.: Overview of the Style Change Detection Task at PAN 2020. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020)
3. Zangerle, E., Mayerl, M., Tschuggnall, M., Specht, G., Potthast, M., Stein, B.: (2020), https://zenodo.org/record/3660984#.XxLhEihKhPY
4. Iyer, A., Vosoughi, S.: (2020), https://github.com/aarish407/Style-Change-Detection-Using-BERT
5. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1746–1751. Association for Computational Linguistics, Doha, Qatar (Oct 2014). https://doi.org/10.3115/v1/D14-1181, https://www.aclweb.org/anthology/D14-1181
6. Lutosławski, W.: Principes de stylométrie (1890)
7. Malyutov, M.B.: Authorship attribution of texts: A review. In: General Theory of Information Transfer and Combinatorics, pp. 362–380. Springer (2006)
8. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
9. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. Springer (Sep 2019)
10. Stamatatos, E., Rangel, F., Tschuggnall, M., Stein, B., Kestemont, M., Rosso, P., Potthast, M.: Overview of PAN 2018: Author Identification, Author Profiling, and Author Obfuscation. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., SanJuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 2018). Springer, Berlin Heidelberg New York (Sep 2018)
11. Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN 2017: Style Breach Detection and Author Clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, vol. 1866. CEUR-WS.org (Sep 2017), http://ceur-ws.org/Vol-1866/
12. Vosoughi, S., Zhou, H., Roy, D.: Digital stylometry: Linking profiles across social networks. In: International Conference on Social Informatics. pp. 164–177. Springer (2015)
13. Zangerle, E., Tschuggnall, M., Specht, G., Potthast, M., Stein, B.: Overview of the Style Change Detection Task at PAN 2019. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019), http://ceur-ws.org/Vol-2380/