Style Change Detection with Feed-forward Neural Networks
Notebook for PAN at CLEF 2019

Chaoyuan Zuo, Yu Zhao, and Ritwik Banerjee
Department of Computer Science
Stony Brook University, Stony Brook, New York 11794, USA
{chzuo,yuzhao1,rbanerjee}@cs.stonybrook.edu

Abstract. The majority of previous authorship attribution studies focus on datasets of documents (or parts of documents) with labeled authorship. This scenario, however, is not applicable to documents written by more than one author. Detecting authorship switches within multi-author documents has proven to be a challenging problem in previous PAN tasks. PAN 2019 therefore organizes a simplified version of the style change task, which aims at identifying the number of authors in a given document. To this end, we present a system consisting of two modules: one for distinguishing single-author documents from multi-author documents, and the other for determining the exact number of authors in the multi-author documents.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

Authorship attribution is a difficult task with a long history [10]. With the advent of the web and social media, however, the nature of authorship is changing. On the one hand, collaborative writing is becoming more commonplace; on the other, the easy availability of vast amounts of source material makes plagiarism easy to carry out but hard to detect. There have been several settings of the authorship attribution task, with the most common scenario focusing on a closed set of documents and candidate authors, under the assumption that each document is written by a single author from among the candidates [4,18]. These models are not applicable, however, if a document is written by more than one author. The style breach detection task at PAN 2017 was designed to bridge this gap. The task was to find the border positions where authorship changes, but the results showed it to be an extremely challenging task [20]. A simplified version was presented the following year at PAN, where the task was to detect whether or not there was any stylistic change, i.e., whether or not the document had multiple authors. The results of this task were quite promising, attaining accuracy of up to 0.893. The current task [22] goes further and aims at detecting the exact number of authors in a document.

This paper reports on our participation in the PAN 2019 shared task on style change detection. We present a two-step pipeline to solve this problem. The first step distinguishes multi-author documents from documents with a single author, and the second identifies the exact number of authors within a multi-author document. The evaluation results of this task suggest that automated detection of writing style change remains a challenging problem.

2 Related Work

Earlier research in author profiling worked under the assumption that each article has exactly one author [18], where authorship attribution could largely be done on the basis of lexical, syntactic, and semantic features [2,14]. Documents may have multiple authors, however. For instance, in collaborative work, different sections may be written by different authors. The stylistic cues linking authors to text can be used to partition a document into stylistic clusters, thus providing insight into the number of authors of a document, and who those authors might be.
This is known as the author diarization problem, an important component of which is identifying the number of authors within a document. Prior research on the author diarization task at PAN-2016 [19] indicates that capturing stylometric changes is perhaps the most promising approach to author clustering within documents [9,16]. The text is split into sentences, and features including word frequency, selected part-of-speech (POS) tag counts, and average word length are calculated for each sentence. The K-means clustering algorithm is then applied to generate clusters based on the distances computed with these features [16].

The style breach detection task at PAN-2017 [20] was to find the exact position of authorship change within a document. Here, too, the three submissions extracted shallow stylometric features such as character n-grams, word frequencies, function words, and punctuation. Such features were explored at the level of sentences [6,12] as well as paragraphs [5]. The similarity between consecutive objects was then evaluated to detect a change of authorship.

Given the evident difficulty of author diarization, PAN-2018 presented a simpler binary classification task of identifying whether or not a document was written by a single author. Going beyond typical stylometric features [7,13], several other approaches were explored. The winning submission by Zlatkova et al. [23] split each document into three segments of equal length and used several classifiers to obtain the final results. Hosseinia and Mukherjee [3] relied solely on grammatical structures and lexical features, where each sentence was represented by a collection of features extracted from its parse tree. This representation was provided as input to two recurrent neural networks, in the original and reversed order of the sentences of the document, respectively. Multiple similarity measures were then computed from the difference between the two network representations, yielding the final binary classification. Another approach was taken by Schaetti [15], based on a character embedding layer used in a convolutional neural network.

3 Style Change Detection

This paper is the result of our participation in the 'Style Change Detection' task as part of PAN at CLEF 2019. The task is defined as follows:

Given a document, determine whether it contains style changes or not, i.e., if it was written by a single or multiple authors. If it is written by more than one author, determine the number of involved collaborators.

As is evident from the task definition, an individual author is implicitly associated with a particular style of writing. Earlier PAN tasks have shown that complete author diarization is a particularly difficult problem [19,20]. As such, this task sets the relatively modest goal of detecting the number of authors in each document, omitting author attribution. The task definition lends itself naturally to a two-step pipeline: (i) binary classification to determine whether a document has a single author or multiple authors, and (ii) if a document has multiple authors (as determined by the first step), identification of how many.
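For clarity, the overall control flow of this pipeline can be sketched as follows. This is a minimal illustration only: the two placeholder functions are hypothetical stand-ins for the modules described in Section 4.

```python
# Minimal sketch of the two-step pipeline. The two functions below are
# hypothetical placeholders for the modules described in Section 4.

def is_multi_author(document: str) -> bool:
    """Placeholder for the binary classifier (step i)."""
    raise NotImplementedError

def count_authors(document: str) -> int:
    """Placeholder for the author-counting ensemble (step ii); returns 2-5."""
    raise NotImplementedError

def predict_num_authors(document: str) -> int:
    """Predict the number of authors of a document."""
    if not is_multi_author(document):   # step (i): single vs. multiple authors
        return 1
    return count_authors(document)      # step (ii): determine how many (2-5)
```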
Next, we present the details of the data used in this work and the evaluation framework.

Data: Each document in the dataset for this task is a post (or a concatenation of multiple posts) from StackExchange¹. The training set comprises 2,546 documents, and a separate validation set of 1,272 documents is also provided. For each document, gold-standard labels are provided in the form of (i) the number of authors, and (ii) annotations marking who authored exactly which portions of the document. Exactly half of the training documents have a single author; the remaining half have a nearly uniform distribution over the number of authors, ranging from 2 to 5. This is shown in Table 1. The test set has the same size as the validation set, but is provided without any labels, i.e., there is no information about where within a document authorship switches, or how many authors there are for a given document.

Table 1. Distribution of the number of authors of the documents in the dataset

  # authors               1      2     3     4     5
  # docs in training      1273   325   313   328   307
  # docs in validation    636    179   152   160   145

¹ https://stackexchange.com/

Evaluation: In this task, the performance of a model is evaluated by combining (i) the accuracy, which measures the binary classification of single against multiple authorship, with (ii) the ordinal classification index (OCI) [1], which measures the error in predicting the number of authors for documents with multiple authors. The final rank r, which captures both, is the arithmetic mean of the accuracy and the inverted OCI:

  r = (accuracy + (1 − OCI)) / 2    (1)

To evaluate the submissions, participants are asked to submit their software for this task through a virtual machine in TIRA [11], a web platform that supports software submissions for shared tasks.
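For concreteness, the combined measure in Eq. (1) can be computed as in the short sketch below; the values in the comment are illustrative and not taken from the shared task.

```python
def final_rank(accuracy: float, oci: float) -> float:
    """Combined measure from Eq. (1): mean of the accuracy and the inverted OCI."""
    return 0.5 * (accuracy + (1.0 - oci))

# Illustrative values only: accuracy = 0.85 and OCI = 0.20
# yield r = (0.85 + 0.80) / 2 = 0.825.
assert abs(final_rank(0.85, 0.20) - 0.825) < 1e-9
```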
4 Methodology

Given the evaluation framework, which combines two independent measurements, we build a system that treats them separately by applying a two-step pipeline. The first step separates single-author from multi-author documents, and the second step identifies the number of authors in multi-author documents. In our approach, we divide the whole document into several segments and cluster the segments based on writing style. The objective is to have the number of clusters equal the number of authors in the document. Figure 1 illustrates this process².

Figure 1. Overview of our system

² The implementation is available at https://github.com/chzuo/PAN_2019.

Data preprocessing: About 1% of the documents in the dataset are in Spanish instead of English. Since it is difficult to extract features and build models for them separately within the scope of this task, we randomly assign the number of authors (ranging from 1 to 5) for these documents. For the rest, we filter out some frequent phrases that carry little or no linguistic relevance; these are typically URLs or technical specifications like "OSX 10.11.2".

Binary classification of documents (single vs. multiple authors): Here we describe how we identify whether or not a document is authored by a single person. We treat all documents with multiple authors as one category, and build a binary classifier to separate them from documents with a single author, implemented in Keras³. Each document is first tokenized and then converted into a term-document matrix in which each word is encoded by its term frequency-inverse document frequency (TF-IDF) score. We set the maximum number of words to keep at 40,000, so that only the 40,000 most common words in the dataset, by word frequency, are used. This classification is along the lines of the PAN-2018 task. Given the success of non-linear neural networks in that task (e.g., Zlatkova et al. [23]), we adopt a neural network as well. We use a simple feedforward neural network as the classifier, with the term-document matrix as its input. The network has one hidden layer with 128 units and a softmax output layer with two nodes for the binary classification. For activation, we use the sigmoid function, since it achieved better results than rectified linear units (ReLU) on validation; for optimization we use Adam [8]. To avoid overfitting, we use dropout [17] with the probability set to p = 0.5.

³ https://keras.io/
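A minimal sketch of this classifier is given below, assuming `train_docs` (a list of strings) and `train_labels` (0 for single-author, 1 for multi-author) are available. The hyperparameters named in the text (40,000 words, 128 hidden units, sigmoid activation, softmax output, Adam, dropout with p = 0.5) are used as described; the function name, the `tensorflow.keras` imports, and the epoch and batch-size settings are our illustrative choices, not reported in the paper.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

def train_binary_classifier(train_docs, train_labels):
    """train_docs: list of str; train_labels: 0 = single author, 1 = multiple."""
    tokenizer = Tokenizer(num_words=40000)  # keep only the 40,000 most frequent words
    tokenizer.fit_on_texts(train_docs)
    X = tokenizer.texts_to_matrix(train_docs, mode="tfidf")  # TF-IDF term-document matrix

    model = Sequential([
        Dense(128, activation="sigmoid", input_shape=(X.shape[1],)),  # one hidden layer
        Dropout(0.5),                                 # dropout with p = 0.5 [17]
        Dense(2, activation="softmax"),               # single- vs. multi-author
    ])
    model.compile(optimizer="adam",                   # Adam [8]
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # Epochs and batch size are illustrative; the paper does not report them.
    model.fit(X, to_categorical(np.asarray(train_labels), 2), epochs=10, batch_size=32)
    return tokenizer, model
```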
Detecting the number of authors in multi-author documents: Our key idea for this step is to divide the whole document into several parts and then cluster them into groups. Since some documents are poorly structured (e.g., upon tokenization with the NLTK tokenizer, some documents yield more than 200 sentences), we divide each document into paragraph-level segments instead of sentences. Further, upon studying the training and validation sets, we notice that almost all authorship changes happen near an empty line or a newline symbol: half of the switches happen after an empty line, while 80% of the writing by new authors starts after a newline symbol. Thus, to divide a document into segments, we first split on empty lines; if this yields more than 15 segments, we use these segments for the subsequent clustering step, and otherwise we split the text on newline symbols and use those segments instead. In general, splitting on newline symbols produces more segments than splitting on empty lines. Finally, segments with fewer than 20 tokens are discarded.

Feature Extraction: Feature extraction is then performed on each segment. Following the feature design of the winning submission of PAN-2018 [23], we use the following features:

– Token and POS distribution features. These include the distribution of token lengths in the segment, the distribution of POS tags in the segment, the number of sentences in the segment, and the count of each special character and punctuation mark in the segment. The special characters included are '#', '$', '%', '&', '*', '@', parentheses, and the forward and backward slashes.
– Contracted word forms. Writers may have their own preferences regarding contractions, such as "I'll" instead of "I will" or "I'm" instead of "I am". We therefore maintain two lists, one containing the original forms and the other their corresponding contracted forms, count the total number of occurrences of the words in each list, and use these two numbers as features.
– British/American English spelling. We use a list of spelling variations⁴, and encode this feature as the number of occurrences of variant spellings in a single segment.
– Function word frequencies. We combine the list of function words from NLTK with the list used by Zlatkova et al. [23], and encode the frequency of each function word as an additional feature.
– Readability. We use Textstat⁵ to obtain the Flesch reading ease score, SMOG grade, Flesch-Kincaid grade, Coleman-Liau index, automated readability index, Dale-Chall readability score, Linsear Write readability metric, Gunning-Fog index, and the number of difficult words in the text. We keep all of these readability measures as separate features.

Further, we also use the TF-IDF scores of the words in each segment, as calculated earlier for the creation of the term-document matrix.

Segment Clustering: To cluster the segments into groups, we build an ensemble of different algorithms over various combinations of features. It consists of three models: (i) K-means clustering, where each segment is represented as a bag-of-words vector with TF-IDF encoding; (ii) hierarchical clustering with all the remaining features described above; and (iii) a feedforward neural network classifier that estimates the similarity between segments, whose output is used to build a similarity matrix over all segments, to which we then apply K-means clustering with silhouette analysis to determine the number of clusters. The three models are weighted equally when determining the final estimate of the number of authors.

K-means clustering: For each document, all segments are represented as bag-of-words vectors with TF-IDF encoding. We use scikit-learn⁶ to implement the K-means clustering of these segments. Silhouette analysis is then used to select the best number of clusters (ranging from 2 to 5).

Hierarchical clustering: The hierarchical clustering algorithm is used to group the segments of each document, using all the extracted features except the TF-IDF encoding. We use SciPy⁷ for the implementation, with Ward's minimum variance method [21] for computing the distance between nodes. Once the distances have been calculated, the linkage function pairs objects that are close into binary clusters (clusters consisting of two objects), and then links the newly formed clusters into bigger clusters based on the distance information, until all the nodes are linked in a hierarchical tree. The distance information of each clustering step in this tree can be used as a cutoff argument to determine the number of clusters. Moreover, the inconsistency coefficient of each link in the hierarchical clustering tree can also be used to determine the number of clusters. It is calculated by comparing the height of a link with the average height of other links at the same level of the cluster hierarchy; the lower the coefficient, the smaller the difference between an object and those around it.

The number of clusters for each document, however, is hard to determine, as there is no method for setting a universal cutoff value for all documents. Thus, we build a feedforward neural network with one hidden layer consisting of 20 units. An input vector of dimension 50 is created for each document, using the distance and inconsistency values of the last 25 clustering steps in the linkage matrix. If a document has fewer than 25 segments, we pad the vector with zeros at the start to bring its length to 50. The output target of the NN classifier is the number of authors in the document, produced by a softmax layer with 4 output nodes. We use ReLU as the activation function and Adam for stochastic optimization, as well as dropout with p = 0.5. We train the network on all documents with two or more authors in the training and validation sets.

⁴ https://en.wikipedia.org/wiki/Wikipedia:List_of_spelling_variants (Accessed May 2019).
⁵ https://github.com/shivam5992/textstat
⁶ https://scikit-learn.org/stable/
⁷ https://www.scipy.org/
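The construction of this 50-dimensional input vector can be sketched as follows. The function name is hypothetical; the depth used for the inconsistency computation is SciPy's default, which the text does not specify; and the input `features` is assumed to be a document's per-segment stylometric feature matrix (all features except TF-IDF).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

def linkage_vector(features, k=25):
    """Build the 50-dim NN input from an (n_segments, n_features) matrix."""
    Z = linkage(features, method="ward")   # Ward's minimum variance method [21]
    distances = Z[:, 2]                    # merge distance of each clustering step
    incons = inconsistent(Z)[:, 3]         # inconsistency coefficient per link
                                           # (default depth; not specified in the text)

    def last_k_padded(values):
        """Last k values, zero-padded at the start if fewer are available."""
        tail = values[-k:]
        padded = np.zeros(k)
        padded[k - len(tail):] = tail
        return padded

    # Distances and inconsistency values of the last 25 merges -> length 50.
    return np.concatenate([last_k_padded(distances), last_k_padded(incons)])
```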
During the test phase, only the documents categorized as multi-author by the first step of our pipeline (i.e., the binary classification) are sent as input to the NN model for prediction.

K-means with similarity matrix: We create a dataset by splitting the documents into several parts using the authorship-switch information provided in the gold-standard labels, and pairing the resulting segments. Over 40k segment pairs are selected from the documents. For half of the pairs, the two segments are written by the same author; we treat these as one category and build a binary NN classifier to separate them from pairs written by different authors. We use all the features mentioned above except for the TF-IDF representation. The network has two hidden layers with 50 and 8 units, respectively, and we use ReLU as the activation function. For each pair of segments, we define the similarity of the pair as the probability that the two segments are written by the same author. Then, for a document containing n segments, we generate a similarity matrix M of size n × n, where the entry M(i, j) is the similarity of segments i and j, computed from the output of the NN classifier. Finally, we employ K-means clustering with silhouette analysis on this similarity matrix to determine the best number of authors.
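A sketch of this third model at prediction time is shown below. Here `segment_features` and `pair_model` are hypothetical stand-ins for the feature extractor and the trained pairwise network described above, and representing a pair as the concatenation of the two segments' feature vectors is our assumption; the text does not specify the pair encoding.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_num_authors(segments, segment_features, pair_model):
    """Estimate the number of authors (2-5) via the similarity matrix."""
    n = len(segments)
    feats = [segment_features(s) for s in segments]
    # Similarity matrix M: M[i, j] = P(segments i and j share an author).
    M = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Pair encoding (concatenation) is our assumption.
            x = np.concatenate([feats[i], feats[j]])[None, :]
            M[i, j] = M[j, i] = pair_model.predict(x)[0, 1]
    # Silhouette analysis over k = 2..5 selects the best number of clusters.
    scores = {}
    for k in range(2, 6):
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(M)
        scores[k] = silhouette_score(M, labels)
    return max(scores, key=scores.get)
```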
5 Results

Table 2. Results for the submissions from all participants in the style change detection task. We participated under the name zuo19. The best result for each metric is shown in bold. For OCI, the lower the value, the better the performance.

  Team             Accuracy   OCI     Rank    Runtime
  zuo19            0.604      0.808   0.397   00:25:13
  nath19           0.847      0.865   0.491   02:45:13
  Random Baseline  0.500      0.876   0.312   -

The results of our system are shown in Table 2. Two teams participated in this task, and our submission outperforms the other on the OCI evaluation measure, where a lower value indicates better performance. On the first metric, accuracy, our binary classifier reaches 0.604, roughly a 20% relative improvement over the random-guess baseline of 0.5. On the second metric, the OCI for detecting the number of authors, our system achieves 0.808, slightly better than the random guess (0.876). These results suggest that style change detection for multi-author documents remains a challenging task and requires significant further research to be adequately resolved.

6 Conclusion

We developed a two-step pipeline to detect the number of authors in a given document. The first step identifies whether or not the document is written by more than one author, using a feedforward neural network as a binary classifier. The second step, an ensemble of different clustering methods, identifies the number of authors in the multi-author documents. The task, however, remains quite challenging, and there is scope for significant improvement in this direction.

References

1. Cardoso, J.S., Sousa, R.: Measuring the performance of ordinal classification. International Journal of Pattern Recognition and Artificial Intelligence 25(08), 1173–1195 (2011)
2. Feng, S., Banerjee, R., Choi, Y.: Characterizing Stylistic Elements in Syntactic Structure. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 1522–1533. Association for Computational Linguistics (2012)
3. Hosseinia, M., Mukherjee, A.: A Parallel Hierarchical Attention Network for Style Change Detection—Notebook for PAN at CLEF 2018. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF 2018 Evaluation Labs and Workshop – Working Notes Papers, 10-14 September, Avignon, France. CEUR-WS.org (2018)
4. Iqbal, F., Binsalleeh, H., Fung, B.C., Debbabi, M.: Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation 7(1–2), 56–64 (2010)
5. Karaś, D., Śpiewak, M., Sobecki, P.: OPI-JSA at CLEF 2017: Author Clustering and Style Breach Detection—Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11-14 September, Dublin, Ireland. CEUR-WS.org (Sep 2017)
6. Khan, J.: Style Breach Detection: An Unsupervised Detection Model—Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11-14 September, Dublin, Ireland. CEUR-WS.org (Sep 2017)
7. Khan, J.A.: A Model for Style Breach Detection at a Glance—Notebook for PAN at CLEF 2018. In: Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018 (2018)
8. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
9. Kuznetsov, M.P., Motrenko, A., Kuznetsova, R., Strijov, V.V.: Methods for intrinsic plagiarism detection and author diarization. In: Working Notes of CLEF 2016 – Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September, 2016. pp. 912–919 (2016)
10. Mendenhall, T.C.: The characteristic curves of composition. Science 9(214), 237–249 (1887)
11. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World – Lessons Learned from 20 Years of CLEF. Springer (2019)
12. Safin, K., Kuznetsova, R.: Style Breach Detection with Neural Sentence Embeddings—Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11-14 September, Dublin, Ireland. CEUR-WS.org (Sep 2017)
13. Safin, K., Ogaltsov, A.: Detecting a Change of Style Using Text Statistics—Notebook for PAN at CLEF 2018. In: Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. CEUR-WS.org (2018)
14. Sapkota, U., Bethard, S., Montes, M., Solorio, T.: Not All Character N-grams Are Created Equal: A Study in Authorship Attribution. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 93–102 (2015)
15. Schaetti, N.: Character-based Convolutional Neural Network and ResNet18 for Twitter Author Profiling—Notebook for PAN at CLEF 2018. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF 2018 Evaluation Labs and Workshop – Working Notes Papers, 10-14 September, Avignon, France. CEUR-WS.org (2018)
16. Sittar, A., Iqbal, H., Nawab, R.: Author Diarization Using Cluster-Distance Approach—Notebook for PAN at CLEF 2016. In: Balog, K., Cappellato, L., Ferro, N., Macdonald, C. (eds.) CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers, 5-8 September, Évora, Portugal. CEUR-WS.org (Sep 2016)
17. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
18. Stamatatos, E.: A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009)
19. Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by Authorship Within and Across Documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. pp. 691–715. CEUR-WS.org (2016)
20. Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering. In: Working Notes Papers of the CLEF 2017 Evaluation Labs. pp. 1–22 (2017)
21. Ward Jr., J.H.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301), 236–244 (1963)
22. Zangerle, E., Tschuggnall, M., Specht, G., Potthast, M., Stein, B.: Overview of the Style Change Detection Task at PAN 2019. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
23. Zlatkova, D., Kopev, D., Mitov, K., Atanasov, A., Hardalov, M., Koychev, I., Nakov, P.: An Ensemble-Rich Multi-Aspect Approach for Robust Style Change Detection—Notebook for PAN at CLEF 2018. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF 2018 Evaluation Labs and Workshop – Working Notes Papers, 10-14 September, Avignon, France. CEUR-WS.org (2018)