A Parallel Hierarchical Attention Network for Style Change Detection
Notebook for PAN at CLEF 2018
Marjan Hosseinia and Arjun Mukherjee
University of Houston
mhosseinia@uh.edu, arjun@cs.uh.edu

Abstract. We propose a model for the new problem of style change detection. Given a document, we verify whether it contains at least one style change; in other words, the task is to determine whether the document is written by one author or by multiple authors. The model is composed of two parallel attention networks. Unlike conventional recurrent neural networks that use character or word sequences to learn the underlying language model of documents, our model focuses on the hierarchical structure of the language and observes the parse tree features of each sentence obtained from a pre-trained statistical parser. In addition, our model is independent of style change positions, although they are given during the training phase; this makes the approach more applicable to real-world problems where such information is not available. PAN 2018 results show that it achieves 82% accuracy and ranks second.

1 Introduction

Given a document, the problem of style change detection is to decide whether the writing style of the document changes, i.e., whether it is written by multiple authors or by a single author. This is a relatively new problem in the area of mining the writing style of text documents and is closely related to authorship verification and attribution, where the task is to decide who wrote a document by comparing how it is written. Similarly, we study the writing style of a document by focusing on its linguistic aspects. Writing style is expressed through the choice of words and the (grammatical) structure of sentences. According to the literature, there are two common NLP approaches to document representation. First, the bag-of-words model, which is independent of the word order of a sentence and captures word selection [5]. Second, sequence models such as word embeddings, which are sensitive to the order of words in a sentence [6]. Although the English language is recognized to have a latent hierarchical, tree-based structure [1], neither approach exploits the hierarchical structure of a sentence for its representation.

For style change detection, we use this latent structure of the English language to represent a sentence. We apply a context-free grammar parser to predict the tree-based structure (parse tree) of a sentence. Then, we extract the ordered features of the parse tree and feed them to a parallel hierarchical attention network that finds the determinative parts of a sentence and of a document. Finally, a fusion layer compares a document with its reversed version to predict the class label. The results show that our model achieves promising results on the PAN 2018 dataset. The model is described in detail in Section 2. Section 3 provides the results and discussion.

Figure 1. Parallel hierarchical attention network architecture

2 Parallel Hierarchical Attention Network

Our model is inspired by two successful neural network architectures for authorship verification and document classification. Similar to [2], it has a parallel structure: two columns of recurrent neural networks, one fusion layer, and one softmax layer. However, each column is not a simple RNN but a hierarchical attention network with two levels of attention, as proposed in [8].
To be more specific, each column includes a Parse Tree Feature (PTF) embedding, a PTF-level LSTM, a PTF-level attention, a PT sentence-level LSTM, and a PT sentence-level attention layer. The key difference is that the LSTM input is not the conventional character/word sequence but the sequence of Parse Tree Features (PTFs) extracted from the tree-based structure of a sentence. The model architecture is shown in Figure 1. Each part is described in the following sections.

Figure 2. Parse tree for "Computers defeated chess players in 1980s."

2.1 PTF Embedding

We use the Stanford PCFG parser^1 to retrieve the hierarchical structure of a sentence [4]. Figure 2 shows the parse tree of the sentence "Computers defeated chess players in 1980s.". To use this tree structure in the LSTMs of our model, we need to preserve the word sequence of the sentence. We define the Parse Tree Feature (PTF) of each word in a sentence as the path from the root of its parse tree to the corresponding leaf (word). The path is the sequence of all rules of the form parent → child_1 ... child_n from the root to a leaf (word). Here, punctuation marks are treated as word unigrams. For example, the PTF of "computers" consists of three rules, [S → NP VP ., NP → NNS, NNS → computers], and the PTF of "chess", with four rules, is [S → NP VP ., VP → NP, NP → NN, NN → chess] (Figure 2). We ignore the rule ROOT → S because it is shared among all PTFs. Accordingly, the Parse Tree (PT) representation of a sentence is the sequence of PTFs of all its word unigrams. For the above example, s has seven PTFs, where the PTF of "computers" is the first feature and the PTF of "." is the last, i.e., s = [[S → NP VP ., NP → NNS, NNS → computers], ..., [S → NP VP ., . → .]].

Let d^pt = [s_i | i ∈ [0, n]] be the Parse Tree (PT) representation of a document d with n sentences, where s = [PTF_j | j ∈ [0, l_s]] is the PT representation of a sentence s of length l_s. As mentioned earlier, we preserve the order of sentences and words according to their occurrence in the document. We define d^pt_r to be the reverse PT representation of d^pt. To build d^pt_r, we first reverse the order of the sentences of d^pt and then the order of the PTFs within each sentence. In other words, the last PTF of the last sentence of d^pt is the first PTF of d^pt_r. So, d^pt_r = [s_{r,i} | i ∈ [n, 0]], where s_r = [PTF_j | j ∈ [l_s, 0]] is the reverse of s. In our example, s_r = [[S → NP VP ., . → .], ..., [S → NP VP ., NP → NNS, NNS → computers]].

^1 https://stanfordnlp.github.io/CoreNLP/

2.2 LSTMs and Attention Mechanism

The PT document representation and its reverse, (d^pt, d^pt_r), are the inputs of our model, one for each of the two columns. Later, we explain why the reverse PT version of a document is one of the two inputs. Each column is a hierarchical attention network as proposed in [8]: it has two layers of LSTM, each followed by an attention layer. However, the input of the network is no longer word unigrams but PTFs. Moreover, we use a unidirectional LSTM instead of a bidirectional one, since feeding the reverse version of a document to the second column plays a role similar to the backward pass of a bidirectional LSTM. Each column thus includes a Parse Tree Feature (PTF) embedding, a PTF-level LSTM, a PTF-level attention, a PT sentence-level LSTM, and a PT sentence-level attention layer.

Parse Tree Feature Embedding. For a sentence s = {PTF_1, ..., PTF_{l_s}} of l_s PTFs, we embed the PTFs using a PTF embedding matrix W that is initialized randomly and learned during the training phase. Here, x_i = W · PTF_i is the PTF embedding of the i-th feature.
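To make the PTF construction of Section 2.1 concrete, the following sketch shows one way to extract PTFs from a bracketed constituency parse using NLTK's Tree class. It is only an illustration under the assumption that the parser output is available as a bracketed string; the function names and the one-sentence example document are ours, not taken from the paper's implementation.

from nltk import Tree

def sentence_ptfs(parse_str):
    # Return one PTF (list of rules, root to leaf) per word/punctuation token.
    tree = Tree.fromstring(parse_str)
    if tree.label() == "ROOT":                    # drop the shared ROOT -> S rule
        tree = tree[0]
    ptfs = []
    for leaf_idx in range(len(tree.leaves())):
        path = tree.leaf_treeposition(leaf_idx)   # indices from the root to this leaf
        rules, node = [], tree
        for step in path[:-1]:                    # stop at the pre-terminal
            rhs = " ".join(c.label() if isinstance(c, Tree) else str(c) for c in node)
            rules.append(f"{node.label()} -> {rhs}")
            node = node[step]
        rules.append(f"{node.label()} -> {node[0]}")  # pre-terminal -> word
        ptfs.append(rules)
    return ptfs

example = ("(ROOT (S (NP (NNS Computers)) (VP (VBD defeated) "
           "(NP (NN chess) (NNS players)) (PP (IN in) (NP (NNS 1980s)))) (. .)))")
doc_pt = [sentence_ptfs(example)]                 # d^pt for a one-sentence document
doc_pt_rev = [s[::-1] for s in reversed(doc_pt)]  # d^pt_r: reverse sentences, then PTFs
print(doc_pt[0][0])   # ['S -> NP VP .', 'NP -> NNS', 'NNS -> Computers']

The same reversal applied per sentence and over the sentence order yields the reverse PT representation described above.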
PTF-level LSTM and PTF-level Attention. We use LSTMs, which are known to capture long-term dependencies well: their forget and update gates control the flow of information effectively. Here, h_i is the LSTM hidden state after the i-th feature of sentence s. Although we believe that PTFs express the writing style(s) of a given document, some are more determinative than others for predicting the class label. To weight the importance of each PTF, we apply the attention mechanism proposed in [8] to the hidden state of the PTF-level LSTM at step i. This mechanism computes a weight α_i using a multi-layer perceptron and a softmax function. Here, W_pt and b are a weight matrix and a bias vector, respectively, u_pt is a context vector that is initialized randomly and adjusted during training, and s^pt_a is the weighted sum of the hidden states of sentence s:

u_i = tanh(W_pt h_i + b)                                  (1)
α_i = exp(u_i^T u_pt) / Σ_i exp(u_i^T u_pt)               (2)
s^pt_a = Σ_i α_i h_i                                      (3)

PT Sentence-level LSTM and PT Sentence-level Attention. The sequence of weighted sentence vectors (s^pt_a) is the input of the PT sentence-level LSTM. Again, we would like to find which part of a document is more important for classification; in other words, we need to know in which sentence the style of the document changes significantly. So, the PT sentence-level attention layer is applied to the hidden state of the PT sentence-level LSTM [8]. Similarly, W_s and b' are a weight matrix and a bias vector, respectively, u_s is a context vector that is initialized randomly and adjusted during training, β_j is the attention weight, and d^pt_a is the final weighted document vector:

u^s_j = tanh(W_s h^s_j + b')                              (4)
β_j = exp((u^s_j)^T u_s) / Σ_j exp((u^s_j)^T u_s)         (5)
d^pt_a = Σ_j β_j h^s_j                                    (6)

The reverse version of the document passes through the same process in the second column of layers simultaneously. At this step, we have two weighted document vectors from the two parallel columns: the original (d^pt_a) and its reverse version (d^pt_{a,r}). Next, we explain how the two document vectors are used for classification.

2.3 Fusion and Output

The last and most important step is to detect a style change in a document. In this problem, the number of authors is unknown, and it is an open-set problem with respect to the authors. Hence, learning individual writing styles and directly observing a change is not applicable on its own; we need a mechanism that is independent of the number of authors/writing styles. Here, learning the difference between the two versions of a document is the key to detecting the existence of a style change. To do so, we use the fusion layer of our previous work [2], in which several similarity functions compute the similarity/difference between a pair of document vectors in a fully connected neural network layer. The similarity functions are listed in Table 1.

Table 1. Similarity functions. a, b: document vectors; n: number of features in a and b
Chi2 kernel:        exp(−γ Σ_i (a_i − b_i)^2 / (a_i + b_i))
Cosine similarity:  a^T b / (||a|| ||b||)
Euclidean:          (Σ_i (a_i − b_i)^2)^0.5
Linear kernel:      a^T b
RBF kernel:         exp(−γ ||a − b||^2)
Mean of L1 norm:    Σ_i |a_i − b_i| / n
Sigmoid kernel:     tanh(γ a^T b + c_0)
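As a small illustration of the fusion layer's inputs, the sketch below evaluates the functions of Table 1 for a pair of document vectors in NumPy. In the model itself these operations sit inside a fully connected layer of the network; here they are written as plain functions, and the values of γ and c_0 (as well as the epsilon guard against division by zero) are our own illustrative choices, not values reported in the paper.

import numpy as np

def similarity_vector(a, b, gamma=1.0, c0=1.0, eps=1e-12):
    # One entry per function of Table 1, evaluated on document vectors a and b.
    diff = a - b
    chi2    = np.exp(-gamma * np.sum(diff ** 2 / (a + b + eps)))
    cosine  = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    euclid  = np.sqrt(np.sum(diff ** 2))
    linear  = a @ b
    rbf     = np.exp(-gamma * np.sum(diff ** 2))
    mean_l1 = np.mean(np.abs(diff))
    sigmoid = np.tanh(gamma * (a @ b) + c0)
    return np.array([chi2, cosine, euclid, linear, rbf, mean_l1, sigmoid])

# Example with random stand-ins for d^pt_a and d^pt_{a,r} (hidden size 8, see Section 3).
d_a  = np.random.rand(8)
d_ar = np.random.rand(8)
V_f  = similarity_vector(d_a, d_ar)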
The weighted document vector and its reverse version, d^pt_a and d^pt_{a,r}, are compared in the fusion layer to learn whether a style change exists in documents written by different authors with various writing styles:

V_f = [sim_i(d^pt_a, d^pt_{a,r})]_i                       (7)

where V_f is the similarity vector and sim_i is one of the functions in Table 1. For documents with no style change, the fusion layer compares the language model of a single author between the forward and backward passes. For documents by multiple authors, there are two possible cases. In the first case, the order of the different writing styles differs between the two versions of the document; for example, a document d by three authors, d = [author_1, author_2, author_3], and its reverse d_r = [author_3, author_2, author_1]. In the second case, the order of writing styles is the same in both the regular and the reverse version; for example, d = [author_1, author_2, author_1] and d_r = [author_1, author_2, author_1]. In both cases, the fusion layer compares the language models of multiple authors and the model learns the transition from one writing style to another. However, in the first case the PT representations of the two versions differ much more than in the second case, since the order of authors differs between them. Finally, the similarity vector V_f is given to a softmax function for binary classification.

Table 2. PAN 2018 dataset statistics and results
              Train set   Validation set   Test set
Size          2980        1492             1352
Accuracy (%)  100         83.78            82.47

3 Results and Analysis

We participate in the PAN 2018 style change detection task [7,3]. The task provides a training set and a validation set, both publicly available before the competition, and a test set that is not visible to the participants and is used to evaluate the participating models, including our parallel attention network. In the training phase, we use the negative log-likelihood as the loss function and RMSprop with learning rate 1e-03 as the optimizer. We initialize the 100-dimensional PTF embedding from a uniform distribution over [0, 1). The hidden layer size of the two LSTMs is 8, and the batch size is 1. We also apply a dropout of 0.3 to the output of the fusion layer (see the configuration sketch below). The accuracy and the size of each set are listed in Table 2. Our model achieves 82% accuracy on the test set and 83.78% on the validation set, ranking second in PAN 2018. Since the difference between the two accuracies is small (less than 1.4%), the model appears to generalize well. Moreover, our model is independent of the style change positions, although they are given during the training phase; this makes the approach more applicable to real-world problems where such information is not available. To examine the effect of PTFs and of the fusion layer on style change detection, we run additional experiments on the PAN 2018 dataset.

PTF vs Word Unigram. To show that PTFs are effective elements for representing one's writing style, we train our model using only word unigram features instead of PTFs. We keep all settings intact and use GloVe pre-trained word vectors^2 as the input of the embedding layer. Here, the reverse of a document is created as before and contains the reversed sentences from the last to the first sentence of the document. The accuracy on the validation set^3 is 71%, almost 13% lower than with PTFs.
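The training configuration described above can be summarized in PyTorch as follows. The model class here is a deliberately reduced stand-in (a single LSTM over PTF ids) so that the snippet runs on its own; it is not the paper's parallel network, and only the stated hyperparameters (negative log-likelihood loss, RMSprop with learning rate 1e-03, 100-dimensional embeddings initialized uniformly over [0, 1), LSTM hidden size 8, dropout 0.3, batch size 1) are taken from the text.

import torch
import torch.nn as nn

class StandInModel(nn.Module):
    # Reduced placeholder for the parallel hierarchical attention network.
    def __init__(self, ptf_vocab=1000, embed_dim=100, hidden=8, dropout=0.3):
        super().__init__()
        self.ptf_embedding = nn.Embedding(ptf_vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, 2)           # {no style change, style change}

    def forward(self, ptf_ids):
        _, (h, _) = self.lstm(self.ptf_embedding(ptf_ids))
        return torch.log_softmax(self.out(self.drop(h[-1])), dim=-1)

model = StandInModel()
nn.init.uniform_(model.ptf_embedding.weight, 0.0, 1.0)   # uniform over [0, 1)
criterion = nn.NLLLoss()                                  # negative log-likelihood
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

doc = torch.randint(0, 1000, (1, 20))   # one document of 20 PTF ids (batch size 1)
label = torch.tensor([1])               # 1 = contains a style change
optimizer.zero_grad()
loss = criterion(model(doc), label)
loss.backward()
optimizer.step()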
Fusion vs Concatenation. The effect of the fusion layer, which includes several well-known similarity metrics, can be assessed by ablation. We replace the fusion layer with a fully connected layer that takes the concatenation of the two weighted document vectors produced by the two columns of attention networks. The accuracy on the validation set drops by 10%.

The downside of our method is its expensive pre-processing phase, which results in a very large PTF feature space for the embedding layer. The training set contains around 1,300,000 distinct PTFs, compared with 70,402 word unigram features, almost 19 times as many. However, this high dimensionality helps the attention mechanism find and focus on discriminative features for class label prediction.^4 We believe the model can be improved by feeding the style change positions to the network in an appropriate way, or by learning the latent hierarchical structure of a sentence instead of using a pre-trained parser.

^2 https://nlp.stanford.edu/projects/glove/
^3 The test set was not released.
^4 Producing PTFs with the standalone Stanford parser made the model slow; it took around 10 hours in the PAN evaluation process (test phase). It is much faster if one uses the CoreNLP server available at https://stanfordnlp.github.io/CoreNLP/corenlp-server.html. However, we were not allowed to use any external resources during the PAN evaluation phase.

4 Conclusion

We propose a model for the new problem of style change detection at PAN 2018. We use parse tree features to exploit the hierarchical structure of a sentence and extract them so that the order of the corresponding words is preserved for the parallel hierarchical attention network. The results show that our model achieves promising results although we do not use the style change positions for training and rely only on the raw text of the dataset.

Acknowledgments

This work is supported in part by NSF 1527364. We also thank the anonymous reviewers for their helpful feedback.

References

1. Chomsky, N.: Syntactic Structures. Mouton (1957)
2. Hosseinia, M., Mukherjee, A.: Experiments with neural networks for small and large scale authorship verification. arXiv preprint arXiv:1803.06456 (2018)
3. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2018)
4. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (2003)
5. Landauer, T.K., Dumais, S.T.: A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2), 211 (1997)
6. Mikolov, T.: Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April (2012)
7. Stamatatos, E., Rangel, F., Tschuggnall, M., Kestemont, M., Rosso, P., Stein, B., Potthast, M.: Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., Sanjuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18). Springer, Berlin Heidelberg New York (Sep 2018)
8. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489 (2016)