<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Author Identification using multi-headed Recurrent Neural Networks</article-title>
      </title-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Recurrent neural networks (RNNs) are very good at modelling the flow of text, but typically need to be trained on a far larger corpus than is available for the PAN 2015 Author Identification task. This paper describes a novel approach where the output layer of a character-level RNN language model is split into several independent predictive sub-models, each representing an author, while the recurrent layer is shared by all. This allows the recurrent layer to model the language as a whole without over-fitting, while the outputs select aspects of the underlying model that reflect their author's style. The method proves competitive, ranking ifrst in two of the four languages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A recurrent neural network (RNN) is a feed forward neural network that shares
parameters across time. At each time step t, a simple RNN has a hidden vector ht derived
from the input vector xt and the previous hidden state ht 1. The hidden vector is
usually obtained via an afine transform ofset by a bias vector bh, followed by a non-linear
“activation” function, fh, which operates in a point-wise manner, giving the formula
ht = fh(Whhht 1 + Wxhxt + bh). The output vector yt is similarly obtained from the
hidden state, with yt = fy (Why ht + by ), though the non-linear function fy is often not
point-wise. When a discrete probability distribution is desired, the “softmax” function
(z)j = ∑k ezk
ezj
is common because it gives a correct distribution (all values positive and summing to 1)
and has useful properties when it comes to training the network. See figure 1.</p>
      <p>
        The self-referential hidden state allows the network to model complex time series
processes. Training it to do so involves iteratively adjusting the weights and biases,
usually using some form of gradient descent and back-propagation through time (BPTT).
For the sake of brevity, the details of these algorithms are elided. Tomáš Mikolov’s PhD
thesis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] ofers a good introduction to these algorithms and the use of simple recurrent
neural networks for language modelling, a task he shows they excel at.
      </p>
      <p>A language model predicts the flow of text, one symbol at a time, estimating a
probability distribution for the i-th symbol xi given its predecessors, or p(xijxi 1; xi 2; : : : ; x1),
where the symbols belong to a predetermined vocabulary. Language models for
alphabetic languages are usually word-based or character-based, with the former having a
large vocabulary with inevitable gaps, while the latter have a small vocabulary with few
omissions. Word-based models achieve better results that in general, but require a lot of
text and time to train. Character-based models perform better with tiny amounts of text
and cope better with idiosyncratic usages. Figure 2 shows a character-based language
model trying to navigate its way through the character sequence cat.</p>
      <p>The accuracy of a language model for a document can be measured as bits of cross
entropy which is the mean negative binary log probability the model assigns to the
symbols that actually occur. For a document of N characters, this is N1 ∑iN log2 p(xij
xi 1; xi 2; : : : ; x1). Cross-entropy can be thought of a measure of the information the
model fails to model. A language model that predicts every character in a document with
probability 1 will result in a zero cross-entropy; on the other hand assigning probability
0 to an occurring character is very costly.</p>
      <p>
        Supposing training is efective, a language model will better predict the flow of
documents similar to the ones it was trained on—if that similarity is capturable by
the model—and this will show up as reduced cross entropy relative to unrelated
documents. The PAN author identification problem consists of 100 mini-corpora for each
language.[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]1 Each mini-corpus contains 1 to 5 documents known to be by a single
author. These documents amount to a few thousand characters—generally fewer than in
this paper—and the task is to decide whether another short document is by the same
author. The unknown documents difer from the known ones in topic or genre.
1 http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/author-identification.html
has more information on the task. This paper was written in the context of that challenge and
assumes some awareness of it.
      </p>
      <p>It is plausible that a character-level RNN language model trained on an author’s
known output will match that author’s unknown output more closely than it matches text
written by others. Conventional language models are trained on millions of characters,
and will severely over-fit on a corpus of a few thousand. To combat this, a multi-headed
character-level language model is introduced, which shares a common recurrent state but
multiple independent softmax output groups. Each softmax group is trained
predominantly on one author’s corpus, causing the recurrent layer to model a combination of all
the author’s texts, approximating the language as a whole. The output groups learn to
weight the recurrent values in a way that best reflects their authors tendencies. Figure 3
attempts to illustrate this.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Text preprocessing</title>
        <p>The known and unknown texts are mapped to a smaller character set to reduce
computational complexity and remove the overwhelming self-importance of extremely rare
characters. A separate mapping is used for each language.</p>
        <p>The text is first converted into the NFKD unicode normal form, which decomposes
accented letters into the letter followed by the a combining accent (for example, the
code point &lt;U+00E0&gt; (“à”) becomes &lt;U+0061&gt;&lt;U+0300&gt; (“a” followed by a combining
grave marker). Capital letters are further decomposed into an uppercase marker followed
by the corresponding lowercase letter. Thus À becomes the triplet &lt;right-combining
uppercase modifier&gt; a &lt;left-combining grave accent&gt;.</p>
        <p>Various rare characters that seemed largely equivalent are mapped together; for
example the en-dash (“–”) and em-dash (“—”) are rare and appear to be used
interchangeably in practice so these are mapped together. Some punctuation marks—such as the
various forms of apostrophes and single quotation marks—are collapsed into canonical
forms. It is likely that this discards some usable information—indications of finger habits
or choice of software—but the avoidance of extremely rare characters is important. If
only one author in the corpus uses a character, it may be assigned an excessive weight.</p>
        <p>For the Greek text, all Latin characters are mapped to the letter s, arbitrarily chosen
because it doesn’t resemble a Greek character. The rationale is that foreign quotations
and references appear too rarely for their content to be valuable and an attempt to model
them would be wasteful, but the tendency to use them might be a useful signal.
Following similar logic, all digits in all languages are mapped to 7. Runs of whitespace are
collapsed into a single space.</p>
        <p>At the end of this processing, any character with a frequency lower than 1 in 10,000
is discarded. Any characters occurring in a text but not in the resultant alphabet are
ignored—there is no “unknown” token. Alphabet sizes are 39 for English, 47 for Dutch
and Greek, and 51 for Spanish.</p>
        <p>Finally, runs of more than 5 identical characters are truncated at 5. This is mainly
aimed at the latin stretches in Greek text; there is little value in having the model try
to guess the exact word length in a language it can’t directly see. As an example, the
following Greek text (the last paragraph of GR094/known02.txt in the training set):
«Τhis», μουρμουρίζει χαμογελώντας μελαγχολικά ο Αμερικάνος,
ακούγοντας το Σαλονικιό «is the beginning of a beautiful
friendship».</p>
        <p>Παίξτο ξανά.
maps to this:
«¹τsss», μουρμουρι²ζει χαμογελω²ντας μελαγχολικα² ο ¹αμερικα²νος,
ακου²γοντας το ¹σαλονικιο² «ss sss sssss ss s sssss sssss».
¹παι²ξτο ξανα².
where the superscript ¹ represents a capital marker attaching to the next character, while
the superscript ² is an acute belonging to the previous character. In this case the
transformation reveals that the author has spelt “This” with an uppercase τ (tau) instead of
the visually identical uppercase t.</p>
        <p>The character mappings were settled before training started and no attempts were
made to test their eficacy.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Multi-headed recurrent neural network language model</title>
        <p>This concept was briefly introduced in section 1. A single recurrent neural network is
trained to predict the flow of text by many authors while sharing a collective model
of the complete language. The output layer has a softmax group for each author, and
sometimes another for a “control corpus”—a large body of text intended to help prevent
over-fitting in the recurrent layer. For convenience the PAN 2014 training set is used
for the control corpora. It turns out that there is some overlap between the 2014 and
2015 training sets (and possibly test sets), so the control corpora are not completely
independent. Empirically they seem to have a very small positive efect.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>ReSQRT activation</title>
        <p>The activation function used for the recurrent layer is a rectified shifted square root, or
ReSQRT, defined as
f (x) =
{px + 1 1 if x
0 otherwise.</p>
        <p>0</p>
        <p>
          The derivative is 2px1+1 for x &gt; 0, and otherwise 0. In terms of y = f (x) (which
is of practical use in training) the non-zero part is 2(y1+1) . This follows the model of
the widely used rectified linear unit (ReLU,[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] defined as f (x) = max(x; 0)) in that
the output and derivative is zero for all non-positive numbers, which ofers performance
and propagative benefits and allows neurons to opt out of opining on areas outside their
speciality. ReLU can be dificult to use in recurrent neural networks as the lack of
inherent scale means it can slip into an explosive cycle of amplification more easily than
conventional squashing activations like tanh. ReSQRT steers a middle course, and
empirically works well for character based language models. The ReSQRT activation may
not have been described before.
        </p>
        <p>The training uses a form of adagrad learning, where the rate at which a weight can
change is inversely proportional to the L2 distance that it has already changed. This
amounts to a monotonically decreasing per-weight learning rate.
2.4</p>
      </sec>
      <sec id="sec-2-4">
        <title>Training</title>
        <p>Training is structured as a number of “sub-epochs”. In a sub-epoch each author provides
a training text. Where there is one text for every author, a sub-epoch is the same as a
conventional epoch; otherwise the author’s texts are used in a cyclical way. In some runs
all the texts of each author are concatenated—in these cases the sub-epoch is also a true
epoch. In another mode, each sub-epoch is “balanced”, with documents being drawn
from each author until each has been trained around the same amount.</p>
        <p>At each training step there is a chance of the training example “leaking” and afecting
other authors, as if they had also made that particular choice of character. The initial
leakage rate is in the order of 1/N where N is the number of authors, and it decays
exponentially with each sub-epoch. Towards the end of training, the leakage rate is very
low and each author’s sub-model is being trained almost entirely on its own text. The
parameters are thus roughly set during early training with high leakage and high learning
rate, then refined to specialise on the author’s style.</p>
        <p>A mini-batch size of 40 is used, meaning the weights are modified by the
accumulated deltas after every 40 characters. The training gradient for first 10 characters of
each text is ignored. BPTT is truncated at a depth of 70, or sooner when the gradient
diminishes below an adaptive threshold. The hidden layer is quite small; experiments
with larger sizes showed no benefit.</p>
        <p>Where there is a single “known” text, and it is significantly shorter than the
“unknown” text, the two are swapped around and the RNN is trained on the unknown text.
Insuficient training text, which causes over-fitting, is worse than insuficient test text
which only increases uncertainty.</p>
        <p>The recurrent neural network code is written in C while the PAN specific parts are
Python.2
2.5</p>
      </sec>
      <sec id="sec-2-5">
        <title>Ensemble of variations</title>
        <p>The final results combine several runs together, using a variety of configurations of
metaparameters. Approximate ranges are shown in table 1. Each run consists of a training
phase lasting several minutes or hours followed by the calculation of cross entropies for
the unknown documents, which takes less than a second.</p>
        <p>Sets of seemingly reasonable meta-parameters are chosen per-language via a
haphazard search, and random selections of these configurations are used in the the
evaluation ensembles. 3 The evaluations are run with a time-out; after a set number of hours
all the test scores obtained are averaged together. The length of time dedicated to each
ensemble was determined by the looming submission deadline.</p>
        <p>In tests with the training set (and in line with received wisdom), ensembles
performed slightly better than the majority of their constituents.
2 See https://github.com/douglasbagnall/recur and https://github.com/pan-webis-de/caravel for
software details.
3 The configuration pools from which the ensembles were chosen can be found at
https://github.com/pan-webis-de/caravel/tree/master/config. Some decoding is necessary.
initial adagrad learning scale
initial leakage between classes
leakage decay (per sub-epoch)
hidden neurons
presynaptic noise
sub-epochs
text direction
text handling
initialisation
control corpus
typical values
With N predictive output clusters representing the N known authors in the corpus, the
system produces N cross entropy scores for each document. The scores are not directly
comparable to each other as there is both random variation in the various sub-models’
performance and inherent variation in the texts’ predictability. The scores were
normalised on both axes. For each author’s sub-model, the ranking of the score for the
unknown text determines the probability that the author wrote the text. For example, if
a text ranks first (i.e. has the lowest score) it is likely to be that author’s work; if it ranks
last, it is probably not. This system is likely not the best, but it worked well enough and
got forgotten about.</p>
        <p>The final score must take the form of a probability between 0 and 1, with 0.5 having
special significance for the C@1 calculation. The task is designed so that in exactly half
the cases the unknown documents are by the known author, so (supposing confidence
in the model) the correct way to align the 0.5 point is to take the median. Documents
ranking better than the median receive a score above 0.5 linearly related to their ranking,
while those ranking worse than the median receive a lower score. For the Spanish, Greek,
and English tasks a small radius around the median was also assigned an 0.5 score.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Results are shown in table 2.</p>
      <p>The evaluation score for the Greek task is higher than expected, based on
experiments with the training set, while that for Spanish is lower. The English and Dutch
results are no surprise.</p>
      <p>language
Dutch
English
Greek
Spanish
0.70
0.81</p>
      <p>The consistently bad results for the Dutch task seem to reflect a drastic genre
diference between the known and unknown texts. Many competitors did a lot better in this
task, presumably by not concentrating on the local flow of characters and instead using
linguistic knowledge or otherwise deriving longer range features.</p>
      <p>One the other hand, where the genre and topic seem closer, this model performs very
well despite using no specialist linguistic or natural language processing techniques. It
ought to extrapolate well to other languages. It also ought to perform very well in concert
with unrelated methods.</p>
      <p>Without an active recurrent layer the model falls to predicting unigram character
probabilities per author via the output bias vector. The cross entropy in this case amounts
to weighting the frequencies of characters in the text. When (accidentally) run this way
on the training set, the AUC score was 0.85 for English and 0.91 for Spanish; for English
this was one of the best training results. While this is humiliating for the RNN, it
conifrms the validity of the training and scoring methods. Leakage between authors allows
each sub-model to learn background probabilities for the language as a whole, which
is presumably better than the common technique of giving unseen symbols a fixed low
chance. By being entirely additive, cross entropy sidesteps the curse of dimensionality.
It also eschews (perhaps too much) the possibility of an aha moment involving strong
positive evidence—positive evidence is typically accumulated a fraction of a bit at a
time over the course of the entire document. This is in contrast to a typical human
approach of identifying idiosyncratic usages and ignoring the boring bits in between; this
contrast means the RNN might do well in ensembles with people. An RNN that directly
reported an authorship probability distribution is conceivable but is unlikely to be easy
to train. The character level language model learns to concentrate on the text; a more
direct approach is likely to be more distracted. Tirelessly trying to understand the text is
the strength of this method.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Mikolov</surname>
          </string-name>
          , T.:
          <source>Statistical Language Models Based on Neural Networks. Ph.D. thesis, Ph. D. thesis</source>
          , Brno University of Technology (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Rectified linear units improve restricted boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source>
          . pp.
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            <given-names>Lopez</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN 2015</article-title>
          .
          <article-title>In: Working Notes Papers of the CLEF 2015 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2015</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>