Introduction

Author Identification using multi-headed Recurrent Neural Networks

2015

Recurrent neural networks (RNNs) are very good at modelling the flow of text, but typically need to be trained on a far larger corpus than is available for the PAN 2015 Author Identification task. This paper describes a novel approach where the output layer of a character-level RNN language model is split into several independent predictive sub-models, each representing an author, while the recurrent layer is shared by all. This allows the recurrent layer to model the language as a whole without over-fitting, while the outputs select aspects of the underlying model that reflect their author's style. The method proves competitive, ranking ifrst in two of the four languages.

Introduction

A recurrent neural network (RNN) is a feed forward neural network that shares parameters across time. At each time step t, a simple RNN has a hidden vector ht derived from the input vector xt and the previous hidden state ht 1. The hidden vector is usually obtained via an afine transform ofset by a bias vector bh, followed by a non-linear “activation” function, fh, which operates in a point-wise manner, giving the formula ht = fh(Whhht 1 + Wxhxt + bh). The output vector yt is similarly obtained from the hidden state, with yt = fy (Why ht + by ), though the non-linear function fy is often not point-wise. When a discrete probability distribution is desired, the “softmax” function (z)j = ∑k ezk ezj is common because it gives a correct distribution (all values positive and summing to 1) and has useful properties when it comes to training the network. See figure 1.

The self-referential hidden state allows the network to model complex time series processes. Training it to do so involves iteratively adjusting the weights and biases, usually using some form of gradient descent and back-propagation through time (BPTT). For the sake of brevity, the details of these algorithms are elided. Tomáš Mikolov’s PhD thesis [ 1 ] ofers a good introduction to these algorithms and the use of simple recurrent neural networks for language modelling, a task he shows they excel at.

A language model predicts the flow of text, one symbol at a time, estimating a probability distribution for the i-th symbol xi given its predecessors, or p(xijxi 1; xi 2; : : : ; x1), where the symbols belong to a predetermined vocabulary. Language models for alphabetic languages are usually word-based or character-based, with the former having a large vocabulary with inevitable gaps, while the latter have a small vocabulary with few omissions. Word-based models achieve better results that in general, but require a lot of text and time to train. Character-based models perform better with tiny amounts of text and cope better with idiosyncratic usages. Figure 2 shows a character-based language model trying to navigate its way through the character sequence cat.

The accuracy of a language model for a document can be measured as bits of cross entropy which is the mean negative binary log probability the model assigns to the symbols that actually occur. For a document of N characters, this is N1 ∑iN log2 p(xij xi 1; xi 2; : : : ; x1). Cross-entropy can be thought of a measure of the information the model fails to model. A language model that predicts every character in a document with probability 1 will result in a zero cross-entropy; on the other hand assigning probability 0 to an occurring character is very costly.

Supposing training is efective, a language model will better predict the flow of documents similar to the ones it was trained on—if that similarity is capturable by the model—and this will show up as reduced cross entropy relative to unrelated documents. The PAN author identification problem consists of 100 mini-corpora for each language.[ 3 ]1 Each mini-corpus contains 1 to 5 documents known to be by a single author. These documents amount to a few thousand characters—generally fewer than in this paper—and the task is to decide whether another short document is by the same author. The unknown documents difer from the known ones in topic or genre. 1 http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/author-identification.html has more information on the task. This paper was written in the context of that challenge and assumes some awareness of it.

It is plausible that a character-level RNN language model trained on an author’s known output will match that author’s unknown output more closely than it matches text written by others. Conventional language models are trained on millions of characters, and will severely over-fit on a corpus of a few thousand. To combat this, a multi-headed character-level language model is introduced, which shares a common recurrent state but multiple independent softmax output groups. Each softmax group is trained predominantly on one author’s corpus, causing the recurrent layer to model a combination of all the author’s texts, approximating the language as a whole. The output groups learn to weight the recurrent values in a way that best reflects their authors tendencies. Figure 3 attempts to illustrate this.

Method Text preprocessing

The known and unknown texts are mapped to a smaller character set to reduce computational complexity and remove the overwhelming self-importance of extremely rare characters. A separate mapping is used for each language.

The text is first converted into the NFKD unicode normal form, which decomposes accented letters into the letter followed by the a combining accent (for example, the code point <U+00E0> (“à”) becomes <U+0061><U+0300> (“a” followed by a combining grave marker). Capital letters are further decomposed into an uppercase marker followed by the corresponding lowercase letter. Thus À becomes the triplet <right-combining uppercase modifier> a <left-combining grave accent>.

Various rare characters that seemed largely equivalent are mapped together; for example the en-dash (“–”) and em-dash (“—”) are rare and appear to be used interchangeably in practice so these are mapped together. Some punctuation marks—such as the various forms of apostrophes and single quotation marks—are collapsed into canonical forms. It is likely that this discards some usable information—indications of finger habits or choice of software—but the avoidance of extremely rare characters is important. If only one author in the corpus uses a character, it may be assigned an excessive weight.

For the Greek text, all Latin characters are mapped to the letter s, arbitrarily chosen because it doesn’t resemble a Greek character. The rationale is that foreign quotations and references appear too rarely for their content to be valuable and an attempt to model them would be wasteful, but the tendency to use them might be a useful signal. Following similar logic, all digits in all languages are mapped to 7. Runs of whitespace are collapsed into a single space.

At the end of this processing, any character with a frequency lower than 1 in 10,000 is discarded. Any characters occurring in a text but not in the resultant alphabet are ignored—there is no “unknown” token. Alphabet sizes are 39 for English, 47 for Dutch and Greek, and 51 for Spanish.

Finally, runs of more than 5 identical characters are truncated at 5. This is mainly aimed at the latin stretches in Greek text; there is little value in having the model try to guess the exact word length in a language it can’t directly see. As an example, the following Greek text (the last paragraph of GR094/known02.txt in the training set): «Τhis», μουρμουρίζει χαμογελώντας μελαγχολικά ο Αμερικάνος, ακούγοντας το Σαλονικιό «is the beginning of a beautiful friendship».

Παίξτο ξανά. maps to this: «¹τsss», μουρμουρι²ζει χαμογελω²ντας μελαγχολικα² ο ¹αμερικα²νος, ακου²γοντας το ¹σαλονικιο² «ss sss sssss ss s sssss sssss». ¹παι²ξτο ξανα². where the superscript ¹ represents a capital marker attaching to the next character, while the superscript ² is an acute belonging to the previous character. In this case the transformation reveals that the author has spelt “This” with an uppercase τ (tau) instead of the visually identical uppercase t.

The character mappings were settled before training started and no attempts were made to test their eficacy. 2.2

Multi-headed recurrent neural network language model

This concept was briefly introduced in section 1. A single recurrent neural network is trained to predict the flow of text by many authors while sharing a collective model of the complete language. The output layer has a softmax group for each author, and sometimes another for a “control corpus”—a large body of text intended to help prevent over-fitting in the recurrent layer. For convenience the PAN 2014 training set is used for the control corpora. It turns out that there is some overlap between the 2014 and 2015 training sets (and possibly test sets), so the control corpora are not completely independent. Empirically they seem to have a very small positive efect. 2.3

ReSQRT activation

The activation function used for the recurrent layer is a rectified shifted square root, or ReSQRT, defined as f (x) = {px + 1 1 if x 0 otherwise.

The derivative is 2px1+1 for x > 0, and otherwise 0. In terms of y = f (x) (which is of practical use in training) the non-zero part is 2(y1+1) . This follows the model of the widely used rectified linear unit (ReLU,[ 2 ] defined as f (x) = max(x; 0)) in that the output and derivative is zero for all non-positive numbers, which ofers performance and propagative benefits and allows neurons to opt out of opining on areas outside their speciality. ReLU can be dificult to use in recurrent neural networks as the lack of inherent scale means it can slip into an explosive cycle of amplification more easily than conventional squashing activations like tanh. ReSQRT steers a middle course, and empirically works well for character based language models. The ReSQRT activation may not have been described before.

The training uses a form of adagrad learning, where the rate at which a weight can change is inversely proportional to the L2 distance that it has already changed. This amounts to a monotonically decreasing per-weight learning rate. 2.4

Training

Training is structured as a number of “sub-epochs”. In a sub-epoch each author provides a training text. Where there is one text for every author, a sub-epoch is the same as a conventional epoch; otherwise the author’s texts are used in a cyclical way. In some runs all the texts of each author are concatenated—in these cases the sub-epoch is also a true epoch. In another mode, each sub-epoch is “balanced”, with documents being drawn from each author until each has been trained around the same amount.

At each training step there is a chance of the training example “leaking” and afecting other authors, as if they had also made that particular choice of character. The initial leakage rate is in the order of 1/N where N is the number of authors, and it decays exponentially with each sub-epoch. Towards the end of training, the leakage rate is very low and each author’s sub-model is being trained almost entirely on its own text. The parameters are thus roughly set during early training with high leakage and high learning rate, then refined to specialise on the author’s style.

A mini-batch size of 40 is used, meaning the weights are modified by the accumulated deltas after every 40 characters. The training gradient for first 10 characters of each text is ignored. BPTT is truncated at a depth of 70, or sooner when the gradient diminishes below an adaptive threshold. The hidden layer is quite small; experiments with larger sizes showed no benefit.

Where there is a single “known” text, and it is significantly shorter than the “unknown” text, the two are swapped around and the RNN is trained on the unknown text. Insuficient training text, which causes over-fitting, is worse than insuficient test text which only increases uncertainty.

The recurrent neural network code is written in C while the PAN specific parts are Python.2 2.5

Ensemble of variations

The final results combine several runs together, using a variety of configurations of metaparameters. Approximate ranges are shown in table 1. Each run consists of a training phase lasting several minutes or hours followed by the calculation of cross entropies for the unknown documents, which takes less than a second.

Sets of seemingly reasonable meta-parameters are chosen per-language via a haphazard search, and random selections of these configurations are used in the the evaluation ensembles. 3 The evaluations are run with a time-out; after a set number of hours all the test scores obtained are averaged together. The length of time dedicated to each ensemble was determined by the looming submission deadline.

In tests with the training set (and in line with received wisdom), ensembles performed slightly better than the majority of their constituents. 2 See https://github.com/douglasbagnall/recur and https://github.com/pan-webis-de/caravel for software details. 3 The configuration pools from which the ensembles were chosen can be found at https://github.com/pan-webis-de/caravel/tree/master/config. Some decoding is necessary. initial adagrad learning scale initial leakage between classes leakage decay (per sub-epoch) hidden neurons presynaptic noise sub-epochs text direction text handling initialisation control corpus typical values With N predictive output clusters representing the N known authors in the corpus, the system produces N cross entropy scores for each document. The scores are not directly comparable to each other as there is both random variation in the various sub-models’ performance and inherent variation in the texts’ predictability. The scores were normalised on both axes. For each author’s sub-model, the ranking of the score for the unknown text determines the probability that the author wrote the text. For example, if a text ranks first (i.e. has the lowest score) it is likely to be that author’s work; if it ranks last, it is probably not. This system is likely not the best, but it worked well enough and got forgotten about.

The final score must take the form of a probability between 0 and 1, with 0.5 having special significance for the C@1 calculation. The task is designed so that in exactly half the cases the unknown documents are by the known author, so (supposing confidence in the model) the correct way to align the 0.5 point is to take the median. Documents ranking better than the median receive a score above 0.5 linearly related to their ranking, while those ranking worse than the median receive a lower score. For the Spanish, Greek, and English tasks a small radius around the median was also assigned an 0.5 score. 3

Results

Results are shown in table 2.

The evaluation score for the Greek task is higher than expected, based on experiments with the training set, while that for Spanish is lower. The English and Dutch results are no surprise.

language Dutch English Greek Spanish 0.70 0.81

The consistently bad results for the Dutch task seem to reflect a drastic genre diference between the known and unknown texts. Many competitors did a lot better in this task, presumably by not concentrating on the local flow of characters and instead using linguistic knowledge or otherwise deriving longer range features.

One the other hand, where the genre and topic seem closer, this model performs very well despite using no specialist linguistic or natural language processing techniques. It ought to extrapolate well to other languages. It also ought to perform very well in concert with unrelated methods.

Without an active recurrent layer the model falls to predicting unigram character probabilities per author via the output bias vector. The cross entropy in this case amounts to weighting the frequencies of characters in the text. When (accidentally) run this way on the training set, the AUC score was 0.85 for English and 0.91 for Spanish; for English this was one of the best training results. While this is humiliating for the RNN, it conifrms the validity of the training and scoring methods. Leakage between authors allows each sub-model to learn background probabilities for the language as a whole, which is presumably better than the common technique of giving unseen symbols a fixed low chance. By being entirely additive, cross entropy sidesteps the curse of dimensionality. It also eschews (perhaps too much) the possibility of an aha moment involving strong positive evidence—positive evidence is typically accumulated a fraction of a bit at a time over the course of the entire document. This is in contrast to a typical human approach of identifying idiosyncratic usages and ignoring the boring bits in between; this contrast means the RNN might do well in ensembles with people. An RNN that directly reported an authorship probability distribution is conceivable but is unlikely to be easy to train. The character level language model learns to concentrate on the text; a more direct approach is likely to be more distracted. Tirelessly trying to understand the text is the strength of this method.

1. Mikolov , T.: Statistical Language Models Based on Neural Networks. Ph.D. thesis, Ph. D. thesis , Brno University of Technology ( 2012 )

2. Nair , V. , Hinton , G.E.: Rectified linear units improve restricted boltzmann machines . In: Proceedings of the 27th International Conference on Machine Learning (ICML-10) . pp. 807 - 814 ( 2010 )

3. Stamatatos , E. , Daelemans , W. , Verhoeven , B. , Juola , P. , Lopez

Lopez

, A. , Potthast , M. , Stein , B. : Overview of the Author Identification Task at PAN 2015 . In: Working Notes Papers of the CLEF 2015 Evaluation Labs . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2015 ), http://www.clef-initiative.eu/publication/working-notes