Improving Authorship Verification using Linguistic Divergence

Yifan Zhang, Dainis Boumber, Marjan Hosseinia, Fan Yang and Arjun Mukherjee
University of Houston

Abstract
We propose an unsupervised solution to the Authorship Verification task that utilizes pre-trained deep language models to compute a new metric called DV-Distance. The proposed metric measures the difference between two authors by comparing their documents against pre-trained language models. Our design addresses the problem of non-comparability in authorship verification, frequently encountered in small or cross-domain corpora. To the best of our knowledge, this paper is the first to introduce a method designed with non-comparability in mind from the ground up, rather than addressing it indirectly. It is also one of the first to use deep language models in this setting. The approach is intuitive, and it is easy to understand and interpret through visualization. Experiments on four datasets show our methods matching or surpassing the current state of the art and strong baselines in most tasks.

Keywords
Authorship Verification, Unsupervised Learning, Language Modeling, Spam/Troll Detection

ROMCIR 2021: Workshop on Reducing Online Misinformation through Credible Information Retrieval, held as part of ECIR 2021: the 43rd European Conference on Information Retrieval, March 28 – April 1, 2021, Lucca, Italy (Online Event)
yzhang114@uh.edu (Y. Zhang); dainis.boumber@gmail.com (D. Boumber); ma.hosseinia@gmail.com (M. Hosseinia); fyang11@uh.edu (F. Yang); arjun@cs.uh.edu (A. Mukherjee)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Authorship Attribution (AA) [1] and Verification (AV) [2] are challenging problems that are important in this age of "Fake News". The former attempts to answer who wrote a specific document; the latter concerns itself with the problem of finding out whether the same person authored several documents or not. Ultimately, the goal of AV is to determine whether the same author wrote any two documents of arbitrary authorship. These problems have attracted renewed attention as we urgently need better tools to combat content farming, social bots, and other forms of communication pollution.

An interesting aspect of authorship problems is that technology used elsewhere in NLP has not yet penetrated them. Up until the very recent PAN 2018 and PAN 2020 Authorship events [3, 4], the most popular and effective approaches still largely relied on n-gram features and traditional machine learning classifiers, such as support vector machines (SVM) [5] and decision trees [6]. Elsewhere, these methods have recently had to give up much of their spotlight to deep neural networks. This phenomenon may be mostly attributed to the fact that authorship problems are often data constrained, as the amount of text from a particular author is often very limited. To our knowledge, only a few deep learning models have been proposed and shown to be effective in authorship tasks [7, 8, 9], and even these networks require a good amount of text to perform well. Likewise, transfer learning has not been utilized to its full potential, even though recent work on deep language models shows it to be a silver bullet for tasks lacking training data [10].

We propose a deep authorship verification method that uses a new measurement, DV-Distance.
It estimates the magnitude and the direction of deviation of a document from the Normal Writing Style (NWS) by modeling the NWS with state-of-the-art language models such as AWD-LSTM [11] and RoBERTa [12]. We propose an unsupervised method that directly utilizes the DV-Distance and a supervised neural architecture that projects these vectors into a separate space. The proposed models have an intuitive and theoretically sound design and come with good interpretability. Experiments conducted on four PAN Authorship Verification datasets show our methods surpassing the state of the art on three and remaining competitive on the fourth.

2. Authorship Verification and the Non-comparability Problem

In the following sections, we use the symbol 𝑃 to denote an authorship verification problem. Each problem 𝑃 consists of two elements: a set of known documents 𝐾 and a set of unknown documents 𝑈. Similarly, 𝑘 and 𝑢 represent a single known and unknown document, respectively. The task is then to find a hypothesis, ℎ, that takes in both components and correctly estimates the probability that the same author wrote them.

Important in many forensic, academic, and other scenarios, AV tasks remain very challenging for several reasons. For one, in a cross-domain authorship verification problem, the documents in 𝐾 and 𝑢 could be of entirely different genres and types. More specifically, 𝐾 could contain several novels written by a known author, while 𝑢 is a Twitter post. Another example demonstrating why a cross-domain model may be necessary is the case of a death note [13], as it is implausible to obtain a set 𝐾 containing death notes written by the suspect. Furthermore, solving an authorship verification problem usually involves addressing one or more types of limited training data challenges: a limited number of training problems 𝑃, out-of-set documents and authors appearing in the test data, or a limited amount of content in the document sets {𝐾, 𝑈} of a particular problem 𝑃. Many methods use sophisticated forms of test-time processing, data augmentation, or ensembling to successfully minimize these challenges' impact and achieve state-of-the-art results [7, 14]. However, such solutions typically result in prohibitively slow performance, most require a considerable amount of tuning, and almost all of them, to the best of our knowledge, require labeled data. As a result, existing methods are not practical in many real-world scenarios.

k: I suppose that was the reason. We were waiting for you without knowing it. Hallo!
u: He maketh me to lie down in green pastures; he leadeth me beside the still waters.
Figure 1: Sample document fragments from PAN-2015

Based on our observations, it is not unusual for an authorship verification model to identify some salient features in either 𝐾 or 𝑈, yet fail to find a directly comparable case in the other member of the pair. An example consisting of two brief segments from different authors is shown in Figure 1. We immediately notice that document 𝑢 contains the unusual, archaic words "maketh" and "leadeth". In contrast, document 𝑘 is written in relatively colloquial and modern English. A naive method of authorship verification one may devise in this scenario is to detect whether document 𝐾 contains the word "makes", the modern counterpart of "maketh". If there are occurrences of "makes" in 𝐾, we may be able to conclude that the two documents are from different authors.
The issue with this approach, however, is that there is a non-zero probability that 𝐾 contains no usage of "makes" at all. Although it is possible to overcome the problem of non-comparability with hand-crafted features, feature engineering is often a labor-intensive process that requires manual labeling. It is also impractical to design features that encode every characteristic of every word. On the other hand, while some modern neural-network-based methods build upon the concept of distributed representations (word embeddings) and are able to encode some of the essential features, no existing approach explicitly attempts to address the non-comparability problem.

To address non-comparability, we formulate the Normal Writing Style (NWS), which can be seen as a universal reference for distinguishing between a pair of documents and solving the AV task in most scenarios in an unsupervised manner. A document pair's difference or similarity is determined with respect to the NWS; to this end, we establish a new metric called the Deviation Vector Distance (DV-Distance). To the best of our knowledge, the proposed approach is the first model designed with non-comparability in mind from the ground up.

3. Normal Writing Style and Deviation Vector

To make a small and often cross-domain document pair comparable, we propose to compare both documents to the Normal Writing Style instead of directly comparing the pair. We define the Normal Writing Style, or NWS, loosely as what an average writer would write, given a specific writing genre, era, and language. From a statistical perspective, the NWS can be modeled as the average probability distribution over the vocabulary at a given location, conditioned on its context. As manifested in Figure 1, the reason the words "maketh" and "leadeth" stand out in document 𝑢 is that they are rarely used in today's writing. They hence deviate from the Normal Writing Style.

We hypothesize that we can utilize modern neural language models to model the NWS, and that the predicted word embedding at a given location is a good semantic proxy of what an average writer would write at that location. We further hypothesize that, generally, an author has a consistent direction of deviance in the word embedding space. Consequently, if two documents 𝑘 and 𝑢 have the same direction of deviation, then the two documents are likely from the same author. Conversely, if two documents have significantly different directions of deviation, then they are probably from different authors.

Previous empirical evidence shows that word embeddings constructed using neural language models are good at capturing syntactic and semantic regularities in language [15, 16, 17]. The vector offsets encode properties of words and relationships between them. A famous example demonstrating these properties is the embedding vector operation "King - Man + Woman = Queen", which indicates that there is a specific vector offset that encodes the difference in gender. Given the above context, we theorize it is possible to encode the deviance of "maketh" from "makes" as "maketh - makes" in a similar manner. We shall refer to the offset vector calculated this way as the Deviation Vector (DV). Figure 2 shows an illustrative example that visualizes the roles of Normal Writing Style modeling and the DVs.

Figure 2: Sample document fragments from PAN-2015
In the upper part of the figure, a document 𝑘 by a male author is shown, containing the sentence "I hate shaving my beard." In the bottom half of the figure, we see a document 𝑢 written by a female author: "My favorite gift is a dress." Assume we have an NWS model that correctly predicts all the words except at the locations marked with a question mark. In place of those words, the NWS model may predict very general terms, such as "do" or "thing". The actual words at these locations deviate from these general terms in the direction of the DV, represented in the figure using arrows. This specific example contains the words "beard" and "dress", usually associated with a particular gender, while the general terms are genderless. The two DVs must therefore each have a component along the gender axis in the embedding space, pointing in opposite directions.

4. Language Model and Implementation Details

We used the AWD-LSTM architecture [11], implemented as part of the Universal Language Model (ULMFiT) [10], and RoBERTa [12] to model the Normal Writing Style. AWD-LSTM is a three-layered LSTM-based language model trained by predicting the next word given the preceding sequence. Meanwhile, RoBERTa is a BERT-based model trained by predicting a masked word given an input sequence. Both of these language models are pre-trained on large corpora, and thus their predicted embeddings for unseen words can be used as a proxy for the statistical distribution of the Normal Writing Style.

Assuming these language models can adequately model the Normal Writing Style, the Deviation Vectors can be calculated by subtracting the actual embeddings of the words from the predicted word embeddings. More formally, consider an input sequence consisting of 𝑛 tokens {𝑤_1, ..., 𝑤_𝑛}. We use 𝐸𝑀𝐵 to denote the embedding layer of the language model and 𝐿𝑀 to denote the language model itself. Then 𝐸𝑀𝐵(𝑤_𝑖) corresponds to the embedding of the actual token at location 𝑖, and 𝐿𝑀(𝑤_𝑖) to the embedding predicted by the language model at location 𝑖 when the corresponding token is the next token (AWD-LSTM) or is masked (RoBERTa). The DV at location 𝑖 can then be calculated as:

    $DV_i = LM(w_i) - EMB(w_i)$    (1)

Figure 3: A demonstration of the process of calculating DVs using AWD-LSTM (left) and RoBERTa (right)

Figure 3 demonstrates the respective processes of calculating the DVs for a given input sequence using AWD-LSTM and RoBERTa. For AWD-LSTM, at each token location 𝑖, the deviation vector is calculated by subtracting the embedding of the current word at 𝑖 from the predicted embedding generated at the previous token location 𝑖 − 1. Consequently, for a document of 𝑛 words, a total of 𝑛 − 1 DVs can be generated. For RoBERTa, the predicted embedding at location 𝑖 is obtained by feeding the model the complete sequence of text with the token at 𝑖 replaced by the "[mask]" token. A total of 𝑛 such inferences need to be conducted to obtain the predicted embeddings at all locations. The DVs can then be calculated by subtracting the actual token embeddings from the predicted embeddings, resulting in a total of 𝑛 DVs.

4.1. Unsupervised Method: DV-Distance

To compare the direction of deviation between two documents, we calculate the element-wise mean of all the DVs throughout each document to obtain the "Averaged DV". For a given document of 𝑛 tokens, $ADV(doc) = \frac{1}{n} \sum_{i=1}^{n} DV_i$. Notice that locations with a larger deviance between 𝐿𝑀 and 𝐸𝑀𝐵 exert a larger influence on the document-level 𝐴𝐷𝑉 through their corresponding 𝐷𝑉s.
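For illustration, the following is a minimal sketch of the DV and ADV computation of Equation (1) using the HuggingFace transformers implementation of RoBERTa. Using the final hidden state at the masked position as a proxy for the predicted embedding, as well as the specific checkpoint name and sequence length, are illustrative assumptions rather than prescribed implementation details.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

# Illustrative sketch of Equation (1): DV_i = LM(w_i) - EMB(w_i).
# Assumption: the final hidden state at the masked position serves as a
# proxy for the predicted embedding LM(w_i).
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()


def deviation_vectors(text: str) -> torch.Tensor:
    """Return one DV per token (the special tokens <s> and </s> are skipped)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    input_ids = enc["input_ids"]                               # shape (1, n)
    with torch.no_grad():
        # EMB(w_i): static input embeddings of the actual tokens
        emb = model.embeddings.word_embeddings(input_ids)[0]   # shape (n, d)

    dvs = []
    for i in range(1, input_ids.size(1) - 1):
        masked = input_ids.clone()
        masked[0, i] = tokenizer.mask_token_id                 # mask location i
        with torch.no_grad():
            out = model(input_ids=masked,
                        attention_mask=enc["attention_mask"])
        lm_i = out.last_hidden_state[0, i]                     # LM(w_i) proxy
        dvs.append(lm_i - emb[i])                              # DV_i
    return torch.stack(dvs)


def averaged_dv(text: str) -> torch.Tensor:
    """ADV(doc): element-wise mean of all DVs in the document."""
    return deviation_vectors(text).mean(dim=0)
```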
Averaged DVs are calculated for both 𝐾 and 𝑈; the DV-Distance can then be calculated as the cosine similarity between 𝐴𝐷𝑉(𝐾) and 𝐴𝐷𝑉(𝑈):

    $DVDist(K, U) = \frac{ADV(K) \cdot ADV(U)}{\lVert ADV(K) \rVert \, \lVert ADV(U) \rVert}$    (2)

Since the DV-Distance method is completely unsupervised, the resulting distance values are relative rather than absolute; that is, it is difficult to determine the classification result for a single document pair in isolation. Instead, a threshold value needs to be determined, so that we can classify all document pairs with DV-Distance values greater than the threshold as "not same author" and vice versa. To determine the threshold, we follow previous PAN winners such as [7] and use the median of the DV-Distance values over all (𝐾, 𝑢) pairs within the dataset as the threshold. Using this scheme is reasonable because the PAN authorship verification datasets are guaranteed to be balanced. During our experiments, we found that the threshold value is relatively stable for a particular model on a given dataset, but can differ considerably between the LSTM-based and BERT-based models. For real-world applications, the threshold value can be determined ahead of time using a large dataset of a similar genre and format to the problems to be evaluated.

Figure 4: Network architecture of the DV-Projection method. The vectors 𝐸𝑀𝐵, 𝐿𝑀 and 𝐷𝑉 are represented using rounded rectangles, fully connected layers using trapezoids, and element-wise math operations using circles.

4.2. Supervised Method: DV-Projection

One of the major deficiencies of our Deviation Vector theory is that it assumes all differences in the DV hyperspace are relevant. However, one can imagine that this assumption does not hold in all authorship verification settings. For example, the gender-dimension shift shown in Figure 2 can be a useful clue when conducting authorship verification on a Twitter dataset or in the context of autobiographies. It may be less relevant if the gender shift occurs in a novel, as the vocabulary used in a novel relates more to its characters' genders than to the author's. To address this issue, we propose a supervised neural network architecture that projects the DVs onto the axes that are most helpful for distinguishing authorship features. As we will demonstrate in the results and analysis section of this work, these DV projections are very effective when combined with the original token embeddings generated by the language models.

Here we formally define the DV-Projection process. We are given the embeddings and DVs for both a known document and an unknown document, denoted $EMB_i^k$, $DV_i^k$, $EMB_j^u$ and $DV_j^u$. We apply the dense layers $P_e$ and $P_{dv}$ to the embeddings and DVs, respectively, to extract prominent features. These features are then fed together into the dense layer $P_{inter}$. The outputs of $P_{inter}$ are average-pooled along the sequence to produce document-level features. Lastly, the features from both known and unknown documents are connected to two additional fully-connected layers, $P_{d1}$ and $P_{d2}$, to produce the final output.
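For concreteness, a minimal PyTorch sketch of this projection architecture is given below. The hidden-layer widths and the input dimensionality are illustrative assumptions; only the roles of the layers ($P_e$, $P_{dv}$, $P_{inter}$, average pooling, $P_{d1}$, $P_{d2}$) and the hyperbolic tangent activations (see Equation 3 below) follow the description in this subsection.

```python
import torch
import torch.nn as nn


class DVProjection(nn.Module):
    """Sketch of the DV-Projection network; hidden sizes are illustrative."""

    def __init__(self, dim: int = 768, hidden: int = 128):
        super().__init__()
        self.p_e = nn.Linear(dim, hidden)        # P_e: projects token embeddings
        self.p_dv = nn.Linear(dim, hidden)       # P_dv: projects deviation vectors
        self.p_inter = nn.Linear(2 * hidden, hidden)
        self.p_d1 = nn.Linear(2 * hidden, hidden)
        self.p_d2 = nn.Linear(hidden, 1)         # final logit
        self.act = nn.Tanh()

    def doc_feature(self, emb: torch.Tensor, dv: torch.Tensor) -> torch.Tensor:
        # emb, dv: (seq_len, dim) for one document segment
        token_feat = self.act(self.p_inter(torch.cat(
            [self.act(self.p_e(emb)), self.act(self.p_dv(dv))], dim=-1)))
        return token_feat.mean(dim=0)            # average pooling over the sequence

    def forward(self, emb_k, dv_k, emb_u, dv_u) -> torch.Tensor:
        feat = torch.cat([self.doc_feature(emb_k, dv_k),
                          self.doc_feature(emb_u, dv_u)], dim=-1)
        return self.p_d2(self.act(self.p_d1(feat)))   # train with BCEWithLogitsLoss
```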
The operations above are summarized in Equation 3 and visualized in Figure 4; all layers use the hyperbolic tangent as the activation function:

    $TokenFeature_i^k = P_{inter}(P_e(EMB_i^k), P_{dv}(DV_i^k))$    (3)
    $TokenFeature_j^u = P_{inter}(P_e(EMB_j^u), P_{dv}(DV_j^u))$
    $DocFeature^k = AvgPool(TokenFeature^k)$
    $DocFeature^u = AvgPool(TokenFeature^u)$
    $logit = P_{d2}(P_{d1}(DocFeature^k, DocFeature^u))$

To allow training of the above model together with RoBERTa, we break the documents from the original training document pairs into segments of 128 tokens. We then build smaller training example pairs from these short document segments and label them accordingly. This approach not only yields many more training examples with which to properly train the network parameters, it also forces the model to be more robust by limiting the amount of text it has access to. The training loss is binary cross-entropy combined with the sigmoid function.

Because the DV-Projection method is a supervised model, from a theoretical perspective it can learn the optimal threshold for classification, eliminating the need to use a median value as the threshold. However, the segment-based training pair generation method can generate significantly more "same author" pairs than "different author" pairs. Therefore, the resulting trained model is biased and cannot be assumed to have a zero-valued threshold.¹ For consistency, we also use the testing set median value as the threshold for the DV-Projection method.²

¹ In real-world applications, this problem can easily be addressed by generating a large and balanced training dataset.
² One can also opt to use the training set median value as the threshold. To give a rough impression of how this affects performance: on the PAN14N dataset, using the testing set median as the threshold produces 61% accuracy, while using the training set median produces 65%; on the PAN14E dataset, the testing set median produces 73% accuracy, while the training set median produces 70%.

5. Experiments

The goal of the empirical study described in the following sections is to validate the proposed DV-Distance and DV-Projection methods. For this purpose, we use the authorship verification datasets released by PAN in 2013 [18], 2014 [19] and 2015 [20].

5.1. Datasets

The 2013 version of the PAN dataset consists of 10 training problems and 30 testing problems. PAN 2014 includes two separate datasets, Novels and Essays. PAN 2014N consists of 100 English novel problems for training and 200 English problems for testing. PAN 2014E consists of 200 English essay problems for training and 200 English essay problems for testing. PAN 2015 is a cross-topic, cross-genre author verification dataset, which means the known documents and the unknown document may come from different domains. PAN 2015 contains 100 training problems and 500 testing problems.

5.2. Evaluation Metrics

For each PAN dataset, we follow that year's challenge rules. PAN 2013 uses accuracy, the Receiver Operating Characteristic (ROC), and $Score = Accuracy \times ROC$. PAN 2014 introduces the c@1 measure to replace accuracy, to potentially reward contestants who choose not to provide an answer in some circumstances. This metric was proposed in [21], and it is defined as

    $c@1 = \frac{1}{n} \times \left(n_c + n_u \times \frac{n_c}{n}\right)$    (4)

where $n_c$ is the number of problems correctly classified, and $n_u$ is the number of open problems.
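As a worked example of Equation (4), the following minimal sketch computes c@1 for hypothetical counts.

```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1 (Equation 4): unanswered problems are credited at the accuracy
    achieved on the answered ones."""
    return (n_correct + n_unanswered * n_correct / n_total) / n_total

# Hypothetical example: 140 correct answers, 20 problems left unanswered,
# 200 problems in total:
#   c@1 = (140 + 20 * 140 / 200) / 200 = 0.77
print(c_at_1(140, 20, 200))  # 0.77
```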
The Score for PAN 2014 and 2015 is calculated as the product of c@1 and ROC: $Score = c@1 \times ROC$.

                             PAN14E                  PAN14N
Category     Method            c@1    ROC    Score     c@1    ROC    Score
Baseline     GNB               0.675  0.741  0.5       0.56   0.743  0.416
Baseline     LR                0.675  0.728  0.491     0.515  0.604  0.311
Baseline     MLP               0.7    0.768  0.538     0.54   0.782  0.422
PAN          FCMC [22]         0.58   0.602  0.349     0.71   0.711  0.508
PAN          Frery [23]        0.71   0.723  0.513     0.59   0.61   0.36
             TE [8]            0.67   0.675  0.452     0.695  0.7    0.487
             2WD-UAV [14]      0.73   0.761  0.555     0.68   0.801  0.552
Our model    DV-Dist. L        0.58   0.575  0.334     0.82   0.79   0.648
Our model    DV-Dist. R        0.52   0.526  0.274     0.71   0.739  0.525
Our model    DV-Proj. R        0.73   0.778  0.569     0.61   0.668  0.41

                             PAN13                   PAN15
Category     Method            Acc.   ROC    Score     c@1    ROC    Score
Baseline     GNB               0.633  0.795  0.503     0.552  0.78   0.431
Baseline     LR                0.7    0.781  0.547     0.544  0.796  0.433
Baseline     MLP               0.533  0.5    0.267     0.554  0.687  0.381
PAN          MRNN [7]          -      -      -         0.76   0.81   0.61
PAN          Castro [24]       -      -      -         0.69   0.75   0.52
PAN          GenIM [25]        0.8    0.792  0.633     -      -      -
PAN          CNG [26]          -      0.842  -         -      -      -
             TE [8]            0.8    0.835  0.668     0.748  0.75   0.561
             2WD-UAV [14]      0.82   0.825  0.677     0.75   0.822  0.617
Our model    DV-Dist. L        0.7    0.763  0.534     0.76   0.834  0.634
Our model    DV-Dist. R        0.63   0.746  0.472     0.716  0.767  0.548

Table 1: Authorship Verification results for the PAN datasets.

5.3. Baselines

Classic Models with N-gram Features: In our study, we use the set of baselines reported in [8]. They are produced using seven sets of features, including word n-grams, POS n-grams, and character 4-grams. Because the baselines are standard classification algorithms, the features need to be transformed: according to the authors, simple concatenation of the two documents' feature vectors produces poor results, so seven different functions are used to measure the similarity between the feature vectors of both documents, including cosine distance, Euclidean distance, and a linear kernel. Several common classifiers are trained and evaluated using these similarity measurements, providing a reasonable representation of the performance achievable with classic machine learning models and n-gram feature sets. Of all the baseline results, the three classifiers with the highest performance are reported alongside the other PAN results for comparison. The selected classifiers are Gaussian Naive Bayes (GNB), Logistic Regression (LR) and Multi-Layer Perceptron (MLP). We compare them with the proposed approach along with the state-of-the-art methods.

PAN Winners: We compare our results to the best-performing methods submitted to PAN each year. The evaluation results of the participating teams are compiled in the overview reports of PAN 2013 [27], 2014 [28] and 2015 [13]. In PAN 2013, the best-performing methods are the General Impostors Method (GenIM) proposed by [25] and the Common N-Gram (CNG) dissimilarity measure proposed by [26]. In the PAN 2014 challenge, the best method for the English Essays dataset is proposed by [23] (Frery), and the best method for the English Novels dataset is by [22], which uses Fuzzy C-Means Clustering (FCMC). In PAN 2015, the Multi-headed Recurrent Neural Network (MRNN) proposed in [7] outperforms the second-best submission (Castro) [24] of the same year by a large margin.

Transformation Encoder: In [8], an auto-encoder-based authorship verification model performed competitively on PAN. We include its results to evaluate our model against one of the newest and strongest performers.
2WD-UAV: A language-modeling-based approach that relies on transfer learning, an ensemble of heavily regularized deep classification models, and data augmentation; it shows state-of-the-art performance, surpassing all verification methods evaluated on PAN that we are aware of [14]. Like our approach, it is based on a deep language model; however, it is otherwise similar to the majority of solid AV performers.

6. Results and Discussion

Table 1 shows the results of the experiments on the PAN datasets detailed in Section 5. The proposed unsupervised DV-Distance method using AWD-LSTM and RoBERTa is denoted "DV-Dist. L" and "DV-Dist. R", respectively. The proposed supervised DV-Projection method is trained using DVs produced by RoBERTa and is labeled "DV-Proj. R" in the table. We were only able to train the projection model on PAN14E and PAN14N, because both have relatively large training sets.

For PAN 2013, our results are slightly below the best performer of that year in terms of accuracy and AUC-ROC; the 0.1 difference in accuracy translates to a difference of 3 problems out of 30 testing problems. The PAN 2013 corpus consists of text segments from published Computer Science textbooks. The best-performing model on this dataset is the neural-network-based model from 2WD-UAV.

For PAN 2014, we observed some interesting results. On the Novels part of the challenge, our unsupervised DV-Distance method based on LSTMs drastically improves upon previous state-of-the-art models, surpassing the previous best result by 18 percent. On the other hand, on the Essays dataset, both unsupervised DV-Distance methods failed to capture the features necessary for the task, reaching only 58% and 52% accuracy. However, the supervised DV-Projection method successfully projects the DVs generated using RoBERTa into a hyperspace suitable for the essay AV problems, resulting in a significant performance improvement over the unsupervised models and slightly outperforming the previous best result from 2WD-UAV.

The PAN 2015 edition focuses on cross-genre and cross-topic authorship verification. Based on our observations, the corpus mainly consists of snippets of novels of different genres and sometimes poems. Our proposed DV-Distance method based on multi-layer LSTMs once again shows excellent performance on this dataset, slightly outperforming the previous best model, MRNN [7]. In cross-domain settings like PAN 2015, the problem of non-comparability is likely to be very pronounced. The strong performance of our methods on this dataset therefore verifies that they are quite robust against domain shift and non-comparability.

Overall, we have observed two consistent trends in our experiments. First, we find that the AWD-LSTM-based DV-Distance method consistently performs better than the RoBERTa-based DV-Distance method. At first glance, this may seem counter-intuitive, as BERT-based models are generally regarded as among the best-performing models for language modeling. We theorize that this is precisely the culprit: RoBERTa is able to predict the target word much more accurately, both because of its architectural advantages and because it simply has access to more contextual information. However, if the language model performs "too accurately", it fails to act as a model of the averaged writing style and instead mimics the author's tone and style.
From a mathematical perspective, predictions that are "too accurate" cause the 𝐷𝑉s calculated using Equation (1) to have magnitudes close to zero, so the later steps in Equations (2) and (3) have very little information to work with.

Second, we find that our proposed methods are most suitable for novel and fiction-type documents. Our methods demonstrated state-of-the-art performance on both PAN 2014 Novels and PAN 2015, both of which consist mainly of novel documents. On the other hand, PAN 2013 and PAN 2014 Essays contain writing that is more formal and academically oriented, on which our models performed less competitively. We theorize that this is because essay documents are easier to predict, whereas novels are much more "unpredictable". This difference in predictability means that on the novel datasets we can obtain higher-quality DVs, while on the essay datasets the language models once again make predictions that are "too accurate", corroborating the first trend discussed above.

Deviation vectors of two PAN 2015 document pairs are visualized in Figure 5. Figure 5a shows two documents from different authors, while Figure 5b shows two documents by the same author. The plots are generated by conducting PCA on the DVs at each word, projecting the 400-dimensional DVs from AWD-LSTM down to 2 dimensions. A longer line in the plots hence represents a bigger deviation from the NWS. We can observe that in Figure 5a the DVs point in opposite directions, while in Figure 5b their directions are similar.

Figure 5: Visualization of deviation vectors in 2D. (a) DVs of a document pair by different authors. (b) DVs of a document pair by the same author. Each line corresponds to a word-level DV, and all words in a document are visualized in one subplot. The arrows in each subplot represent the averaged DV direction of that document.

7. Related Work

Much of the existing work in authorship verification is based on vocabulary distributions, such as n-gram frequencies. The hypothesis behind these models is that the relative frequencies of words or word combinations can be used to profile the author's writing style [1, 29]. One can conclude that two documents are more likely to be from the same author when the distributions of their vocabularies are similar. For example, in one document we may find that the author frequently uses "I like ...", while in another document the author usually writes "I enjoy ...". Such a difference may indicate that the documents are from different authors. This well-studied approach has had many successes, such as settling the dispute over the "Federalist Papers" [30]. However, its results are often less than ideal when dealing with a limited-data challenge. The number of documents in 𝐾 and 𝑈 is often insufficient to build two comparable unigram word distributions, let alone 3-gram or 4-gram ones.

The depth of difference between two sets of documents is often measured using the unmasking technique while ignoring the negative examples [31]. This one-class technique achieves high accuracy for 21 considerably large (over 500K) eBooks. A simple feed-forward three-layer auto-encoder (AE) can be used for AV, treating it as a one-class classification problem [32]. The authors observe the behavior of the AE on documents by different authors and build a classifier for each author. The idea originates from one of the first applications of auto-encoders to novelty detection in classification problems [33].
AV has also been studied for detecting the linguistic traits of sockpuppets in order to verify the authorship of a pair of accounts in online discussion communities [34]. A spy induction method was proposed to leverage the test data during the training step under an "out-of-training" setting, where the author in question is from a closed set of candidates while appearing unknown to the verifier [35].

In a more realistic setting, we have no specified writing samples of a questioned author, and there is no closed candidate set of authors. Since 2013, there has been a surge of interest in this type of AV problem. The authors of [36] investigate whether one document is an outlier in a corpus by generalizing the Many-Candidates method of [37]. The best method of PAN 2014E optimizes a decision tree, enriched by adopting a variety of features and similarity measures [6]. For PAN 2014N, the best results are achieved using fuzzy C-Means clustering [38]. In an alternative approach, [39] generate a set of impostor documents and apply iterative feature randomization to compute the similarity distance between pairs of documents. One of the more exciting and powerful approaches investigates the language model of all authors using a shared recurrent layer and builds a classifier for each author [40]. Parallel recurrent neural network and transformation auto-encoder approaches produce excellent results for a variety of AV problems [8], ranging from PAN to the authorship attribution of scientific publications [9]. In 2017, a non-machine-learning model comprised of a compression algorithm, a dissimilarity method, and a threshold was proposed for AV tasks, achieving first place in two of four challenges [41].

Among the models mentioned above, the MRNN proposed in [7] is the most comparable to the method introduced in this work. MRNN is an RNN-based character-level neural language model that models the flow of the known author's documents 𝐾 and is then applied to the unknown document 𝑢. If the language model proves good at predicting the text of the unknown document (lower cross-entropy), then one can conclude the documents are likely written by the same author. While both MRNN and our DV-Distance-based methods utilize neural language modeling, for MRNN the language model represents a specific author's writing style and needs to be trained on the corpus 𝐾. In practice, training a language model on a small corpus without overfitting can be very challenging, if not impossible. In contrast, the DV-Distance methods proposed in this work do not require training an author-specific language model; instead, both known and unknown documents are compared against a common language model, allowing evaluation on AV problems with shorter documents.

8. Conclusion

In this paper, we present a novel approach to the authorship verification problem. Our method relies on deep neural language models to model the Normal Writing Style and then computes the proposed DV-Distance between the set of known documents and the unknown document. The evaluation shows that differences in authorship style are strongly correlated with the proposed distance metric. Our method outperforms several state-of-the-art models on multiple datasets, both in terms of accuracy and speed.

9. Acknowledgement

Research was supported in part by grants NSF 1838147, NSF 1838145, and ARO W911NF-20-1-0254. The views and conclusions contained in this document are those of the authors and not of the sponsors.
The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

[1] E. Stamatatos, A survey of modern authorship attribution methods, Journal of the American Society for Information Science and Technology 60 (2009) 538–556. URL: http://doi.wiley.com/10.1002/asi.21001. doi:10.1002/asi.21001.
[2] K. Luyckx, W. Daelemans, Authorship Attribution and Verification with Many Authors and Limited Data, Technical Report, 2008. URL: http://www.cnts.ua.ac.be/.
[3] M. Kestemont, M. Tschuggnall, E. Stamatatos, W. Daelemans, G. Specht, B. Stein, M. Potthast, Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection, Technical Report, 2018. URL: http://pan.webis.de.
[4] J. Bevendorff, B. Ghanem, A. Giachanou, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, G. Specht, E. Stamatatos, B. Stein, M. Wiegmann, E. Zangerle, Overview of PAN 2020: Authorship Verification, Celebrity Profiling, Profiling Fake News Spreaders on Twitter, and Style Change Detection, Technical Report, 2020. doi:10.1007/978-3-030-58219-7_25.
[5] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.
[6] J. Frery, C. Largeron, M. Juganaru-Mathieu, UJM at CLEF in author verification based on optimized classification trees, in: Conference and Labs of the Evaluation Forum 2014, 2014, p. 7p.
[7] D. Bagnall, Author Identification using Multi-headed Recurrent Neural Networks, arXiv (2015). URL: http://arxiv.org/abs/1506.04891. arXiv:1506.04891.
[8] M. Hosseinia, A. Mukherjee, Experiments with neural networks for small and large scale authorship verification, CoRR abs/1803.06456 (2018). URL: http://arxiv.org/abs/1803.06456. arXiv:1803.06456.
[9] D. Boumber, Y. Zhang, A. Mukherjee, Experiments with convolutional neural networks for multi-label authorship attribution, in: N. Calzolari (Conference Chair), K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Paris, France, 2018.
[10] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, 2018. arXiv:1801.06146.
[11] S. Merity, N. S. Keskar, R. Socher, Regularizing and optimizing LSTM language models, CoRR abs/1708.02182 (2017). URL: http://arxiv.org/abs/1708.02182. arXiv:1708.02182.
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv (2019). URL: http://arxiv.org/abs/1907.11692.
[13] E. Stamatatos, W. Daelemans, B. Verhoeven, P. Juola, A. López-López, M. Potthast, B. Stein, L. Cappellato, N. Ferro, G. Jones, E. San Juan, Overview of the Author Identification Task at PAN 2015 (2015) 8–11. URL: http://pan.webis.de.
[14] D. Boumber, Y. Zhang, M. Hosseinia, A. Mukherjee, R. Vilalta, Robust authorship verification with transfer learning, EasyChair Preprint no. 865, 2019. doi:10.29007/9nf3.
[15] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, Computing Research Repository abs/1301.3781 (2013). URL: http://arxiv.org/abs/1301.3781.
[16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2013, pp. 3111–3119.
[17] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[18] P. Juola, E. Stamatatos, Overview of the author identification task at PAN 2013, in: CLEF (Working Notes), 2013.
[19] E. Stamatatos, W. Daelemans, B. Verhoeven, M. Potthast, B. Stein, P. Juola, M. A. Sanchez-Perez, A. Barrón-Cedeño, Overview of the author identification task at PAN 2014, in: CLEF 2014 Evaluation Labs and Workshop Working Notes Papers, Sheffield, UK, 2014, pp. 1–21.
[20] E. Stamatatos, M. Potthast, F. Rangel, P. Rosso, B. Stein, Overview of the PAN/CLEF 2015 evaluation lab, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2015, pp. 518–538.
[21] A. Penas, A. Rodrigo, J. del Rosal, A Simple Measure to Assess Non-response, HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies 1 (2011) 1415–1424. URL: http://hnk.ffzg.hr/bibl/acl2011/Long/pdf/ACL-HLT2011142.pdf.
[22] P. Modaresi, P. Gross, A Language Independent Author Verifier Using Fuzzy C-Means Clustering. Notebook for PAN at CLEF 2014, Technical Report, 2014. URL: https://pdfs.semanticscholar.org/6e7a/326a5075ff3fd2c8d1c692b11c856a2f5f3c.pdf.
[23] J. Fréry, C. Largeron, M. Juganaru-Mathieu, UJM at CLEF in Author Verification based on optimized classification trees. Notebook for PAN at CLEF 2014, Technical Report, 2014. URL: http://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-FreryEt2014.pdf.
[24] D. Castro, Y. Adame, M. Pelaez, R. Muñoz, Authorship verification, combining linguistic features and different similarity functions, CLEF (Working Notes) (2015).
[25] S. Seidman, Authorship Verification Using the Impostors Method, Technical Report, 2013. URL: http://ceur-ws.org/Vol-1179/CLEF2013wn-PAN-Seidman2013.pdf.
[26] M. Jankowska, V. Kešelj, E. Milios, Proximity based one-class classification with Common N-Gram dissimilarity for authorship verification task, Technical Report, 2013. URL: https://pdfs.semanticscholar.org/6b31/8009ebee54573e6402a4e5247089accd16a2.pdf.
[27] P. Juola, E. Stamatatos, Overview of the Author Identification Task at PAN 2013 (2013).
[28] E. Stamatatos, W. Daelemans, B. Verhoeven, M. Potthast, B. Stein, P. Juola, M. Sanchez-Perez, A. Barrón-Cedeño, Overview of the Author Identification Task at PAN 2014 1180 (2014) 877–897.
[29] D. L. Hoover, Statistical Stylistics and Authorship Attribution: an Empirical Investigation, Literary and Linguistic Computing 16 (2001) 421–444. URL: https://academic.oup.com/dsh/article-lookup/doi/10.1093/llc/16.4.421. doi:10.1093/llc/16.4.421.
[30] J. Rudman, The twelve disputed 'Federalist' papers: A case for collaboration (2012).
[31] M. Koppel, J. Schler, Authorship verification as a one-class classification problem, in: Proceedings of the Twenty-First International Conference on Machine Learning, Association for Computing Machinery, 2004, p. 62.
[32] L. Manevitz, M. Yousef, One-class document classification via neural networks, in: Neurocomputing, volume 70, Elsevier, 2007, pp. 1466–1481.
[33] N. Japkowicz, C. Myers, M. Gluck, et al., A novelty detection approach to classification, in: IJCAI, volume 1, 1995, pp. 518–523.
[34] S. Kumar, J. Cheng, J. Leskovec, V. Subrahmanian, An army of me: Sockpuppets in online discussion communities, in: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 857–866.
[35] M. Hosseinia, A. Mukherjee, Detecting sockpuppets in deceptive opinion spam, Computing Research Repository abs/1703.03149 (2017).
[36] S. Seidman, Authorship verification using the impostors method, in: Conference and Labs of the Evaluation Forum 2013 Evaluation Labs and Workshop – Working Notes Papers, Citeseer, 2013, pp. 23–26.
[37] M. Koppel, J. Schler, S. Argamon, Authorship attribution in the wild, Language Resources and Evaluation 45 (2011) 83–94.
[38] P. Modaresi, P. Gross, A language independent author verifier using fuzzy c-means clustering, in: Conference and Labs of the Evaluation Forum (Working Notes), 2014, pp. 1084–1091.
[39] M. Koppel, Y. Winter, Determining if two documents are written by the same author, in: Journal of the Association for Information Science and Technology, volume 65, Wiley Online Library, 2014, pp. 178–187.
[40] D. Bagnall, Author identification using multi-headed recurrent neural networks, arXiv preprint arXiv:1506.04891 (2015).
[41] O. Halvani, C. Winter, L. Graner, On the usefulness of compression models for authorship verification, in: Proceedings of the 12th International Conference on Availability, Reliability and Security, ACM, 2017, p. 54.