=Paper=
{{Paper
|id=Vol-3290/long_paper5362
|storemode=property
|title=Boosting Word Frequencies in Authorship Attribution
|pdfUrl=https://ceur-ws.org/Vol-3290/long_paper5362.pdf
|volume=Vol-3290
|authors=Maciej Eder
|dblpUrl=https://dblp.org/rec/conf/chr/Eder22
}}
==Boosting Word Frequencies in Authorship Attribution==
Boosting Word Frequencies in Authorship Attribution

Maciej Eder
Institute of Polish Language, Polish Academy of Sciences, al. Mickiewicza 31, 31–120 Kraków, Poland

Abstract

In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words in some way semantically similar to the word in question. To determine such a semantic background, one of the available word embedding models can be used. The proposed method outperforms classical most-frequent-word approaches substantially, usually by a few percentage points, depending on the input settings.

Keywords: authorship attribution, stylometry, relative word frequencies, word vectors, semantic neighbors

CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium
maciej.eder@ijp.pan.pl | https://maciejeder.org/ | ORCID: 0000-0002-1429-5036 (M. Eder)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In a vast majority of text classification studies aimed at distinguishing a unique authorial signal – these include authorship attribution investigations, authorship profiling, verification, and similar tasks – relative frequencies of the most frequent words (MFWs) are routinely used as the language features to betray the authorial “fingerprint”. A vector of such relative word frequencies is then passed to one of the multidimensional machine-learning classification techniques, ranging from simple distance-based lazy learners, such as Delta [2, 5], to sophisticated deep learning neural network setups [6].

Recent advances in machine learning methodology – unheard-of and unprecedented – have massively reshaped the field of text classification. Three main methodological directions are actively researched: firstly, new classifiers emerge on the horizon to clearly outperform classical solutions; secondly, feature engineering and dimensionality reduction techniques are introduced to overcome the curse of high dimensionality; and thirdly, alternative style-markers that can betray authorial idiosyncrasies are being introduced. The present paper explores none of the above directions, though. Instead, I argue that a reasonable amount of overlooked stylistic information resides in the time-proven, standard bag-of-words representation of textual data, which is routinely used in dozens of stylometric studies.

Certainly, there exist alternative features that prove to be efficient style-markers in authorship attribution setups. Most notably, letter n-grams have been suggested as a strong authorial indicator [16]. Also, grammatical features, such as POS-tag n-grams, turned out to retain information about authorial uniqueness [8]. Other intriguing ideas include observing the immediate lexical context around proper nouns [12]. Even if such alternative textual features exhibit a great deal of potential to enhance text classification [3], the standard approach relying on word frequencies continues to be predominant in the field [7, 18].
In this paper, word frequencies will be used as well, yet the step of normalizing them into relative frequencies will be somewhat enhanced. Specifically, all the other words used to normalize the frequencies will be evaluated and then reduced, so that a given word in question is normalized by its actual semantic background. However, the general idea of enhancing the frequencies can be extended, I believe, to other style-markers, including extra-lexical ones.

2. Word frequencies

The notion of relative word frequencies is fairly simple. We count all the tokens belonging to particular types (e.g. all the attestations of the word “the”, followed by the attestations of “in”, “for”, “of”, etc.), and for each word, we divide its number of occurrences by the total number of tokens in a document. Consequently, each word frequency is equal to its proportion within the document (e.g. “the” = 0.0382), and all the frequencies sum up to 1. The reason for converting occurrences to relative frequencies is obvious: by doing so, one is able to reliably compare texts that differ in length.

The notion of relative word frequencies is so natural and intuitive that one might very easily overlook its methodological implications, as if it were nothing other than a simple normalization procedure. For the sake of this paper, however, it is important to realize that relative frequencies are relative to all the other words in the document in question. Convenient as they are, these values are at the same time very small and – importantly – they are affected by hundreds of other word frequencies. Consequently, the final values might not be sufficiently precise to capture minute differences between word occurrences, because the normalization factor evens them out to some extent.

Now, what if we disregard thousands of other words in a text, and instead compute the frequencies in relation to a small number of words that are relevant? An obvious example is the mutual relation between the words “on” and “upon” in one document [15]; essentially, more attestations of “upon” come at the cost of the occurrences of the word “on” – and vice versa. While the classical relative frequency of the word “on” in Emily Brontë’s Wuthering Heights is 0.00687, the proportion of “on” relative exclusively to “upon” is 0.9762. It is assumed in this paper that the latter frequency can betray the authorial signal to a greater extent than the classical approach, because the myriads of other words are not blurring the final value.
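To make the contrast concrete, here is a minimal generic-R sketch of the two normalization strategies. The counts and the total text length are hypothetical, chosen only so that the two proportions quoted above are reproduced; they are not taken from the actual novel.

counts = c(on = 1230, upon = 30)   # hypothetical occurrences in one novel
total_tokens = 179000              # hypothetical total number of tokens

# classical relative frequency: all other words sit in the denominator
classical = counts["on"] / total_tokens
classical        # ca. 0.00687

# frequency relative exclusively to the one relevant word "upon"
subset_based = counts["on"] / (counts["on"] + counts["upon"])
subset_based     # ca. 0.9762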
The idea of looking into semantics is not entirely new, since thesaurus-based approaches have already been proposed in the context of authorship attribution [11, 10]. It has been suggested that a list of words organized into near-synonymous sets (“synsets”) and/or into larger hierarchies can be used to extract the authorial signal [9]; it has also been demonstrated that pairs of synonyms might contain valuable authorial information [1]. However, the above approaches are focused on identifying meaningful words beyond the usual MFWs, whereas the present study aims to show that there is still some room to enhance the very MFWs.

3. Method

Given the above “on” and “upon” example, it would be tempting to identify one synonym for each of the words, and to compute the relative proportions in each of the synonym pairs, as suggested in the already cited study [1]. Linguistically speaking, however, such an approach would hardly be feasible. Firstly, only a fraction of words have proper synonyms. Secondly, some semantic fields are rather rich and cannot be reduced to a mere pair of synonyms. Thirdly, in the case of the most frequent words (articles, particles, prepositions), identifying synonyms doesn't make much sense; yet still, relevant counterparts for these frequent words obviously exist.

On theoretical grounds, however, it is difficult to speculate whether the number of relevant counterparts should be restricted to a single word – as in the example of “on” defined by its relation to “upon” – or include, say, a dozen related words. E.g., to determine the relative frequency of the word “make”, one would probably measure its proportion against the sum of occurrences of “do”, “prepare”, “create”, “turn”, “craft”, “invent”, etc. The effective size of the semantic background is, however, very difficult to conceptualize – not only the actual number of related words, but even the order of magnitude is unknown. Take the above example: should the word “make” be calculated against its 10 most similar words, or would a semantic background of 100 words be better?

Another nontrivial question concerns the very method of extracting synonyms and other semantically related words from a corpus. While a thesaurus-based search might prove feasible for single words, it will certainly become more demanding when dozens of seed words are concerned. There exist, however, at least two strategies to approach the issue computationally. One strategy involves WordNet, a manually compiled database of thousands of words with their semantic and syntactic relations [14], while the other relies on distributional semantics methods. In particular, the algorithm word2vec should be mentioned in this context [13], which provides a vector representation of words that allows for identifying their semantic similarities. Even if these inferred similarities do not comply with any formal grammar (rather, the relations are known to be fuzzy at times), they usually look convincing to a human observer.

In the present study, a word vector model, GloVe [17], was used to identify word similarities: it was trained on the benchmark corpus of 99 English novels (as described below), with 100 target dimensions. A semantic background for a given seed word was defined as its n neighboring vectors; consequently, the resulting semantic background contained the most similar vectors for a given seed word. E.g., the neighbors for the word “person” were: “woman”, “gentleman”, “man”, “one”, “sort”, “whom”, “thing”, “young”, etc., whereas the neighbors for the word “the” were as follows: “of”, “this”, “in”, “there”, “on”, “one”, “which”, “its”, “was”, “a”, “and”, etc. For each target word, a relative frequency was calculated as the number of its occurrences divided by the joint sum of its own occurrences and the occurrences of its n semantic neighbors (n being the size of the semantic space to be tested), in line with the “on”/“upon” example above.
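As a concrete illustration of the neighbor-extraction step, the following generic-R sketch computes cosine similarities against a trained embedding model and returns the n most similar words. The object word_vectors (a words-by-dimensions matrix with words as row names) is an assumption: the paper used a GloVe model trained on the benchmark corpus, but any comparable model would do here.

nearest_neighbors = function(word, word_vectors, n = 10) {
    v = word_vectors[word, ]
    # cosine similarity between the seed word and every word in the model
    sims = as.vector(word_vectors %*% v) /
           (sqrt(rowSums(word_vectors^2)) * sqrt(sum(v^2)))
    names(sims) = rownames(word_vectors)
    # exclude the seed word itself, keep the n most similar words
    sims = sort(sims[names(sims) != word], decreasing = TRUE)
    names(sims)[1:n]
}

# e.g. nearest_neighbors("person", word_vectors, n = 8) should return
# words such as "woman", "gentleman", "man", ...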
In order to corroborate the above intuitions, a controlled authorship attribution experiment was designed. A benchmark corpus of 99 English novels was used: it consists of 33 authorial classes with 3 novels per author, and is freely available in a GitHub repository: https://github.com/computationalstylistics/100_english_novels. A corpus of (naturally long) novels might be considered inferior for authorship benchmarks; the high number of authorial classes, however, makes the task difficult enough to sufficiently stress-test the classifier. To make the task even harder, the amount of training data was restricted to 1 text per author, whereas the remaining 2 texts per author were used as the validation set (the proportion of 33 vs. 66 texts was kept in each iteration).

Since the size of the semantic background is unknown, a grid-search framework was designed to systematically assess tighter (1 relevant counterpart) and broader semantic spaces (up to 10,000 words, inevitably going far beyond synonyms). The tests were performed using the package stylo for R [4]. Different classifiers, MFW vectors and, most importantly, different sizes of the semantic space were tested systematically, in a supervised setup with stratified cross-validation. On theoretical grounds, a semantic space of size n = 80,000 (roughly the total number of word types in the benchmark corpus) would be equivalent to classical relative frequencies, whereas a space of size n = 1 means that the frequencies are relative to exactly one other word (e.g. the frequency of the word “the” would be the number of occurrences of “the” divided by the total number of “the” and “of”).

Independently, an alternative set of tests was performed using regular relative frequencies; the outcomes of these tests served as a baseline. In each test reported in this paper, the F1 scores are used as a compact and reliable measure of performance.
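For readers who prefer to see the classification step spelled out, below is a minimal generic-R sketch of Burrows's Delta used as a nearest-neighbor classifier: features are z-scored with the training-set means and standard deviations, and each test text is assigned the author of the training text with the smallest mean absolute difference of z-scores. This is a stand-alone illustration, not the stylo code actually used in the experiments; train and test are assumed to be frequency matrices (one row per text) sharing the same columns, with author names in train_labels.

delta_classify = function(train, test, train_labels) {
    # z-scoring based on the training set (columns with zero variance
    # are assumed to have been removed beforehand)
    means = colMeans(train)
    sds = apply(train, 2, sd)
    z = function(m) sweep(sweep(m, 2, means, "-"), 2, sds, "/")
    z_train = z(train)
    z_test = z(test)
    predictions = character(nrow(z_test))
    for(i in 1:nrow(z_test)) {
        # mean absolute difference of z-scores = Burrows's Delta
        deltas = apply(z_train, 1,
                       function(row) mean(abs(row - z_test[i, ])))
        predictions[i] = train_labels[which.min(deltas)]
    }
    return(predictions)
}

Replacing the mean absolute difference with the cosine distance between the z-score vectors yields the Cosine Delta variant discussed below.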
4. Results

The obtained results (Table 1 and Fig. 1) clearly suggest that the new method outperforms the classical relative frequencies solution substantially, no matter which distance measure is used. In agreement with several previous studies, longer MFW vectors worked better than, say, 100 MFWs. Also according to expectation, Cosine Delta proves to be the undisputed winner among the classifiers. Counter-intuitive, however, was the behavior of the different classifiers with the enhanced word frequencies. As evidenced in the top left panel of Fig. 1, Cosine Delta works best with frequencies computed against 5–50 semantically similar words, whereas Burrows's Delta (top right) exhibits its sweet spot for 50–100 neighboring words, and so does Eder's Delta (bottom left). When the semantic background is further increased, the behavior of the particular classifiers becomes uniform across the board: the performance slowly but surely decreases, ultimately reaching the baseline level.

Figure 1: The performance (F1 scores) for a benchmark corpus of 99 English novels, and the Delta classifier. Distance measures involve Cosine Delta (top left), Classic Delta (top right), Eder's Delta (bottom left), and Manhattan (bottom right). The results depend on the MFW vector (y axis) and the size of the semantic space expressed as the number of most similar words in a vector model (x axis).

Table 1: The best performance (F1 scores) obtained in each tested scenario.

                      Relative frequencies   Enhanced frequencies
  Cosine Delta               0.908                 0.959
  Burrows's Delta            0.823                 0.838
  Eder's Delta               0.812                 0.830
  Manhattan                  0.679                 0.771

Since the introduction of Burrows's Delta, practitioners have been aware that scaling (z-scoring) the features is the very factor responsible for the performance boost observed in Delta and its derivatives. Even if the Manhattan distance does not scale the features (hence its unpopularity in text classification), the improved word frequencies behave differently than the standard approaches, which in turn might favor simple distances such as Manhattan. And indeed, the scores obtained for the Manhattan distance are radically better than the respective baseline (Fig. 1, bottom right); yet still, Manhattan cannot compete with z-scored distances.

According to the above results, a recipe for a successful authorship attribution setup seems to be as follows: take roughly 800–900 MFWs, compute their frequencies using, for each word, the occurrences of its 5–10 semantic neighbors, and then use the Cosine Delta classifier.

Since in authorship attribution the results are known to be unevenly distributed across different MFW vectors, let alone different classifiers, Fig. 2 presents the same outcomes as previously, yet this time defined as the improvement (in percentage points) over the baseline F1 scores.

Figure 2: The gain in performance (baseline F1 scores subtracted from the obtained F1 scores) for a benchmark corpus of 99 English novels.

While the overall best performance is obtained for ca. 850 MFWs computed against 5–10 words, the biggest gain over the baseline (more than 10 percentage points!) is provided by the following scenario: 300 MFW frequencies computed against a tight semantic background of 3 neighboring words. Other reasonable improvements are generally associated with short MFW vectors and a semantic background of 5–100 words. In the case of Burrows's Delta, which performed best with 900 MFWs computed against 60 neighboring words (Fig. 2, top right), the improvement over the baseline is biggest for short vectors of MFWs. Interestingly, for Burrows's Delta the new method proves to be worse than the baseline for long MFW vectors and tight semantic spaces of 1–10 neighboring words. The picture for Eder's Delta (bottom left) is similar to that for Burrows's method, even if its hot spot is slightly moved towards longer MFW vectors. Surprisingly enough, the results for the Manhattan distance turned out to be substantially different from the other methods, and much less predictable. A large and pronounced hot spot of radically improved performance forms for tight semantic spaces, across different MFW vectors. To its right, the mountain of performance is followed by a deep valley of no improvement at all, and then, counter-intuitively, another hill emerges, indicating a boost of performance for semantic spaces of 50–100 words. This behavior is difficult to explain.

The proposed way of identifying an arbitrarily chosen number of semantic neighbors might suffer from an uneven distribution of semantic neighbors in a given model (GloVe, word2vec, fastText, etc.). E.g., 50 neighboring lexemes might point to a semantically coherent area around a function word, or indicate but vague associations around a very specific technical term. To account for this factor, a second experiment was conducted, in which I defined the semantic background to be all the words located within a specific cosine distance from a given reference word. Consequently, rather than extracting n neighboring words, I now extracted all the words within the radius of 0.9 cosine similarity in the first iteration, then 0.85, 0.8, etc., all the way down to –0.9. The results for Cosine Delta and Burrows's Delta are shown in Fig. 3.
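This radius-based selection can be sketched in generic R as follows, under the same assumption about a word_vectors matrix as in the earlier neighbor-extraction sketch:

neighbors_within_radius = function(word, word_vectors, threshold = 0.5) {
    v = word_vectors[word, ]
    sims = as.vector(word_vectors %*% v) /
           (sqrt(rowSums(word_vectors^2)) * sqrt(sum(v^2)))
    names(sims) = rownames(word_vectors)
    # every word at least `threshold` cosine-similar to the seed word,
    # the seed word itself excluded; unlike the fixed-n variant, the
    # size of the background now varies from seed word to seed word
    names(sims)[sims >= threshold & names(sims) != word]
}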
Figure 3: The absolute performance (top) and the performance gain (bottom) for the corpus of 99 English novels. The distance measures are Cosine Delta (left) and Burrows's Delta (right). The semantic space is defined as the words within a given cosine distance from the respective source words. The results depend on the MFW vector (y axis) and the size of the semantic space (x axis).

As can be seen, a clear hot spot forms in the area of 0.7–0.5 cosine similarity, regardless of the number of MFWs or the classifier, and beyond the distance of 0.3 the performance hits the baseline level. The results confirm the general picture obtained in the previous experiment (Fig. 1), yet the sweet-spot area seems to be more difficult to generalize.

5. Discussion

The results presented in the previous section call for further exploration and, above all, for a concise discussion. A few general remarks can be formulated here:

1. No matter which classification method was used, the performance improvement turned out to be large, clearly suggesting that bare word occurrences retain much more authorial signal than the time-proven relative frequencies are able to betray. It can be safely hypothesized that the method introduced in this paper has barely opened a new perspective, rather than offered an ultimate solution to the problem.

2. In order to identify the words that matter, a word embedding model was used – and this, again, was far from an optimal solution. As a rough proxy, it was nevertheless able to improve the word frequencies in the range of 5–50 neighboring words. On theoretical grounds, a further improvement should be possible with a more precise method of identifying the relevant semantic background.

3. While the new method improves the performance across all the MFW strata, short MFW vectors seem to benefit more. Interesting from a theoretical point of view, this phenomenon also has a practical implication. Namely, even though several studies suggest that larger numbers of MFWs should be preferred as they generally exhibit better performance, it is also believed that they are more likely to be affected by genre, topic, and content of the analyzed texts. With this in mind, some practitioners choose to conduct authorship attribution on shorter MFW vectors. The method introduced in this paper can greatly improve the performance in such setups.

An observation that requires further investigation is the discrepancy between classifiers in how they react to the same semantic background. Contrary to intuition, for Burrows's Delta the improvement of performance was not simply correlated with the size of the semantic background. A tight neighborhood – fewer than 20 synonyms and/or other related words – did not outperform standard relative frequencies, whereas broader contextual information of ca. 50–100 related words showed a significant improvement over the baseline. In the case of Cosine Delta, a tight semantic background of ca. 5–10 words proved optimal, whereas broader spaces of 50–100 neighboring words were only marginally worse, still outperforming the baseline to a significant degree.

6. Conclusion

The paper presented a simple method to improve the performance in different stylometric setups. The method is conceptually straightforward and does not require any NLP tooling. The only external piece of information that is required is a list of semantically related words for each of the most frequent words in the corpus. A controlled experiment showed a significant improvement of classification accuracy in a supervised multi-class authorship attribution setup.

Acknowledgments

This research is part of the project Large-Scale Text Analysis and Methodological Foundations of Computational Stylistics (2017/26/E/HS2/01019), supported by Poland's National Science Centre.
The code and the datasets to replicate the experiments presented in this study are posted in a GitHub repository: https://github.com/computationalstylistics/word_frequencies.

References

[1] G. Borski and M. Kokowski. “Copernicus, his Latin style and comments to Commentariolus”. In: Studia Historiae Scientiarum 20 (2021), pp. 339–438. url: https://www.ejournals.eu/Studia-Historiae-Scientiarum/2021/20-2021/art/19754/.
[2] J. Burrows. “‘Delta’: a measure of stylistic difference and a guide to likely authorship”. In: Literary and Linguistic Computing 17.3 (2002), pp. 267–287.
[3] M. Eder. “Style-markers in authorship attribution: a cross-language study of the authorial fingerprint”. In: Studies in Polish Linguistics 6 (2011), pp. 99–114. url: http://www.ejournals.eu/SPL/2011/SPL-vol-6-2011.
[4] M. Eder, J. Rybicki, and M. Kestemont. “Stylometry with R: a package for computational text analysis”. In: R Journal 8.1 (2016), pp. 107–121. doi: 10.32614/rj-2016-007.
[5] S. Evert, T. Proisl, F. Jannidis, I. Reger, S. Pielström, C. Schöch, and T. Vitt. “Understanding and explaining Delta measures for authorship attribution”. In: Digital Scholarship in the Humanities 32 (suppl. 2 2017), pp. 4–16. doi: 10.1093/llc/fqx023.
[6] H. Gómez-Adorno, J.-P. Posadas-Durán, G. Sidorov, and D. Pinto. “Document embeddings learned on various types of n-grams for cross-topic authorship attribution”. In: Computing 100.7 (2018), pp. 741–756. doi: 10.1007/s00607-018-0587-8.
[7] J. W. Grieve. “Quantitative authorship attribution: An evaluation of techniques”. In: Literary and Linguistic Computing 22.3 (2007), pp. 251–270. doi: 10.1093/llc/fqm020.
[8] G. Hirst and O. Feiguina. “Bigrams of syntactic labels for authorship discrimination of short texts”. In: Literary and Linguistic Computing 22.4 (2007), pp. 405–417.
[9] P. Juola. “Thesaurus-based semantic similarity judgments”. In: Drawing Elena Ferrante's profile. Ed. by A. Tuzzi and M. A. Cortelazzo. Padova: Padova University Press, 2018, pp. 47–59.
[10] M. Koppel, N. Akiva, and I. Dagan. “Feature instability as a criterion for selecting potential style markers”. In: Journal of the American Society for Information Science and Technology 57.11 (2006), pp. 1519–1525.
[11] H. Love. Attributing authorship: An introduction. Cambridge: Cambridge University Press, 2002.
[12] A. Lučić and C. L. Blake. “A syntactic characterization of authorship style surrounding proper names”. In: Digital Scholarship in the Humanities 30.1 (2013), p. 53. doi: 10.1093/llc/fqt033.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. “Distributed representations of words and phrases and their compositionality”. In: Advances in Neural Information Processing Systems. 2013, pp. 3111–3119.
[14] G. A. Miller. “WordNet: A lexical database for English”. In: Communications of the ACM 38.11 (1995), pp. 39–41.
[15] F. Mosteller and D. Wallace. Inference and Disputed Authorship: The Federalist. Stanford: CSLI Publications, 1964.
[16] F. Peng, D. Schuurmans, V. Keselj, and S. Wang. “Language independent authorship attribution using character level language models”. In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics. 2003, pp. 267–274.
[17] J. Pennington, R. Socher, and C. D. Manning. “GloVe: Global vectors for word representation”. In: Empirical Methods in Natural Language Processing (EMNLP). 2014, pp. 1532–1543.
[18] E. Stamatatos. “A survey of modern authorship attribution methods”. In: Journal of the American Society for Information Science and Technology 60.3 (2009), pp. 538–556.
A. Function to compute enhanced word frequencies

The following code defines a function to compute the word frequencies as discussed in this paper. The code is written in generic R and does not require any external R library to run. The function takes three arguments:

(i) dtm_matrix is a document-term matrix, i.e. a table with raw frequencies (occurrences) of words in a given dataset; unlike in typical stylometric applications, where one usually takes a subset of the n most frequent words, here all the information about infrequent words is equally important;

(ii) word_vector_similarities is a table containing, for each word, its nearest neighbors in a semantic space; e.g. the row for the word “person” contains the words “woman”, “gentleman”, “man”, “one”, “sort”, “whom”, “thing”, “young”, etc.; it is sufficient to compute the neighbors for the 1,000 most frequent words or so, and the semantic depth can be reduced to, say, 100 semantically related words in each case (for the sake of the present study, a set of the 1,000 most frequent words with their 10,000 semantic neighbors was used); note that the function assumes the rows of this table follow the column order of dtm_matrix;

(iii) no_of_similar_words is a number (integer) of how many semantic neighbors one wants to take into consideration.

compute_subset_frequencies = function(dtm_matrix,
                                      word_vector_similarities,
                                      no_of_similar_words) {

    semantic_space = word_vector_similarities[ , 1:no_of_similar_words, drop = FALSE]
    no_of_words = dim(semantic_space)[1]
    final_frequency_matrix = matrix(nrow = dim(dtm_matrix)[1], ncol = no_of_words)

    for(i in 1:no_of_words) {

        # check if the required word(s) appear in the corpus
        words_sanitize = semantic_space[i, ] %in% colnames(dtm_matrix)
        words_to_compute = semantic_space[i, words_sanitize]

        # if the corpus doesn't contain any of the words required
        # by the model, then grab the most frequent word
        # for reference (it should not happen often, though)
        if(length(words_to_compute) == 0) {
            words_to_compute = colnames(dtm_matrix)[1]
        }

        # add the occurrences of the current word being computed;
        # e.g. for the word "of", add "of" to the equation
        words_to_compute = c(colnames(dtm_matrix)[i], words_to_compute)

        # getting the occurrences of the relevant words from
        # the input matrix of word occurrences
        f = dtm_matrix[ , words_to_compute]

        # finally, computing the new relative frequencies
        final_frequency_matrix[ , i] = f[ , 1] / rowSums(f)
    }

    # sanitizing again, by replacing NaN values with 0s
    final_frequency_matrix[is.nan(final_frequency_matrix)] = 0

    # tweaking the names of the rows and columns
    rownames(final_frequency_matrix) = rownames(dtm_matrix)
    colnames(final_frequency_matrix) = rownames(semantic_space)
    class(final_frequency_matrix) = "stylo.data"

    return(final_frequency_matrix)
}
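A hypothetical usage sketch follows; the toy document-term matrix and the neighbor table are invented for illustration (note how the rows of the neighbor table follow the column order of the document-term matrix, as the function assumes).

dtm = matrix(c(100, 40, 7, 1,
                90, 55, 2, 6),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("text_A", "text_B"),
                             c("the", "of", "on", "upon")))

# one row per seed word, columns = its nearest semantic neighbors,
# ordered from the most to the least similar
neighbors = matrix(c("of",   "on",
                     "the",  "on",
                     "upon", "the",
                     "on",   "the"),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c("the", "of", "on", "upon"), NULL))

enhanced = compute_subset_frequencies(dtm, neighbors, no_of_similar_words = 1)
enhanced   # e.g. the frequency of "on" in text_A: 7 / (7 + 1) = 0.875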