Ordinal measures in authorship identification∗

Liviu P. Dinu
University of Bucharest, Faculty of Mathematics and Computer Science, 14 Academiei, Bucharest, Romania
ldinu@funinf.cs.unibuc.ro

Marius Popescu
University of Bucharest, Faculty of Mathematics and Computer Science, 14 Academiei, Bucharest, Romania
mpopescu@phobos.cs.unibuc.ro

∗ Research supported by CNCSIS, PN2-Idei project 228.

Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 62-66, 2009.

Abstract: The goal of this paper is to compare a set of distance/similarity measures with regard to their ability to reflect stylistic similarity between authors and texts. To assess the ability of these distance/similarity functions to capture stylistic similarity between texts, we tested them in one of the most frequently employed multivariate statistical analysis settings: cluster analysis. The experiments are done on a corpus of 30 English books written by British, American and Australian writers.

Keywords: authorship identification, ordinal measures

1 Introduction

The authorship identification problem is an ancient and omnipresent challenge, and almost every culture has many disputed works (Shakespeare's plays, Moliere vs. Corneille (Labbe and Labbe, 2001), the Federalist Papers (Mosteller and Wallace, 2007), etc.). The problem of authorship identification is based on the assumption that there are stylistic features that help distinguish the real author from any other possibility. Literary-linguistic research on the authorship problem is limited by the human capacity to analyze and combine only a small number of text parameters. Computational methods overcome this limitation: they allow us to explore many text parameters and characteristics and their combinations. Using these methods, van Halteren et al. (2005) have shown that every writer has a unique fingerprint regarding language use. The set of language use characteristics (stylistic, lexical, syntactic) forms the human stylome.

Because all computational stylistic studies involve, in one way or another, a comparison of two or more texts, there has always been a need for a distance/similarity function that measures the similarity (or dissimilarity) of texts from the stylistic point of view. These measures vary a lot, and in recent years a series of different techniques have been used in authorship identification: approaches based on string kernels (Dinu et al., 2008), SVMs based on function word frequencies (Koppel et al., 2007), and standard distances or ordinal distances (Popescu and Dinu, 2008).

The goal of this paper is to compare a set of distance/similarity measures with regard to their ability to reflect stylistic similarity between texts.

As style markers we used function word frequencies. Function words are generally considered good indicators of style because their use is very unlikely to be under the conscious control of the author and because of their psychological and cognitive role (Chung and Pennebaker, 2007). Function words have also proved very effective in many author attribution studies.

The distance/similarity between two texts is measured as the distance/similarity between the function word frequencies of the respective texts. We started with the most natural distance/similarity measures: the Euclidean distance and (taking into account the statistical nature of the data) Pearson's correlation coefficient. Since function word frequencies can also be viewed as ordinal variables, we also considered for comparison some measures specific to ordinal data: Spearman's rank-order coefficient, Spearman's footrule, Goodman and Kruskal's gamma, and Kendall's tau.

To assess the ability of these distance/similarity functions to capture stylistic similarity between texts, we tested them in one of the most frequently employed multivariate statistical analysis settings: cluster analysis. Clustering is a very good test bed for the behavior of a distance/similarity measure. We plugged the distance/similarity measures selected for comparison into a standard hierarchical clustering algorithm and applied it to a collection of 30 nineteenth century English books. The family trees thus obtained revealed a lot about the behavior of the distance/similarity measures.

The main finding of our comparison is that the similarity measures that treat function word frequencies as ordinal variables performed better than the other distance/similarity measures. Treating function word frequencies as ordinal variables means that the distance/similarity function is computed from the ranks of the function words according to their frequencies in the text rather than from the actual values of these frequencies. Using the ranking instead of the actual frequency values may seem to be a loss of information, but we consider that ranking makes the distance/similarity measure more robust: it acts as a filter that eliminates the noise contained in the frequency values. The fact that a specific function word has rank 2 (is the second most frequent word) in one text and rank 4 (is the fourth most frequent word) in another text can be more relevant than the fact that the word has a relative frequency of 34% in the first text and only 29% in the second.

The next section presents the distance/similarity measures involved in the comparison study, Section 3 briefly describes the cluster analysis, and Sections 4 and 5 present the experiments, the results obtained, and suggestions for future work.

2 Similarity Measures

If we treat texts as random variables whose values are the frequencies of different words in the respective texts, then various statistical correlation measures can be used as similarity measures between those texts. For two texts $X$ and $Y$ and a fixed set of words $\{w_1, w_2, \ldots, w_n\}$, let $x_i$ denote the relative frequency of $w_i$ in $X$ and $y_i$ the relative frequency of $w_i$ in $Y$, for $i = 1, \ldots, n$.

Pearson's correlation coefficient is:

$$ r = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1} $$

where $\bar{x}$ is the mean of $X$, $\bar{y}$ the mean of $Y$, and $s_x$ and $s_y$ are the standard deviations of $X$ and $Y$, respectively (Upton and Cook, 2008). The correlation coefficient measures the tendency of two variables to change in value together (i.e., to both increase or both decrease). $r$ is related to the Euclidean distance: $\sqrt{2(1 - r)}$ is proportional to the Euclidean distance between the standardized versions of $X$ and $Y$ (with $s_x$ and $s_y$ as above, the factor is $\sqrt{n - 1}$).

The random variables $X$, $Y$ representing texts can also be treated as ordinal data, in which data is ordered but cannot be assumed to have equal distance between values. In this case the values of $X$ (and respectively $Y$) are the ranks of the words $\{w_1, w_2, \ldots, w_n\}$ according to their frequencies in text $X$ (respectively $Y$) rather than the actual values of these frequencies. The most common correlation statistic for ordinal data is Spearman's rank-order coefficient (Upton and Cook, 2008):

$$ r_{sc} = 1 - \frac{6}{n(n^2 - 1)} \sum_{i=1}^{n} (x_i - y_i)^2 $$

Note that this time $x_i$, $y_i$ are ranks; in fact, Spearman's rank-order coefficient is Pearson's correlation coefficient applied to ranks. Spearman's footrule is the $l_1$ version of Spearman's rank-order coefficient:

$$ r_{sf} = 1 - \frac{3}{n^2 - 1} \sum_{i=1}^{n} |x_i - y_i| $$

Another set of correlation statistics for ordinal data is based on the number of concordant and discordant pairs among two variables. The number of concordant pairs among two variables $X$ and $Y$ is $P = |\{(i, j) : 1 \le i < j \le n, (x_i - x_j)(y_i - y_j) > 0\}|$. Similarly, the number of discordant pairs is $Q = |\{(i, j) : 1 \le i < j \le n, (x_i - x_j)(y_i - y_j) < 0\}|$.

Goodman and Kruskal's gamma (Upton and Cook, 2008) is defined as:

$$ \gamma = \frac{P - Q}{P + Q} $$

Kendall developed several slightly different types of ordinal correlation as alternatives to gamma. Kendall's tau-a (Upton and Cook, 2008) is based on the number of concordant versus discordant pairs, divided by the total number of pairs ($n$ = the sample size):

$$ \tau_a = \frac{P - Q}{\frac{n(n-1)}{2}} $$

Kendall's tau-b (Upton and Cook, 2008) is a similar measure of association based on concordant and discordant pairs, adjusted for the number of ties in ranks. It is calculated as $(P - Q)$ divided by the geometric mean of the number of pairs not tied on $X$ and the number of pairs not tied on $Y$. Writing $X_0$ for the number of pairs tied only on $X$ and $Y_0$ for the number of pairs tied only on $Y$, these two counts are $P + Q + Y_0$ and $P + Q + X_0$, so that:

$$ \tau_b = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}} $$

The three statistics above are closely related: if $n$ is fixed and $X$ and $Y$ have no ties, then $P$, $X_0$ and $Y_0$ are completely determined by $n$ and $Q$ (namely $X_0 = Y_0 = 0$ and $P = n(n-1)/2 - Q$), and the three statistics coincide.

3 Clustering Analysis

An agglomerative hierarchical clustering algorithm (Duda et al., 2001) arranges a set of objects in a family tree (dendrogram) according to their similarity, which in turn is given by a distance function defined on the set of objects. The algorithm initially assigns each object to its own cluster and then repeatedly merges pairs of clusters until the whole tree is formed. At each step the pair of nearest clusters is selected for merging. Agglomerative hierarchical clustering algorithms differ in the way they measure the distance between clusters. Note that although a distance function between objects exists, the distance between clusters (sets of objects) remains to be defined. In our experiments we used the complete linkage distance between clusters: the maximum of the distances between all pairs of objects drawn from the two clusters (one object from the first cluster, the other from the second).

4 Experiments

In Popescu and Dinu (2009) we compared the set of distance/similarity measures described here on a collection of 21 nineteenth century English books written by 10 different authors and spanning a variety of genres (the same set of books was used by Koppel et al. (2007) in their authorship verification experiments). Those experiments showed that the similarity measures that treat function word frequencies as ordinal variables (Spearman's rank-order coefficient, Spearman's footrule, Goodman and Kruskal's gamma, Kendall's tau) performed better than the distance/similarity measures that use the actual values of the function word frequencies (Euclidean distance, Pearson's correlation coefficient).

The aim of the present experiments was twofold. First, we wanted to see whether the findings of Popescu and Dinu (2009) are confirmed on a larger set (more authors, more books). Second, we wanted to further investigate the ability of some of the similarity measures (Spearman's rank-order coefficient, Goodman and Kruskal's gamma, Kendall's tau) to distinguish between English language writers of different nationalities. To the original data set of Koppel et al. (2007) we therefore added 9 works of three Australian authors from the same period, resulting in a data set of 30 books and 13 authors (Table 1).

| Group | Author | Books |
|---|---|---|
| American Novelists | Hawthorne | Dr. Grimshawe's Secret; House of Seven Gables |
| American Novelists | Melville | Redburn; Moby Dick |
| American Novelists | Cooper | The Last of the Mohicans; The Spy; Water Witch |
| American Essayists | Thoreau | Walden; A Week on Concord |
| American Essayists | Emerson | Conduct Of Life; English Traits |
| British Playwrights | Shaw | Pygmalion; Misalliance; Getting Married |
| British Playwrights | Wilde | An Ideal Husband; Woman of No Importance |
| Bronte Sisters | Anne | Agnes Grey; Tenant Of Wildfell Hall |
| Bronte Sisters | Charlotte | The Professor; Jane Eyre |
| Bronte Sisters | Emily | Wuthering Heights |
| Australian Novelists | B. Baynton | Bush Studies; Human Toll |
| Australian Novelists | Henry Lawson | Joe Wilson and His Mates; On the Track; While the Billy Boils |
| Australian Novelists | Miles Franklin | My Brilliant Career; Some Everyday Folk and Dawn; Up the Country: A Saga of...; Back to Bool Bool |

Table 1: The books used in experiments

To perform the experiments, a set of words must be fixed. The most frequent function words may be selected, or other selection criteria may be used. In all our experiments we used the set of function words identified by Mosteller and Wallace (2007) as good candidates for author-attribution studies. We used the agglomerative hierarchical clustering algorithm, coupled with each of the distance/similarity functions in the comparison, to cluster the works in Table 1.

The dendrograms obtained support the results of Popescu and Dinu (2009). The dendrograms for the Euclidean distance and Pearson's correlation coefficient (not shown for lack of space) are very similar, which is no surprise given the close relation between the two measures (see Section 2). The problem with these family trees is that the works of Melville are not grouped together: one (Moby Dick) is clustered with the essays of Thoreau and the other with the novels of Hawthorne. Also, "My Brilliant Career" by M. Franklin is clustered with the novels of Charlotte Bronte. Apart from the authorship relation, the dendrograms reflect no other stylistic relation between the works (such as a grouping of the works according to genre or to the nationality of the authors: American / English / Australian).

Spearman's rank-order coefficient, Goodman and Kruskal's gamma and Kendall's tau produced the same dendrogram (modulo the scale). Figure 1 shows the dendrogram for Kendall's tau. The dendrogram is perfect: all works are clustered according to their author. The nationality of the authors is not reflected in the dendrogram (authors with the same nationality are not clustered together).

Figure 1: Dendrogram of 30 nineteenth century English books (Kendall's tau)

We performed a series of experiments to test in which cases the nationality of the authors can be revealed by a stylistic similarity measure. If only British and Australian writers are selected, Kendall's tau produces the dendrogram presented in Figure 2. As can be seen, the first two branches correspond to the nationality of the authors: British writers on the upper branch, Australian writers on the lower branch. The same thing happens when British and American writers are selected. Again, the writers are clustered according to their nationality: this time, the British writers on the lower branch and the American writers on the upper branch. But when the subset of American and Australian writers is clustered using Kendall's tau, the nationality of the writers is no longer reflected in the family tree produced. The works of each author are clustered together, but there are no clear branches corresponding to the two nationalities.

Figure 2: Dendrogram of British and Australian writers (Kendall's tau)

5 Future Work

In this paper we have compared a set of measures with regard to their ability to reflect stylistic similarity between texts. In future work it would be interesting to compare these measures to other possible similarity measures. If the frequencies of different words in the texts are treated as probability distributions instead of as random variables, specific measures can be applied: Kullback-Leibler divergence or cross entropy.

References

C. K. Chung and J. W. Pennebaker. 2007. The psychological function of function words. In K. Fiedler, ed., Social communication: Frontiers of social psychology, 343-359. Psychology Press, New York.

L. P. Dinu, M. Popescu, and A. Dinu. 2008. Authorship Identification of Romanian Texts with Controversial Paternity. Proceedings of LREC 2008, Marrakech, Morocco.

R. O. Duda, P. E. Hart, and D. G. Stork. 2001. Pattern Classification (2nd ed.). Wiley-Interscience Publication.

H. van Halteren, M. Haverkort, H. Baayen, A. Neijt, and F. Tweedie. 2005. New machine learning methods demonstrate the existence of a human stylome. Journal of Quantitative Linguistics, 12:65-77.

M. Koppel, J. Schler, and E. Bonchek-Dokow. 2007. Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research, 8:1261-1276.

C. Labbe and D. Labbe. 2006. A tool for literary studies: Intertextual distance and tree classification. Literary and Linguistic Computing, 21(3):311-326.

F. Mosteller and D. L. Wallace. 2007. Inference and Disputed Authorship: The Federalist. CSLI Publications, Stanford.

M. Popescu and L. P. Dinu. 2008. Rank Distance as a Stylistic Similarity. Proceedings of COLING 2008, Manchester, UK.

M. Popescu and L. P. Dinu. 2009. Comparing Statistical Similarity Measures for Stylistic Multivariate Analysis. Proceedings of RANLP 2009, Borovets, Bulgaria.

G. Upton and I. Cook. 2008. A Dictionary of Statistics. Oxford Univ. Press, Oxford.
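The ordinal measures of Section 2 and the complete-linkage clustering of Section 3 can be illustrated with a minimal sketch in Python (standard library only). It is not the implementation used in the experiments: the function names, the conversion of a similarity into a distance as 1 - tau-b, and the four toy frequency vectors (two hypothetical texts for each of two hypothetical authors, over five unnamed function words) are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def ranks(freqs):
    """Rank 1 = most frequent word; tied frequencies share the average rank."""
    order = sorted(range(len(freqs)), key=lambda i: -freqs[i])
    result = [0.0] * len(freqs)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and freqs[order[end + 1]] == freqs[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


def spearman(x, y):
    """Spearman's rank-order coefficient r_sc (this closed form assumes no ties)."""
    rx, ry, n = ranks(x), ranks(y), len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (n * (n * n - 1))


def footrule(x, y):
    """Spearman's footrule r_sf, the l1 analogue of r_sc."""
    rx, ry, n = ranks(x), ranks(y), len(x)
    return 1 - 3 * sum(abs(a - b) for a, b in zip(rx, ry)) / (n * n - 1)


def sign(v):
    return (v > 0) - (v < 0)


def pair_counts(x, y):
    """Return (P, Q, X0, Y0): concordant, discordant, tied only on x, tied only on y.

    Only the order of the values matters, so raw frequencies can be passed directly.
    """
    p = q = x0 = y0 = 0
    for i, j in combinations(range(len(x)), 2):
        sx, sy = sign(x[i] - x[j]), sign(y[i] - y[j])
        if sx * sy > 0:
            p += 1
        elif sx * sy < 0:
            q += 1
        elif sx == 0 and sy != 0:
            x0 += 1
        elif sy == 0 and sx != 0:
            y0 += 1
    return p, q, x0, y0


def kendall_tau_b(x, y):
    p, q, x0, y0 = pair_counts(x, y)
    return (p - q) / sqrt((p + q + x0) * (p + q + y0))


def complete_linkage(names, dist):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance)."""
    def linkage(a, b):  # complete linkage: the farthest pair across the two clusters
        return max(dist(i, j) for i in a for j in b)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy relative frequencies of five function words (made-up numbers, not corpus data).
texts = {
    "A1": [0.060, 0.030, 0.025, 0.010, 0.005],
    "A2": [0.055, 0.032, 0.020, 0.012, 0.004],
    "B1": [0.020, 0.045, 0.010, 0.030, 0.015],
    "B2": [0.018, 0.050, 0.012, 0.028, 0.016],
}


def tau_distance(a, b):
    # Assumption: a similarity s is turned into a distance as 1 - s.
    return 1 - kendall_tau_b(texts[a], texts[b])


merges = complete_linkage(list(texts), tau_distance)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 3))
```

In this toy data the two texts of each hypothetical author differ in their frequency values but rank the five words identically, so their ordinal distance is zero and they merge first; the two author clusters are joined only in the last step. This mirrors, on a very small scale, the filtering effect of ranking described in the Introduction.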