Ordinal measures in authorship identification∗
                    Liviu P. Dinu                     Marius Popescu
          University of Bucharest, Faculty of University of Bucharest, Faculty of
          Mathematics and Computer Science, Mathematics and Computer Science,
          14 Academiei, Bucharest, Romania, 14 Academiei, Bucharest, Romania,
              ldinu@funinf.cs.unibuc.ro        mpopescu@phobos.cs.unibuc.ro

         Abstract: The goal of this paper is to compare a set of distance/similarity measures,
         regarding their ability to reflect stylistic similarity between authors and texts. To
         assess the ability of these distance/similarity functions to capture stylistic similarity
         between texts, we tested them in one of the most frequently employed multivariate
         statistical analysis settings: cluster analysis. The experiments are done on a corpus
         of 30 English books written by British, American and Australian writers.
         Keywords: authorship identification, ordinal measures


∗ Research supported by CNCSIS, PN2-Idei project 228.

Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 62-66, 2009.

1   Introduction
The authorship identification problem is an ancient and omnipresent
challenge, and almost every culture has many disputed works
(Shakespeare's plays, Moliere vs. Corneille (Labbe and Labbe, 2001), the
Federalist Papers (Mosteller and Wallace, 2007), etc.). The problem of
authorship identification is based on the assumption that there are
stylistic features that help distinguish the real author from any other
possibility. Literary-linguistic research is limited by the human capacity
to analyze and combine only a small number of text parameters when trying
to solve the authorship problem. Computational methods overcome this
limitation: they allow us to explore various text parameters and
characteristics and their combinations. Using these methods, van Halteren
et al. (2005) have shown that every writer has a unique fingerprint
regarding language use. The set of language use characteristics
(stylistic, lexical, syntactic) forms the human stylome.
    Because all computational stylistic studies involve, in one way or
another, a comparison of two or more texts, there has always been a need
for a distance/similarity function that measures the similarity (or
dissimilarity) of texts from the stylistic point of view. These measures
vary a lot, and in recent years a series of different techniques have been
used in authorship identification: approaches based on string kernels
(Dinu et al., 2008), SVMs based on function word frequencies (Koppel et
al., 2007), standard distances or ordinal distances (Popescu and Dinu,
2008).
    The goal of this paper is to compare a set of distance/similarity
measures, regarding their ability to reflect stylistic similarity between
texts.
    As style markers we have used function word frequencies. Function
words are generally considered good indicators of style because their use
is very unlikely to be under the conscious control of the author and
because of their psychological and cognitive role (Chung and Pennebaker,
2007). Function words have also proved to be very effective in many author
attribution studies.
    The distance/similarity between two texts will be measured as the
distance/similarity between the function word frequencies corresponding to
the respective texts. For this study we selected several
similarity/distance measures. We started with the most natural ones:
Euclidean distance and (taking into account the statistical nature of the
data) Pearson's correlation coefficient. Since function word frequencies
can also be viewed as ordinal variables, we also considered for comparison
some measures specific to ordinal data: Spearman's rank-order coefficient,
Spearman's footrule, Goodman and Kruskal's gamma, and Kendall's tau.
    To assess the ability of these distance/similarity functions to capture
stylistic similarity between texts, we have tested them in one of the most
frequently employed multivariate statistical analysis settings: cluster
analysis. Clustering is a very good test bed for the behavior of a
distance/similarity measure. We plugged the distance/similarity measures
selected for comparison into a standard hierarchical clustering algorithm
and applied it to a collection of 30 nineteenth century English books. The
family trees thus obtained revealed a lot about the behavior of the
distance/similarity measures.
    The main finding of our comparison is that the similarity measures
that treat function word frequencies as ordinal variables performed better
than the other distance/similarity measures. Treating function word
frequencies as ordinal variables means that the calculation of the
distance/similarity function uses the ranks of the function words
according to their frequencies in the text rather than the actual values
of these frequencies. Using the ranking of function words instead of the
actual values of the frequencies may seem a loss of information, but we
consider that the process of ranking makes the distance/similarity measure
more robust: it acts as a filter, eliminating the noise contained in the
values of the frequencies. The fact that a specific function word has rank
2 (is the second most frequent word) in one text and rank 4 (is the fourth
most frequent word) in another text can be more relevant than the fact
that the respective word has a relative frequency of 34% in the first text
and only 29% in the second.
    In the next section we present the distance/similarity measures
involved in the comparison study, Section 3 briefly describes the cluster
analysis, and Sections 4 and 5 present the experiments, the results
obtained, and suggestions for future work.

2   Similarity Measures
If we treat texts as random variables whose values are the frequencies of
different words in the respective texts, then various statistical
correlation measures can be used as similarity measures between those
texts. For two texts $X$ and $Y$ and a fixed set of words
$\{w_1, w_2, \ldots, w_n\}$, let us denote by $x_1$ the relative frequency
of $w_1$ in $X$, by $y_1$ the relative frequency of $w_1$ in $Y$, and so
on, by $x_n$ the relative frequency of $w_n$ in $X$ and by $y_n$ the
relative frequency of $w_n$ in $Y$.
    Pearson's correlation coefficient is:

$$ r = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1} $$

where $\bar{x}$ is the mean of $X$, $\bar{y}$ the mean of $Y$, and $s_x$
and $s_y$ are the standard deviations of $X$ and $Y$, respectively (Upton
and Cook, 2008). The correlation coefficient measures the tendency of two
variables to change in value together (i.e., to either increase or
decrease). $r$ is related to the Euclidean distance: $\sqrt{2(1 - r)}$ is,
up to the constant factor $\sqrt{n - 1}$, the Euclidean distance between
the standardized versions of $X$ and $Y$.
    The random variables $X$, $Y$ representing texts can also be treated
as ordinal data, in which data is ordered but cannot be assumed to have
equal distance between values. In this case the values of $X$ (and
respectively $Y$) will be the ranks of the words
$\{w_1, w_2, \ldots, w_n\}$ according to their frequencies in text $X$
(respectively $Y$) rather than the actual values of these frequencies. The
most common correlation statistic for ordinal data is Spearman's
rank-order coefficient (Upton and Cook, 2008):

$$ r_{sc} = 1 - \frac{6}{n(n^2 - 1)} \sum_{i=1}^{n} (x_i - y_i)^2 $$

Note that, this time, $x_i$ and $y_i$ are ranks; in fact, Spearman's
rank-order coefficient is Pearson's correlation coefficient applied to
ranks. Spearman's footrule is the $l_1$ version of Spearman's rank-order
coefficient:

$$ r_{sf} = 1 - \frac{3}{n^2 - 1} \sum_{i=1}^{n} |x_i - y_i| $$

    Another set of correlation statistics for ordinal data is based on the
number of concordant and discordant pairs among two variables. The number
of concordant pairs among two variables $X$ and $Y$ is
$P = |\{(i, j) : 1 \le i < j \le n, (x_i - x_j)(y_i - y_j) > 0\}|$.
Similarly, the number of discordant pairs is
$Q = |\{(i, j) : 1 \le i < j \le n, (x_i - x_j)(y_i - y_j) < 0\}|$.
    Goodman and Kruskal's gamma (Upton and Cook, 2008) is defined as:

$$ \gamma = \frac{P - Q}{P + Q} $$

    Kendall developed several slightly different types of ordinal
correlation as alternatives to gamma. Kendall's tau-a (Upton and
Cook, 2008) is based on the number of concordant versus discordant pairs,
divided by a measure based on the total number of pairs ($n$ = the sample
size):

$$ \tau_a = \frac{P - Q}{\frac{n(n - 1)}{2}} $$

    Kendall's tau-b (Upton and Cook, 2008) is a similar measure of
association based on concordant and discordant pairs, adjusted for the
number of ties in ranks. It is calculated as $(P - Q)$ divided by the
geometric mean of the number of pairs not tied on $X$ and the number of
pairs not tied on $Y$:

$$ \tau_b = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}} $$

where $X_0$ is the number of pairs tied only on $X$ and $Y_0$ the number
of pairs tied only on $Y$, so that $P + Q + Y_0$ pairs are not tied on $X$
and $P + Q + X_0$ pairs are not tied on $Y$.
    The three correlation statistics above are closely related: if $n$ is
fixed and $X$ and $Y$ have no ties, then $P$, $X_0$ and $Y_0$ are
completely determined by $n$ and $Q$ ($P = n(n - 1)/2 - Q$ and
$X_0 = Y_0 = 0$).

3   Clustering Analysis
An agglomerative hierarchical clustering algorithm (Duda et al., 2001)
arranges a set of objects in a family tree (dendrogram) according to their
similarity, which in turn is given by a distance function defined on the
set of objects. The algorithm initially assigns each object to its own
cluster and then repeatedly merges pairs of clusters until the whole tree
is formed. At each step the pair of nearest clusters is selected for
merging. The various agglomerative hierarchical clustering algorithms
differ in the way in which they measure the distance between clusters.
Note that although a distance function between objects exists, the
distance measure between clusters (sets of objects) remains to be defined.
In our experiments we used the complete linkage distance between clusters:
the maximum of the distances between all pairs of objects drawn from the
two clusters (one object from the first cluster, the other from the
second).

4   Experiments
In Popescu and Dinu (2009) we compared the set of distance/similarity
measures described here on a collection of 21 nineteenth century English
books written by 10 different authors and spanning a variety of genres
(the same set of books was used by Koppel et al. (2007) in their
authorship verification experiments). The experiments showed that the
similarity measures that treat function word frequencies as ordinal
variables (Spearman's rank-order coefficient, Spearman's footrule, Goodman
and Kruskal's gamma, Kendall's tau) performed better than the
distance/similarity measures that use the actual values of function word
frequencies (Euclidean distance, Pearson's correlation coefficient).
    The aim of the present experiments was twofold. First, we wanted to
see whether the findings in Popescu and Dinu (2009) are confirmed on a
larger set (more authors, more books). Second, we wanted to further
investigate the ability of some of the similarity measures (Spearman's
rank-order coefficient, Goodman and Kruskal's gamma, Kendall's tau) to
distinguish between the different nationalities of English language
writers, by adding to the data set works of Australian writers from the
same period. To the original data set of Koppel et al. (2007) we added 9
works of three Australian authors from the same period, resulting in a
data set of 30 books and 13 authors (Table 1).

Group                 Author          Book
American Novelists    Hawthorne       Dr. Grimshawe's Secret
                                      House of Seven Gables
                      Melville        Redburn
                                      Moby Dick
                      Cooper          The Last of the Mohicans
                                      The Spy
                                      Water Witch
American Essayists    Thoreau         Walden
                                      A Week on Concord
                      Emerson         Conduct Of Life
                                      English Traits
British Playwrights   Shaw            Pygmalion
                                      Misalliance
                                      Getting Married
                      Wilde           An Ideal Husband
                                      Woman of No Importance
Bronte Sisters        Anne            Agnes Grey
                                      Tenant Of Wildfell Hall
                      Charlotte       The Professor
                                      Jane Eyre
                      Emily           Wuthering Heights
Australian Novelists  B. Baynton      Bush Studies
                                      Human Toll
                      Henry Lawson    Joe Wilson and His Mates
                                      On the Track
                                      While the Billy Boils
                      Miles Franklin  My Brilliant Career
                                      Some Everyday Folk and Dawn
                                      Up the Country: A Saga of...
                                      Back to Bool Bool

              Table 1: The books used in the experiments

    To perform the experiments, a set of words must be fixed. The most
frequent function words may be selected, or other criteria may be used for
selection. In all our experiments we used the set of function words
identified by Mosteller and Wallace (2007) as good candidates for
author-attribution studies. We used the agglomerative hierarchical
clustering algorithm coupled with each of the distance/similarity
functions employed in the comparison to cluster the works in Table 1.
    The dendrograms obtained support the results of Popescu and Dinu
(2009). The resulting dendrograms for Euclidean distance and Pearson's
correlation coefficient (not shown because of lack of space) are very
similar, which is no surprise taking into account the close relation
between the two measures (see Section 2). The problem with these family
trees is that the works of Melville are not grouped together: one (Moby
Dick) is clustered with the essays of Thoreau and the other (Redburn) with
the novels of Hawthorne. Also, "My Brilliant Career" by M. Franklin is
clustered with the novels of Charlotte Bronte. Apart from the authorship
relation, the dendrograms reflect no other stylistic relation between the
works (such as grouping the works according to genre or nationality of the
authors: American / English / Australian).
    Spearman's rank-order coefficient, Goodman and Kruskal's gamma and
Kendall's tau produced the same dendrogram (modulo the scale). Figure 1
shows the dendrogram for Kendall's tau. The dendrogram is perfect with
respect to authorship: all works are clustered according to their author.
The nationality of the authors is not reflected in the dendrogram (the
authors with the same nationality are not clustered together).
    We performed a series of experiments to test in which cases the
nationality of the authors can be revealed by a stylistic similarity
measure. If only British and Australian writers are selected, Kendall's
tau produces the dendrogram presented in Figure 2. As can be seen, the
first two branches correspond to the nationality of the authors: British
writers on the upper branch, Australian writers on the lower branch. The
same thing happens when British and American writers are selected. Again,
the writers are clustered according to their nationality: this time, the
British writers on the lower branch and the American writers on the upper
branch. But when the subset of American and Australian writers is
clustered using Kendall's tau, the nationality of the writers is no longer
reflected in the family tree produced. The works of each author are
clustered together, but there are no clear branches corresponding to the
two nationalities.

5   Future Work
In this paper we have compared a set of measures, regarding their ability
to reflect stylistic similarity between texts. In future work it would be
interesting to compare these measures to other possible similarity
measures. If the frequencies of different words in the texts are treated
as probability distributions instead of as random variables, specific
measures can be applied, such as Kullback-Leibler divergence or cross
entropy.

References
C. K. Chung and J. W. Pennebaker. 2007. The psychological function of
   function words. In K. Fiedler, ed., Social communication: Frontiers of
   social psychology, 343-359. Psychology Press, New York.
L. P. Dinu, M. Popescu, and A. Dinu. 2008. Authorship Identification of
   Romanian Texts with Controversial Paternity. Proc. LREC 2008,
   Marrakech, Morocco.
R. O. Duda, P. E. Hart, and D. G. Stork. 2001. Pattern Classification
   (2nd ed.). Wiley-Interscience Publication.
H. van Halteren, M. Haverkort, H. Baayen, A. Neijt, and F. Tweedie. 2005.
   New machine learning methods demonstrate the existence of a human
   stylome. Journal of Quantitative Linguistics, 12:65-77.
M. Koppel, J. Schler, and E. Bonchek-Dokow. 2007. Measuring
   differentiability: Unmasking pseudonymous authors. J. of Machine
   Learning Research, 8:1261-1276.
C. Labbe and D. Labbe. 2006. A tool for literary studies: Intertextual
   distance and tree classification. Literary and Linguistic Computing,
   21(3):311-326.
F. Mosteller and D. L. Wallace. 2007. Inference and Disputed Authorship:
   The Federalist. CSLI Publications, Stanford.
M. Popescu and L. P. Dinu. 2008. Rank Distance as a Stylistic Similarity.
   Proceedings COLING 2008, Manchester, UK.
M. Popescu and L. P. Dinu. 2009. Comparing Statistical Similarity Measures
   for Stylistic Multivariate Analysis. Proceedings RANLP 2009, Borovets,
   Bulgaria.
G. Upton and I. Cook. 2008. A Dictionary of Statistics. Oxford Univ.
   Press, Oxford.


      Figure 1: Dendrogram of 30 nineteenth century English books (Kendall's tau)
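
The rank-based measures of Section 2 can be made concrete with a small
sketch. The Python code below is only an illustration and not the
implementation used in the experiments: the function names (to_ranks,
spearman, footrule, pair_counts, gamma, tau_a) and the five relative
frequencies are invented for the example, and giving tied frequencies
their average rank is an assumption that the paper does not state.

```python
from itertools import combinations


def to_ranks(freqs):
    """Rank 1 = most frequent word; tied frequencies share their average rank
    (the tie handling is an assumption, not taken from the paper)."""
    order = sorted(range(len(freqs)), key=lambda i: -freqs[i])
    ranks = [0.0] * len(freqs)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and freqs[order[end + 1]] == freqs[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank-order coefficient r_sc on two rank vectors."""
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(x, y)) / (n * (n * n - 1))


def footrule(x, y):
    """Spearman's footrule r_sf (the l1 version of r_sc)."""
    n = len(x)
    return 1 - 3 * sum(abs(a - b) for a, b in zip(x, y)) / (n * n - 1)


def pair_counts(x, y):
    """Return (P, Q), the numbers of concordant and discordant pairs."""
    p = q = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            p += 1
        elif s < 0:
            q += 1
    return p, q


def gamma(x, y):
    """Goodman and Kruskal's gamma (undefined when P + Q = 0)."""
    p, q = pair_counts(x, y)
    return (p - q) / (p + q)


def tau_a(x, y):
    """Kendall's tau-a."""
    p, q = pair_counts(x, y)
    n = len(x)
    return (p - q) / (n * (n - 1) / 2)


# Invented relative frequencies of five function words in two texts.
text_x = [0.34, 0.21, 0.18, 0.15, 0.12]
text_y = [0.29, 0.17, 0.22, 0.20, 0.12]
rx, ry = to_ranks(text_x), to_ranks(text_y)

print(rx, ry)
print(spearman(rx, ry), footrule(rx, ry), gamma(rx, ry), tau_a(rx, ry))
```

On these toy data the ranks are (1, 2, 3, 4, 5) and (1, 4, 2, 3, 5), so
P = 8 and Q = 2. Because there are no ties, gamma and tau-a coincide, as
noted at the end of Section 2.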


          Figure 2: Dendrogram of British and Australian writers (Kendall's tau)
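
Dendrograms such as those in Figures 1 and 2 could be obtained along the
following lines. This is a minimal sketch of complete linkage
agglomerative clustering as described in Section 3, under two assumptions
that the paper does not confirm: the distance between two texts is taken
to be 1 minus Kendall's tau-a, and ties between equally near cluster pairs
are broken by taking the first pair found. The names kendall_tau_a and
complete_linkage, the text labels and the rank vectors are all
hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (P - Q) divided by the total number of pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tied
    return s / (n * (n - 1) / 2)


def complete_linkage(items, dist):
    """Merge the two nearest clusters until one remains.

    The distance between two clusters is the maximum distance between any
    pair of objects drawn one from each cluster. Returns the list of merges
    as (cluster_a, cluster_b, distance) tuples, in merge order.
    """
    clusters = [(name,) for name in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = max(dist(p, q) for p in a for q in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Hypothetical rank vectors of five function words in four texts
# (two texts by an author "A", two by an author "B").
ranks = {
    "A1": [1, 2, 3, 4, 5],
    "A2": [1, 2, 3, 5, 4],
    "B1": [5, 4, 3, 2, 1],
    "B2": [4, 5, 3, 2, 1],
}


def distance(p, q):
    # Assumed conversion from similarity to distance: 1 - tau_a.
    return 1 - kendall_tau_a(ranks[p], ranks[q])


merges = complete_linkage(list(ranks), distance)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

On these toy data the two texts of each author are merged first, and the
two author clusters are joined only in the last step, which is the
behavior that the paper calls a perfect dendrogram.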